This blog post is part of a series on using the “has data avail” filter on PubMed Central (PMC) to identify a wide range of institutional datasets and what we at NYU learned about our institution’s data sharing practices from this exercise. Second in this series, this post discusses articles that state all relevant data are included in the Supporting Information files.
Since the early days of the Data Catalog, we have experimented with different ways to locate institutional datasets suitable for indexing. Recently, with the help of the folks at the National Library of Medicine (NLM), we were able to create a new workflow for locating data. In a series of blog posts, we will be writing about our experiences using the “has data avail” filter on PubMed Central (PMC) to identify a wide range of institutional datasets as well as what we learned about our institution’s data sharing practices from this exercise.
This series will be split into five blog posts:
This introductory post
An examination of data found in supporting information files
An examination of data available through a consortium
An examination of data available through a repository
Additional findings about our researchers and their data
In April 2018, NLM announced new search filters for PubMed and PMC. The “has data avail” filter allows users to narrow their search to journal articles that have data availability statements. Using that filter, we were able to limit our search to journal articles that included data availability statements and had at least one author affiliated with NYU. Our search strategy is listed below:
has data avail[filter] AND ((nyu langone school of medicine[ad]) OR (new york university langone school of medicine[ad]) OR (langone school of medicine[ad]) OR (New york univ School of Medicine[ad]) OR (Nyu School of Medicine[ad]) OR (New York University School of Medicine[ad]) OR (langone medical center[ad]) OR (nyu medical center[ad]) OR (new york university medical center[ad]) OR (new york university langone[ad]) OR (langone health[ad]) OR (NYU Langone[ad]) OR (nyulmc[ad]) OR (nyumc[ad]) OR (NYU Medical School[ad]) OR (New York University Medical School[ad]) OR (hospital for joint disease[ad]) OR (hospital for joint diseases[ad]) OR (harkness center for dance[ad]))
Once we identified the total number of articles that fit our criteria, our developer (whose other work you can read about here) pulled all 517 results into a spreadsheet for us to review. Together, we worked to identify articles that included viable datasets that could be indexed in the Data Catalog.
The purpose of this exercise was to explore what we could learn about our institution, our researchers, and their data. By examining each data availability statement and investigating the information provided, we were able to categorize data availability statements into four discrete groups:
Data available by emailing the author
Data available by applying to a consortium
Data available through a repository
Data available in supporting information files
This approach also allowed us to locate researchers who had not complied, or were no longer complying, with publishers’ open data requirements.
This exploration of the “has data avail” filter demonstrates the heterogeneity of data practices in biomedical research, which in turn shows why the flexibility of the Data Catalog is imperative. By pointing researchers to other resources and not requiring them to upload their data, the Data Catalog can accommodate the wide variety of ways that researchers choose to make their data available.
What is team science?
Team science is a collaborative effort to address scientific challenges by leveraging the strengths and expertise of professionals trained in different fields. One of the overarching goals of the Clinical and Translational Science Awards (CTSA) given to select institutions is to promote team science by establishing mechanisms through which biomedical researchers can collaborate, be trained in why team science is important, and develop evaluation measures to assess teamwork in biomedical research contexts.
About the award
Last week, the Data Catalog Collaboration Project (DCCP) learned that it had received an award from the CTSA Great Team Science Contest, which asked CTSA-funded hubs to submit examples of team science successes to be evaluated by a review panel and presented at the fall meeting. Each application was scored in several categories, including overall score, top importance, top innovation, and top impact. Of the 170 applications submitted, the DCCP received the highest score in the Top Importance category. I presented at the Fall CTSA Program Meeting, where I was able to discuss the value of the data catalog approach with leaders in biomedical translational research. The people I spoke to were most interested in how a data catalog can make hard-to-find research datasets, stored in various places across their institutions, discoverable through a single system.
Expanding our reach beyond libraries
From our perspective, the most exciting part of receiving this award was that our approach was judged the most important project by a community that extends well beyond the realm of libraries. That approach involves having libraries implement local data catalogs, establishing collaborations between librarians and developers to improve data discovery, fostering partnerships with local institutional research initiatives, and making concerted efforts to lower the barriers researchers face in sharing their data. This is a considerable achievement, because the other submitted projects were very strong and addressed a diverse range of team science initiatives. The DCCP has long advocated for making institutional research data discoverable, available, and usable regardless of where it is stored, and this award is an acknowledgement that the broader biomedical research community agrees.
The DCCP has grown to eight libraries working to improve institutional data discovery, and this award can serve as evidence of its value to libraries and institutions interested in addressing their data discovery needs. The DCCP members all provide a great service to their institutions and to the other libraries participating in this effort. If you are interested in joining this effort, please reach out to us.
Since the launch of the UMB Data Catalog one year ago, the HS/HSL’s DC team has worked to create records reflecting the diversity of studies undertaken by researchers in the University’s schools of medicine, nursing, pharmacy, dentistry, and social work. As a result, the datasets curated thus far range from foster parent experiences with Maryland’s court process to a collection of pharmaceutical clinical study reports on neuraminidase inhibitors for treating influenza to pre-ART HIV care outcomes in adults in Kenya and Tanzania. Working with our researchers has been a rewarding experience which, on occasion, has led to unexpected discoveries.
Recently, through correspondence with a faculty member in the school of medicine, we learned of the existence of a highly specialized resource. Since 1993, researchers in the SOM’s Division of Endocrinology, Diabetes and Nutrition have been studying the Old Order Amish (OOA) community in Lancaster County, Pennsylvania. The current population of approximately 35,000 individuals is directly descended from a small number of Anabaptist Christian founder families who immigrated to the United States in the late 1700s. Their simple lifestyle and common lineage make the community ideally suited for epidemiological, genetic, and other health-related investigations. The Amish Complex Genetic Disease Database is the result of over two decades of research associated with this distinctive population. Data have been compiled from over 7,000 volunteers who have participated in one or more studies in a variety of investigational areas, including diabetes, longevity, whole genome sequencing, blood pressure, and osteoporosis. Funded by a series of grants, work with the Amish over the last 20+ years has resulted in more than 200 papers, the majority of which utilize the information accumulated in the database.
This is a prime example of a unique resource whose significance is now highlighted by the metadata in its descriptive record. Through the UMB Data Catalog, the visibility of the Amish Complex Genetic Disease Database has been increased, providing opportunities for data re-use and future collaborations.
The implementation of Electronic Health Record (EHR) systems has allowed researchers to leverage clinical data for research purposes. At NYU Langone Health, researchers are able to work with administrators to pull data from the EHR system and study the patient population of NYU Langone Health as well as the health care services offered here. To assist in this process, the NYU Health Sciences Library began cataloging this data in the NYU Data Catalog.
Guest post by Ian Lamb, Senior Developer at NYU Health Sciences Library
Early in the development of the data catalog at NYU, one of our team members floated the idea of publishing the dataset records in a Linked Open Data (LOD) format. As the developer on the project, the implementation of this idea would fall to me, so I did what every good developer does in this situation: pretend to know what’s going on, and then Google it when the meeting’s over.
At the time, LOD had not seen much uptake in the tech world, which would explain my ignorance, but libraries, museums, and other institutions were interested in the idea. The phrase refers to data that is “linked” to other data, in a machine-readable way, so that machines can begin to understand what our webpages are really about. For instance, if we have a dataset that was published by the US Department of Energy, we probably already state that in a human-readable way, i.e. in a sentence on the page. But it takes a great deal of effort for a computer to figure out what that means, requiring the use of natural language processing and other artificial intelligence algorithms. LOD allows us to explain this relationship to a computer in a much more direct fashion.
Using a standardized data interchange format (such as JSON-LD), along with a formal schema such as those from Schema.org, we can explicitly link this dataset to the Department of Energy. And when this dataset becomes linked to the US Department of Energy, it also becomes linked to a lot of other information – where the Department of Energy headquarters is, who runs it, other types of research it may produce, where its funding comes from, etc. A machine can follow these relationships and start to form a picture of what the Department of Energy really is, and how our dataset fits into the big picture.
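To make this concrete, a record might express that relationship in JSON-LD along these lines (a simplified sketch, not our actual catalog output; the dataset name and URL here are illustrative):

```json
{
  "@context": "https://schema.org",
  "@type": "Dataset",
  "name": "Example energy dataset",
  "publisher": {
    "@type": "GovernmentOrganization",
    "name": "U.S. Department of Energy",
    "sameAs": "https://www.energy.gov"
  }
}
```

Because `sameAs` points at a well-known identifier, a machine reading this record can follow the link and connect our dataset to everything else it knows about that organization, with no natural language processing required.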
This network of structured, linked data underpins the Semantic Web – a proposed new evolution of the web in which the whole internet would be imbued with meaning in this way. The Semantic Web was first proposed in 2001 in an article by Tim Berners-Lee and colleagues, and the phrase Linked Open Data followed a few years later. But while these concepts have been around for decades, there are still relatively few companies or institutions that publish LOD, and even fewer that utilize it. The Wikipedia article on the Semantic Web has a whole section explaining some reasons for this. Indeed, the most mainstream use of JSON-LD up to this point has been to improve search engine rankings and alter the appearance of a search result in Google.
So imagine my surprise recently, when Google quietly rolled out a search engine for datasets that relies almost entirely on data they’ve harvested from repositories and data catalogs using JSON-LD. Apparently Google has not given up on linked data. They’re just taking their time with it.
Dataset Search, as it is humbly called, is still in beta, and there do appear to be some quirks in the metadata it displays. Take the Asian American Partnership in Research and Empowerment dataset, for example. Here is the record as it appears on our data catalog, and here it is in Google’s search. To figure out why the search result appears so sparse, I fed it into Google’s Structured Data Testing Tool, which shows us the dataset as Google sees it based on our linked data output. (The JSON-LD itself is embedded in the page source, and you can see it by scrolling the left-hand frame down to line 39.)
The tool shows that Google is receiving lots of metadata about the dataset itself, including authors, keywords, associated citations, file formats, and more. It also shows the various relationships we’re specifying for it: the dataset record isPartOf a data catalog, the provider of which is the NYU Health Sciences Library, which has a parentOrganization of the NYU School of Medicine, and the url for that entity is…etc. But out of all this data, only the authors and the description currently appear on Google’s search result. There is a link to some citations there, but they’re not the ones we’ve specified – they appear to simply be searching for the name of the dataset in Google Scholar. Similarly, Google lists the dataset as provided by New York University, which, while correct in a sense, is not the entity we’ve specified as the provider or even the parent organization of the provider.
Some of this discrepancy is because we’ve chosen slightly different field names than Google has. For example, we are listing the file format as an encodingFormat attached to the main dataset entity, but using another Google tool, I’ve found that they expect any encodingFormats to be listed under a distribution, which would look like this:
"name": "Asian American Partnership in Research and Empowerment",
"description": "Project AsPIRE (Asian American Partnership in Research and Empowerment) was a community-based…",
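Filling that in, the distribution-based structure Google appears to expect would look something like this (the encodingFormat value here is illustrative; a real record would carry its actual file formats):

```json
{
  "@context": "https://schema.org",
  "@type": "Dataset",
  "name": "Asian American Partnership in Research and Empowerment",
  "distribution": [
    {
      "@type": "DataDownload",
      "encodingFormat": "text/csv"
    }
  ]
}
```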
Other fields, such as Keywords, don’t appear to be supported at all. At this stage we can’t know if they’re going to be adding any additional fields or how much of this is going to change, but we have identified a few small alterations we can make to our JSON-LD output that should improve our results.
So there are some kinks to be worked out for sure, and it remains to be seen how much Google is going to invest in this effort and how aggressively they plan to publicize it. But it does seem to represent a significant move into this space. And if it does pick up steam, there is a very substantial upside for any institution using our catalog code: instant inclusion in a worldwide, highly-visible dataset search engine.
The Data Catalog Collaboration Project-Basic Science (DCCP-BS) is a working group with the objective of creating best practices for curating basic science-related records into the DCCP catalogs. Members of this DCCP subgroup include subject specialists, catalogers, and data/metadata librarians from the Universities of Pittsburgh, Maryland-Baltimore, and North Carolina.
The group formed after the realization that contributors from the various DCCP institutions were interpreting the field definitions of existing metadata elements differently when curating basic science-related data catalog entries. Upon reflection, this is not surprising, as the original DCCP metadata schema focused on human subject datasets and did not sufficiently capture information specific to animal research and basic science datasets.
The data catalog metadata schema used by the DCCP was created at NYU by analyzing and comparing existing metadata schemas that focus on indexing research data, specifically:
Elements were selected from these schemas based on their relevance and applicability to the datasets described within the data catalog. One of our main goals was to ensure that our metadata could be transferred to future national data discovery systems from the NIH and others, so that when those systems become available the transfer will be seamless. The existing DCCP metadata schema and documentation are available here. Since its creation, DCCP members have gradually adapted and modified the metadata to accommodate new types of datasets.
The DCCP-BS has begun by focusing on issues related to GEO records and will address other types of records as necessary. The GEO accession record example shown here highlights the metadata fields under discussion by the DCCP-BS.
Data Type -- Defined as “the type of data collected or created.” Current category list is limited to nine options, including Genetic/Genomic; DCCP-BS suggestion: addition of Genetic/Genomic sub-categories to capture more specific data types such as Microarray or Sequence Reads, as well as the flexibility to add more sub-categories as necessary.
Subject of Study -- Defined as “the (strain of the) species of the subject of the study.” Current metadata subfields are limited to Species and Strain; DCCP-BS suggestion: addition of Tissue/Cell Line as a subfield.
Equipment -- Defined as “the name, URL, and contextual information about equipment used to collect or create data.” Current data entry is free text; DCCP-BS suggestion: standardized equipment names/URLs/descriptions to be shared between DCCP institutions to ease workload and facilitate consistency.
Software -- Defined as “the name, URL, and contextual information about software used to collect, create, or analyze data.” Current data entry is free text; DCCP-BS suggestion: standardized software names/URLs/descriptions to be shared between DCCP institutions to ease workload and facilitate consistency, as well as the addition of a subfield for software Version.
Study Type -- Defined as “the type of study used to collect the data.” Current category list is limited to Observational and Interventional; DCCP-BS suggestion: addition of a category to capture “bench research” (e.g., Empirical) in addition to the current clinically-defined options.
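Put together, a basic science record using these proposed additions might look something like the following sketch (the field names and values here are hypothetical, pending review by the full DCCP):

```json
{
  "dataType": {
    "category": "Genetic/Genomic",
    "subcategory": "Sequence Reads"
  },
  "subjectOfStudy": {
    "species": "Mus musculus",
    "strain": "C57BL/6",
    "tissueCellLine": "Liver"
  },
  "software": {
    "name": "Bowtie 2",
    "version": "2.3.4"
  },
  "studyType": "Empirical"
}
```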
All of the group’s suggestions on new metadata elements and their intended use will be reviewed by DCCP members and then added to the documentation if approved. This effort serves to improve metadata documentation for basic science datasets and will continue to evolve as we engage with researchers striving to make their datasets discoverable.
Have questions or comments? Leave them below or send to the DCCP-BS coordinator, Carrie Iwema (firstname.lastname@example.org).