Workflows

Finding Data to Index: Data available through a consortium

This blog post is part of a series on using the “has data avail” filter on PubMed Central (PMC) to identify a wide range of institutional datasets and what we at NYU learned about our institution’s data sharing practices from this exercise. To learn more about the background of this project and how we pulled the bibliographic data used, please refer to our first post. This blog post will talk about Data Availability Statements that list the data as available through request to a consortium or a committee.

In the bibliographic data we pulled from PMC, approximately 2% of the journal articles listed the data as available by request through a consortium or a committee. In these cases, another researcher can request the data from a third party that is charged with managing re-use requests. You can see an example of a data availability statement for such a dataset below:

Example Data Availability Statement


While some of the data used in this example is available through the Gene Expression Omnibus (GEO), additional data from the Bipolar Disorders Working Group of the Psychiatric Genomics Consortium was also used. In this situation, the authors applied for access to data from the Psychiatric Genomics Consortium and, through the Data Availability Statement, are telling readers where they may obtain the data for re-use. The Psychiatric Genomics Consortium, in turn, provides detailed documentation of the data.

Of the several modes of access described in Data Availability Statements, data that is available through a consortium or committee is one of the rarer options. Many of these datasets are large genomic datasets or datasets generated by city, state, or national entities (e.g., the New York City Department of Education). Because of the size of these datasets and the pre-existing infrastructure for re-use, many of them have already helped researchers generate new research and have resulted in multiple publications. For example, the NYU Data Catalog includes a record for the dataset collected during the Counseling African Americans to Control Hypertension (CAATCH) Trial, which is governed by a Study Oversight Committee. This dataset has already generated at least 6 separate journal articles.

The benefit of cataloging these types of datasets is that they were previously only discoverable through serendipity, or through referral by a colleague. When datasets available through a consortium or committee are indexed in the data catalog, the work that these committees have already done to make their data re-usable will be more visible to the public, and researchers will be more likely to re-use these large and valuable datasets.

Finding Data to Index: Data found in Supporting Information files

This blog post is part of a series on using the “has data avail” filter on PubMed Central (PMC) to identify a wide range of institutional datasets and what we at NYU learned about our institution’s data sharing practices from this exercise. Second in this series, this post discusses articles that state all relevant data are included in the Supporting Information files.

Finding Data to Index: Experimenting with PMC

Since the early days of the Data Catalog, we have experimented with different ways to locate institutional datasets suitable for indexing. Recently, with the help of the folks at the National Library of Medicine (NLM), we were able to create a new workflow for locating data. In a series of blog posts, we will be writing about our experiences using the “has data avail” filter on PubMed Central (PMC) to identify a wide range of institutional datasets as well as what we learned about our institution’s data sharing practices from this exercise.

This series will be split into five blog posts:

  1. This introductory post

  2. An examination of data found in supporting information files

  3. An examination of data available through a consortium

  4. An examination of data available through a repository

  5. Additional findings about our researchers and their data

In April 2018, NLM announced new search filters for PubMed and PMC. The “has data avail” filter allows users to narrow their search to journal articles that have data availability statements. Using that filter, we were able to limit our search to journal articles that included data availability statements and had at least one author affiliated with NYU. Our search strategy is listed below:

has data avail[filter] AND ((nyu langone school of medicine[ad]) OR (new york university langone school of medicine[ad]) OR (langone school of medicine[ad]) OR (New york univ School of Medicine[ad]) OR (Nyu School of Medicine[ad]) OR (New York University School of Medicine[ad]) OR (langone medical center[ad]) OR (nyu medical center[ad]) OR (new york university medical center[ad]) OR (new york university langone[ad]) OR (langone health[ad]) OR (NYU Langone[ad]) OR (nyulmc[ad]) OR (nyumc[ad]) OR (NYU Medical School[ad]) OR (New York University Medical School[ad]) OR (hospital for joint disease[ad]) OR (hospital for joint diseases[ad]) OR (harkness center for dance[ad]))
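A search string like this can also be assembled programmatically, which makes it easier to add or remove affiliation variants. The following is a minimal sketch, not the workflow our developer used: it assumes the same query syntax is accepted by NCBI's E-utilities ESearch endpoint, uses only a handful of the affiliation variants listed above, and the names `build_query` and `build_esearch_url` are hypothetical helpers introduced for illustration. It only builds the request URL and does not send it.

```python
from urllib.parse import urlencode

# NCBI E-utilities ESearch endpoint. Using it for this search is an
# assumption; the post does not say how the results were retrieved.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# A few of the affiliation variants from the search strategy above.
AFFILIATIONS = [
    "nyu langone school of medicine",
    "New York University School of Medicine",
    "langone health",
    "hospital for joint diseases",
]


def build_query(affiliations):
    """Combine the data availability filter with OR-ed affiliation terms."""
    affiliation_clause = " OR ".join(f"({name}[ad])" for name in affiliations)
    return f"has data avail[filter] AND ({affiliation_clause})"


def build_esearch_url(query, retmax=1000):
    """Return an ESearch URL asking PMC for matching record IDs as JSON."""
    params = {"db": "pmc", "term": query, "retmax": retmax, "retmode": "json"}
    return f"{ESEARCH_URL}?{urlencode(params)}"


query = build_query(AFFILIATIONS)
url = build_esearch_url(query)
print(query)
print(url)
```

Keeping the affiliation variants in a list means the full set from the search strategy can be pasted in without hand-editing parentheses and OR operators.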

Once we identified the total number of articles that fit our criteria, our developer (whose other work you can read about here) pulled all 517 results into a spreadsheet for us to review. Together, we worked to identify articles that included viable datasets that could be indexed in the Data Catalog.

Search Results in PMC for articles published by researchers at NYU that also include a data availability statement.


Example of a data availability statement, taken from: Chapman JR, Balasubramanian D, Tam K, Askenazi M, Copin R, Shopsin B, Torres VJ, Ueberheide BM. Using Quantitative Spectrometry to Understand the Influence of Genetics and Nutritional Perturbations on the Virulence Potential of Staphylococcus aureus. Mol Cell Proteomics. 2017 Apr; 16(4 Suppl 1): S15-S28.


The purpose of this exercise was to explore what we could learn about our institution, our researchers, and their data. By examining each data availability statement and investigating the information provided, we were able to categorize data availability statements into four discrete groups:

  • Data available by emailing the author

  • Data available by applying to a consortium

  • Data available through a repository

  • Data available in supporting information files

This approach also allowed us to locate researchers who had not complied, or were no longer complying, with publishers’ open data requirements.
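We assigned the four categories above by reading each statement. For a larger set of results, a first-pass keyword triage could narrow down what needs a human look. The sketch below is an illustration only: the phrases in `RULES` and the `triage` function are assumptions made for this example, not the criteria we applied, and anything the rules miss is flagged for manual review.

```python
# Illustrative keyword rules. These phrases are assumptions for the
# example, not the criteria used in the manual review described here.
# Order matters: the first category with a matching phrase wins.
RULES = [
    ("supporting information", ("supporting information", "supplementary", "within the paper")),
    ("repository", ("accession", "deposited", "repository", "figshare", "dryad")),
    ("consortium", ("consortium", "committee")),
    ("author", ("upon request", "on request", "corresponding author")),
]


def triage(statement):
    """Return a tentative category for one data availability statement."""
    text = statement.lower()
    for category, phrases in RULES:
        if any(phrase in text for phrase in phrases):
            return category
    return "needs review"


examples = [
    "All relevant data are within the paper and its Supporting Information files.",
    "Data are available upon request to the Study Oversight Committee.",
    "Data are available from the corresponding author upon request.",
]
for statement in examples:
    print(triage(statement), "<-", statement)
```

The consortium rule is checked before the author rule so that a statement asking readers to send a request to a committee is not mistaken for a request to an individual author.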

This exploration of the “has data avail” filter demonstrates the heterogeneity of data practices in biomedical research, which in turn demonstrates why the flexibility of the Data Catalog is imperative. By pointing researchers to other resources and not requiring them to upload their data, the Data Catalog can accommodate the wide variety of ways that researchers choose to make their data available.

Cataloging EHR Data: Experiences at NYU Langone Health

The implementation of Electronic Health Record (EHR) systems has allowed researchers to leverage clinical data for research purposes. At NYU Langone Health, researchers are able to work with administrators to pull data from the EHR system and study the patient population of NYU Langone Health as well as the health care services offered here. To assist in this process, the NYU Health Sciences Library began cataloging this data in the NYU Data Catalog.