Data Availability Series

Finding Data To Index: When the Data Availability Statement Leads Nowhere

This blog post is the final part of a series on using the “has data avail” filter on PubMed Central (PMC) to identify a wide range of institutional datasets and what we at NYU learned about our institution’s data sharing practices from this exercise. To learn more about the background of this project and how we pulled the bibliographic data used, please refer to our first post. In this post, we discuss additional findings related to the bibliographic data we pulled from PMC.

Unsavory Researcher Behavior

When investigating Data Availability Statements (DAS), we learned about how researchers use repositories, use data that is available through application to a consortium, and make their data available in Supporting Information files. Yet we also found several examples of unsavory researcher behavior. Several authors listed the data as available in non-existent repositories. For example, one researcher stated that the data was available at an institutional data access point that does not exist. Other researchers listed the data as available on their lab websites, yet when librarians examined those websites, no data was available.

Uninformed Researcher Behavior

Additionally, other Data Availability Statements seemed to demonstrate a lack of understanding of what constitutes “data” and what should be included in a statement. One statement reads, “No datasets were generated or analyzed during the current study,” even though the researchers took samples and analyzed them in the publication. Other DAS’s did not list enough information for a researcher to track down the data described. For example, one stated, “NLM has access to all the data and data are available upon request.” With so little information, it seems unlikely that the data could be located and re-used in a meaningful way.

What Librarians Can Do

While it may be easy to assume that all of these researchers are bad actors, it is also possible that the researchers require more guidance in order to write helpful and meaningful DAS’s. As librarians, we can advocate for better DAS’s by explaining what the DAS is meant to accomplish: guiding other researchers to the data for re-use or replication. While it could be helpful for librarians to develop templates, data varies immensely across disciplines and projects. Explaining the logic of the DAS will allow researchers to work out what information is necessary within the boundaries of their project and their domain.

Finding Data to Index: Data available in a repository

This blog post is part of a series on using the “has data avail” filter on PubMed Central (PMC) to identify a wide range of institutional datasets and what we at NYU learned about our institution’s data sharing practices from this exercise. To learn more about the background of this project and how we pulled the bibliographic data used, please refer to our first post. In this blog post, we will talk about Data Availability Statements that list the data as available through a repository.

Out of the 327 Data Availability Statements we reviewed, 87 listed the data as available in a repository. This means that these publications either deposited the data that the authors collected or that the authors used data that was already available in a repository. Eight of the statements indicated the data is stored in figshare, five in GitHub, four in the Open Science Framework (OSF), four in Protein Data Bank (PDB), four in Dryad, three in Dataverse, and 59 in an NCBI database, like GEO. For example, the data used in the article, “Lymphatic endothelial S1P promotes naive T cell mitochondrial function and survival,” has been deposited as GEO dataset GSE97249.
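As a quick arithmetic check, the per-repository counts above can be tallied in a few lines of Python. This is only a sketch: the variable and function names are illustrative, and the numbers are the ones reported in this post.

```python
# Counts reported in this post; the variable names are illustrative only.
repository_counts = {
    "NCBI database (e.g., GEO)": 59,
    "figshare": 8,
    "GitHub": 5,
    "Open Science Framework (OSF)": 4,
    "Protein Data Bank (PDB)": 4,
    "Dryad": 4,
    "Dataverse": 3,
}

total_reviewed = 327  # all Data Availability Statements reviewed

total_in_repository = sum(repository_counts.values())
share_of_reviewed = total_in_repository / total_reviewed


def summarize(counts):
    """Return (name, count, percent of repository statements), largest first."""
    total = sum(counts.values())
    rows = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return [(name, n, round(100 * n / total, 1)) for name, n in rows]


if __name__ == "__main__":
    for name, n, pct in summarize(repository_counts):
        print(f"{name}: {n} ({pct}%)")
    print(f"Total: {total_in_repository} of {total_reviewed} ({share_of_reviewed:.1%})")
```

The individual counts add up to the 87 repository statements, which is roughly a quarter of the 327 statements reviewed.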

repository1.JPG

For certain statements, the data in the repository was only a portion of the data used in the article. For example, in the Data Availability Statement included below, only the behavioral data and the electrophysiological data are available through figshare. The raw electrophysiological data is only available on request to the corresponding author. We included this in our tally of datasets shared in a repository even though it did not include all of the data.

repository2.JPG

While these datasets ought to be easy to catalog and make more discoverable through the NYU Data Catalog, the team ran into a significant challenge when looking through these 87 data availability statements. As we experienced with data that was made available in the Supporting Information files, it can be difficult to determine whether or not the data deposited in the repository is sufficient for other researchers to use. If, for example, a researcher only deposits aggregate data, is it worthwhile for the team at the NYU Health Sciences Library to catalog?

As with the data made available in the Supporting Information files, we address this concern on a case-by-case basis. We look at the deposited data to ensure that there is more than aggregate data, that additional context like codebooks or readme files is included, and that the data corresponds to what is described in the methods section of the publication. While we do not have the expertise to ensure that the data is accurate and adequate for re-use in every discipline, we do ensure that the dataset passes our baseline assessment before including it in the NYU Data Catalog.
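The case-by-case check described above could be recorded as a simple checklist. The following Python sketch is only an illustration: the class name, the field names, and the rule that all three criteria must be met are assumptions, and the actual review relies on a librarian's judgment rather than code.

```python
from dataclasses import dataclass


@dataclass
class BaselineAssessment:
    """Hypothetical record of one manual review; all field names are illustrative."""

    has_more_than_aggregate_data: bool
    has_context_files: bool  # e.g., a codebook or readme file
    matches_methods_section: bool

    def failed_checks(self):
        """Return the names of the criteria that were not met."""
        return [name for name, ok in vars(self).items() if not ok]

    def passes(self):
        """Assumption: a dataset passes only when every criterion is met."""
        return not self.failed_checks()


if __name__ == "__main__":
    review = BaselineAssessment(
        has_more_than_aggregate_data=True,
        has_context_files=False,
        matches_methods_section=True,
    )
    print(review.passes(), review.failed_checks())
```

Recording which criterion failed, rather than just a pass or fail result, makes it easier to revisit a dataset later if the authors add documentation.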

Regardless of these challenges, the value of including these datasets in an institutional data catalog is apparent. Cataloging the dataset allows researchers to search for the data itself, as opposed to trying to find the data through the article. It allows researchers to search for data as the primary resource, thereby providing a clearer path to access. Furthermore, researchers may not know to search every data repository when looking for data to re-use; they may not even know of the existence of each of the repositories. By including a description of the dataset, access information, and a link to the repository in the data catalog, we remove this burden from researchers looking to access data and help researchers who deposited data reach more of their colleagues.

Finding Data to Index: Data available through a consortium

This blog post is part of a series on using the “has data avail” filter on PubMed Central (PMC) to identify a wide range of institutional datasets and what we at NYU learned about our institution’s data sharing practices from this exercise. To learn more about the background of this project and how we pulled the bibliographic data used, please refer to our first post. This blog post will talk about Data Availability Statements that list the data as available through request to a consortium or a committee.

In the bibliographic data we pulled from PMC, approximately 2% of the journal articles listed the data as available by request through a consortium or a committee. What this means is that another researcher could request the data from a third party that is charged with managing re-use requests for the data. You can see an example of a data availability statement for such a dataset below:

Example Data Availability Statement

While some of the data used in this example is available through the Gene Expression Omnibus (GEO), additional data from the Bipolar Disorders Working Group of the Psychiatric Genomics Consortium was used. In this situation, the authors applied for access to data from the Psychiatric Genomics Consortium and, through the Data Availability Statement, are instructing readers on where they may obtain the data for re-use. The Psychiatric Genomics Consortium, in turn, provides detailed documentation of the data.

Of the several modes of access described in Data Availability Statements, data that is available through a consortium or committee is one of the rarer options. Many of these datasets are large genomic datasets or datasets generated by city, state, or national entities (e.g., the New York City Department of Education). Because of the size of these datasets and the pre-existing infrastructure for re-use, many of them have already helped researchers generate new research and have resulted in multiple publications. For example, the NYU Data Catalog includes a record for the dataset collected during the Counseling African Americans to Control Hypertension (CAATCH) Trial, which is governed by a Study Oversight Committee. This dataset has already generated at least 6 separate journal articles.

The benefit of cataloging these types of datasets is that they were previously only discoverable through serendipity, or through referral by a colleague. When datasets available through a consortium or committee are indexed in the data catalog, the work that these committees have already done to make their data re-usable will be more visible to the public, and researchers will be more likely to re-use these large and valuable datasets.

Finding Data to Index: Data found in Supporting Information files

This blog post is part of a series on using the “has data avail” filter on PubMed Central (PMC) to identify a wide range of institutional datasets and what we at NYU learned about our institution’s data sharing practices from this exercise. Second in this series, this post discusses articles that state all relevant data are included in the Supporting Information files.

Finding Data to Index: Experimenting with PMC

Since the early days of the Data Catalog, we have experimented with different ways to locate institutional datasets suitable for indexing. Recently, with the help of the folks at the National Library of Medicine (NLM), we were able to create a new workflow for locating data. In a series of blog posts, we will be writing about our experiences using the “has data avail” filter on PubMed Central (PMC) to identify a wide range of institutional datasets as well as what we learned about our institution’s data sharing practices from this exercise.

This series will be split into five blog posts:

  1. This introductory post

  2. An examination of data found in supporting information files

  3. An examination of data available through a consortium

  4. An examination of data available through a repository

  5. Additional findings about our researchers and their data

In April 2018, NLM announced new search filters for PubMed and PMC. The “has data avail” filter allows users to narrow their search to journal articles that have data availability statements. Using that filter, we were able to limit our search to journal articles that included data availability statements and had at least one author affiliated with NYU. Our search strategy is listed below:

has data avail[filter] AND ((nyu langone school of medicine[ad]) OR (new york university langone school of medicine[ad]) OR (langone school of medicine[ad]) OR (New york univ School of Medicine[ad]) OR (Nyu School of Medicine[ad]) OR (New York University School of Medicine[ad]) OR (langone medical center[ad]) OR (nyu medical center[ad]) OR (new york university medical center[ad]) OR (new york university langone[ad]) OR (langone health[ad]) OR (NYU Langone[ad]) OR (nyulmc[ad]) OR (nyumc[ad]) OR (NYU Medical School[ad]) OR (New York University Medical School[ad]) OR (hospital for joint disease[ad]) OR (hospital for joint diseases[ad]) OR (harkness center for dance[ad]))
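A long boolean query like this is easier to maintain when it is assembled from a list of affiliation strings. The Python sketch below rebuilds a shortened version of the query and shows how it might be passed to PMC through NCBI's E-utilities `esearch` endpoint. The search described in this post does not state that E-utilities was used, so that step, the function names, and the `retmax` value are assumptions for illustration only.

```python
from urllib.parse import urlencode

# A few of the affiliation variants from the search strategy above (list shortened).
AFFILIATIONS = [
    "nyu langone school of medicine",
    "New York University School of Medicine",
    "langone health",
    "hospital for joint diseases",
]

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_query(affiliations):
    """Combine the data availability filter with OR-ed affiliation terms."""
    affiliation_clause = " OR ".join(f"({name}[ad])" for name in affiliations)
    return f"has data avail[filter] AND ({affiliation_clause})"


def build_esearch_url(query, retmax=1000):
    """Build (but do not send) an esearch request against the PMC database."""
    params = {"db": "pmc", "term": query, "retmax": retmax, "retmode": "json"}
    return f"{ESEARCH_URL}?{urlencode(params)}"


if __name__ == "__main__":
    query = build_query(AFFILIATIONS)
    print(query)
    print(build_esearch_url(query))
```

Keeping the affiliation variants in a list means a newly discovered spelling of the institution's name can be added in one place without editing the nested parentheses by hand.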

Once we identified the total number of articles that fit our criteria, our developer (whose other work you can read about here) pulled all 517 results into a spreadsheet for us to review. Together, we worked to identify articles that included viable datasets that could be indexed in the Data Catalog.

Search Results in PMC for articles published by researchers at NYU that also include a data availability statement.

Example of a data availability statement, taken from: Chapman JR, Balasubramanian D, Tam K, Askenazi M, Copin R, Shopsin B, Torres VJ, Ueberheide BM. Using Quantitative Spectrometry to Understand the Influence of Genetics and Nutritional Perturbations on the Virulence Potential of Staphylococcus aureus. Mol Cell Proteomics. 2017 Apr; 16(4 Suppl 1): S15 - S28.

The purpose of this exercise was to explore what we could learn about our institution, our researchers, and their data. By examining each data availability statement and investigating the information provided, we were able to categorize data availability statements into four discrete groups:

  • Data available by emailing the author

  • Data available by applying to a consortium

  • Data available through a repository

  • Data available in supporting information files

This approach also allowed us to locate researchers who had not complied, or were no longer complying, with publishers’ open data requirements.

This exploration of the “has data avail” filter demonstrates the heterogeneity of data practices in biomedical research, which in turn demonstrates why the flexibility of the Data Catalog is imperative. By pointing researchers to other resources and not requiring them to upload their data, the Data Catalog can accommodate the wide variety of ways that researchers choose to make their data available.