Finding Data to Index: Data available in a repository

This blog post is part of a series on using the “has data avail” filter on PubMed Central (PMC) to identify a wide range of institutional datasets, and on what we at NYU learned about our institution’s data sharing practices from this exercise. To learn more about the background of this project and how we pulled the bibliographic data used, please refer to our first post. In this post, we discuss Data Availability Statements that list the data as available through a repository.

Of the 327 Data Availability Statements we reviewed, 87 listed the data as available in a repository. In these publications, the authors either deposited data they had collected or used data that was already available in a repository. Eight of the statements indicated the data is stored in figshare, five in GitHub, four in the Open Science Framework (OSF), four in the Protein Data Bank (PDB), four in Dryad, three in Dataverse, and 59 in an NCBI database, such as GEO. For example, the data used in the article “Lymphatic endothelial S1P promotes naive T cell mitochondrial function and survival” has been deposited as GEO dataset GSE97249.
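The tally above can be checked with a short script. The Python sketch below is illustrative only and is not the tooling used for the review; the variable names are hypothetical, and the counts are copied from the paragraph above.

```python
from collections import Counter

# Counts copied from the tally in the text; names are illustrative.
repository_counts = Counter({
    "NCBI database (e.g. GEO)": 59,
    "figshare": 8,
    "GitHub": 5,
    "Open Science Framework (OSF)": 4,
    "Protein Data Bank (PDB)": 4,
    "Dryad": 4,
    "Dataverse": 3,
})
TOTAL_STATEMENTS_REVIEWED = 327

# Number of statements that point to a repository, and their share of all reviewed.
in_repository = sum(repository_counts.values())
share_of_reviewed = in_repository / TOTAL_STATEMENTS_REVIEWED

print(f"{in_repository} of {TOTAL_STATEMENTS_REVIEWED} statements "
      f"({share_of_reviewed:.1%}) list a repository")
for name, count in repository_counts.most_common():
    print(f"  {name}: {count} ({count / in_repository:.1%})")
```

Run as written, the per-repository counts sum to the 87 statements reported, which is roughly a quarter of the 327 reviewed, and the NCBI databases account for about two thirds of them.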


For certain statements, the data in the repository was only a portion of the data used in the article. For example, in the Data Availability Statement included below, only the behavioral data and the electrophysiological data are available through figshare, while the raw electrophysiological data is available only on request from the corresponding author. We counted statements like this one in our tally of datasets shared in a repository even though the repository did not hold all of the data.


While these datasets ought to be easy to catalog and make more discoverable through the NYU Data Catalog, the team ran into a significant challenge when looking through these 87 Data Availability Statements. As we experienced with data that was made available in the Supporting Information files, it can be difficult to determine whether the data deposited in the repository is sufficient for other researchers to use. If, for example, a researcher deposits only aggregate data, is it worthwhile for the team at the NYU Health Sciences Library to catalog it?

As with the data made available in the Supporting Information files, we address this concern on a case-by-case basis. We look at the deposited data to ensure that there is more than aggregate data, that additional context such as a codebook or readme file is included, and that the data corresponds to what is described in the methods section of the publication. While we do not have the expertise to ensure that the data is accurate and adequate for re-use in every discipline, we do ensure that the dataset passes our baseline assessment before including it in the NYU Data Catalog.
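The three checks described above can be written down as a simple checklist. The Python sketch below is a hypothetical illustration, not the team's actual workflow: the class, field, and function names are invented for this example, and each field is assumed to be filled in by a person who has inspected the deposit.

```python
from dataclasses import dataclass


@dataclass
class DepositReview:
    """Hypothetical record of what a reviewer found in a repository deposit."""
    has_more_than_aggregate_data: bool
    has_context_files: bool          # e.g. a codebook or readme file
    matches_methods_section: bool    # data corresponds to the described methods


def passes_baseline(review: DepositReview) -> bool:
    """Return True only if all three baseline checks are satisfied."""
    return (
        review.has_more_than_aggregate_data
        and review.has_context_files
        and review.matches_methods_section
    )


# A deposit with only aggregate data fails, whatever else it includes.
aggregate_only = DepositReview(False, True, True)
complete = DepositReview(True, True, True)
print(passes_baseline(aggregate_only), passes_baseline(complete))
```

Treating every check as required mirrors the case-by-case review described above; it says nothing about whether the data is accurate, which remains a judgment for researchers in the relevant discipline.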

Despite these challenges, the value of including these datasets in an institutional data catalog is apparent. Cataloging a dataset lets researchers search for the data as the primary resource, rather than trying to find it through the article, and so gives them a clearer path to access. Furthermore, researchers may not know to search every data repository when looking for data to re-use; they may not even know that each of these repositories exists. By including a description of the dataset, access information, and a link to the repository in the data catalog, we remove this burden from researchers looking to access data and help researchers who deposited data reach more of their colleagues.