Finding Data to Index: Data found in Supporting Information files

This post is part of a series on using the “has data avail” filter on PubMed Central (PMC) to identify a wide range of institutional datasets, and on what we at NYU learned about our institution’s data sharing practices from this exercise. To learn more about the background of this project and how we pulled the bibliographic data used, please refer to our first post.

In the bibliographic data that we pulled from PMC, many of the data availability statements said that all of the relevant data were included in the Supporting Information files. You can see an example of such a statement below.


While cataloging data listed in Supporting Information files may seem like low-hanging fruit, indexing this type of data posed a significant challenge: determining whether the Supporting Information files actually contain enough data to justify creating a Data Catalog entry. At times, the data and documentation included in the Supporting Information files do not appear to be enough for other researchers to re-use the data meaningfully. For example, the Supporting Information file depicted below includes a file with relevant data, but because it is a PDF, it is not re-usable.


So, the question for the Data Catalog team was: how do we determine whether there is enough information included in the Supporting Information files to merit a Data Catalog entry?

Answering this question is difficult, and the team decides how to index these datasets on a case-by-case basis. Generally, we inspect the Supporting Information files to ensure that more than aggregate data is included, that additional contextual information such as a codebook is included, and that the data correspond to what is described in the methods section of the paper. Without expertise in the research area of each paper, it is difficult to guarantee that the data is accurate and adequate for re-use. However, if a dataset passes our baseline assessment, we err on the side of trusting that the researcher included all of the relevant data and documentation.
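
For readers who want to apply a similar screen to their own list of articles, the baseline assessment described above could be sketched roughly as follows. This is only an illustration, not the Data Catalog team's actual tooling: our review is done by hand, and the `SupportingFile` class, the `passes_baseline` function, and the list of machine-readable extensions are all hypothetical names and assumptions introduced for this example. The flags on each file are meant to be filled in by a person who has opened and inspected it.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath

# Assumption: an illustrative set of formats treated as re-usable.
# This is not an official rule; adjust it to your own context.
MACHINE_READABLE = {".csv", ".tsv", ".xlsx", ".xls", ".json", ".xml", ".txt"}


@dataclass
class SupportingFile:
    """One Supporting Information file, annotated by a human reviewer."""
    name: str
    has_record_level_data: bool = False  # more than aggregate data
    is_documentation: bool = False       # e.g. a codebook or data dictionary


def is_machine_readable(file: SupportingFile) -> bool:
    """Return True if the file extension is in the assumed re-usable set."""
    return PurePosixPath(file.name).suffix.lower() in MACHINE_READABLE


def passes_baseline(files: list[SupportingFile], matches_methods: bool) -> bool:
    """Apply the three checks from the text to one article's files.

    matches_methods records the reviewer's judgment that the data
    correspond to what the paper's methods section describes.
    """
    has_reusable_data = any(
        f.has_record_level_data and is_machine_readable(f) for f in files
    )
    has_context = any(f.is_documentation for f in files)
    return has_reusable_data and has_context and matches_methods


# Hypothetical example: a data file plus a codebook.
example_files = [
    SupportingFile("S1_Data.csv", has_record_level_data=True),
    SupportingFile("S2_Codebook.pdf", is_documentation=True),
]
print(passes_baseline(example_files, matches_methods=True))
```

A sketch like this only records the outcome of a manual review in a consistent way. It does not replace the judgment calls involved, and an article that fails one of the checks may still deserve a closer look.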

Below, you can find an example of a Data Catalog record that features data included in the Supporting Information files of an article. The Data Catalog record is linked to the publication itself, enabling Data Catalog users to see the publication as well as a description of the data.


By highlighting the data featured in the Supporting Information files, we are able to make that data findable within our institution. If the data were not cataloged, a researcher could only find it by searching for the article, and therefore might not locate the dataset when searching for data to re-use rather than for an article. Because the data is already available online, including it in the Data Catalog is a key use case that demonstrates how the catalog functions to make research data more discoverable, regardless of where it lives.