Finding Data to Index: Data available in a repository

This blog post is part of a series on using the “has data avail” filter on PubMed Central (PMC) to identify a wide range of institutional datasets and what we at NYU learned about our institution’s data sharing practices from this exercise. To learn more about the background of this project and how we pulled the bibliographic data used, please refer to our first post. In this blog post, we will talk about Data Availability Statements that list the data as available through a repository.

Out of the 327 Data Availability Statements we reviewed, 87 listed the data as available in a repository. This means that the authors either deposited the data they collected or used data that was already available in a repository. Eight of the statements indicated that the data is stored in figshare, five in GitHub, four in the Open Science Framework (OSF), four in Protein Data Bank (PDB), four in Dryad, three in Dataverse, and 59 in an NCBI database, like GEO. For example, the data used in the article, “Lymphatic endothelial S1P promotes naive T cell mitochondrial function and survival,” has been deposited as GEO dataset GSE97249.
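
For readers who want to check the tally, here is a minimal Python sketch that reproduces the arithmetic from the counts reported above. The variable and function names are our own illustrative choices, not part of any tool the team used, and the repository labels are simply shorthand for the categories named in this post.

```python
# Counts as reported in this post. The names below are illustrative only.
TOTAL_REVIEWED = 327  # Data Availability Statements reviewed

repository_counts = {
    "figshare": 8,
    "GitHub": 5,
    "Open Science Framework (OSF)": 4,
    "Protein Data Bank (PDB)": 4,
    "Dryad": 4,
    "Dataverse": 3,
    "NCBI database (e.g., GEO)": 59,
}

# Number of statements that point to a repository, and their share of all
# statements reviewed.
in_repository = sum(repository_counts.values())
share_of_reviewed = in_repository / TOTAL_REVIEWED


def share_by_repository(counts):
    """Return each repository's fraction of the repository-based statements."""
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}


if __name__ == "__main__":
    print(
        f"{in_repository} of {TOTAL_REVIEWED} statements "
        f"({share_of_reviewed:.1%}) point to a repository"
    )
    shares = share_by_repository(repository_counts)
    # List repositories from most to least common.
    for name, share in sorted(shares.items(), key=lambda item: -item[1]):
        print(f"{name}: {share:.1%}")
```

The per-repository counts sum to the 87 statements mentioned above, which is roughly 27% of the 327 statements reviewed, and NCBI databases account for about two thirds of the repository-based statements.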


For certain statements, the data in the repository was only a portion of the data used in the article. For example, in the Data Availability Statement included below, only the behavioral data and the electrophysiological data are available through figshare. The raw electrophysiological data is only available on request to the corresponding author. We included this in our tally of datasets shared in a repository even though it did not include all of the data.


While these datasets ought to be easy to catalog and make more discoverable through the NYU Data Catalog, the team ran into a significant challenge when looking through these 87 Data Availability Statements. As we experienced with data that was made available in the Supporting Information files, it can be difficult to determine whether or not the data deposited in the repository is sufficient for other researchers to use. If, for example, a researcher only deposits aggregate data, is it worthwhile for the team at the NYU Health Sciences Library to catalog it?

As with the data made available in the Supporting Information files, we address this concern on a case-by-case basis. We look at the deposited data to ensure that there is more than aggregate data, that additional context like codebooks or readme files is included, and that the data corresponds to what is described in the methods section of the publication. While we do not have the expertise to ensure that the data is accurate and adequate for re-use in all disciplines, we do ensure that the dataset passes our baseline assessment before including it in the NYU Data Catalog.

Regardless of these challenges, the value of including these datasets in an institutional data catalog is apparent. Cataloging the dataset allows researchers to search for the data as the primary resource, rather than trying to find it through the article, which provides a clearer path to access. Furthermore, researchers may not know to search every data repository when looking for data to re-use; they may not even know that some of the repositories exist. By including a description of the dataset, access information, and a link to the repository in the data catalog, we remove this burden from researchers looking to access data and help researchers who deposited data reach more of their colleagues.

Harlem Health Advocacy Partners and a Case Study in Data Re-Use

In the fall of this year, a Research and Data Librarian at the NYU Health Sciences Library, Fred LaPolla, was brought in to help teach an Intensive Research Practicum for Primary Care Residents. Dr. Colleen Gillespie, the Director of the Division of Education Quality in the Institute for Innovations in Medical Education and an Associate Professor in the Department of Medicine, led the practicum and wanted residents to ask a question of a secondary dataset, analyze the data, present the results, and write up a draft of a manuscript in 10 days. Prior to the beginning of the practicum, LaPolla pointed Dr. Gillespie to the NYU Data Catalog, and she was able to contact Dr. Lorna Thorpe about the Harlem Health Advocacy Partners Data Set.

“West 125th Street looking west from Seventh Avenue, Harlem, New York City” From the Schomburg Center for Research in Black Culture, Photographs, and Prints Division, The New York Public Library. 1946.

The Harlem Health Advocacy Partners (HHAP) dataset was collected in five public housing developments in Harlem, New York City, where the chronic disease burden is high. Two rounds of data collection were performed: first, a telephone survey of 1,633 individuals and second, an interventional study of 370 individuals. The variables collected through these two rounds included age, gender, race/ethnicity, employment status, health insurance, self-reported general health, self-reported mental health, level of physical activity, smoker status, BMI, blood pressure, level of social connectedness, and specific health conditions including asthma, diabetes, hypertension, and depression. Previous articles published with this data include “A Place-Based Community Health Worker Program: Feasibility and Early Outcomes, New York City, 2015,” published in the American Journal of Preventive Medicine.

After completing the practicum, the residents worked together with Dr. Gillespie, Dr. Thorpe, and Mr. LaPolla to submit the manuscript for publication as co-authors. This case study in data re-use illustrates how the NYU Data Catalog fits into the data ecosystem, bridging connections between researchers and helping people locate relevant datasets. It also illustrates how important data re-use can be to young researchers and students, as it gives them access to data without the high cost of collecting it themselves or paying for it.

Cataloging EHR Data: Experiences at NYU Langone Health

The implementation of Electronic Health Record (EHR) systems has allowed researchers to leverage clinical data for research purposes. At NYU Langone Health, researchers are able to work with administrators to pull data from the EHR system and study the patient population of NYU Langone Health as well as the health care services offered here. To assist in this process, the NYU Health Sciences Library began cataloging this data in the NYU Data Catalog.