data discovery

Librarians practicing what we preach: Making our Library Research Discoverable through the Pitt Data Catalog

Academic and hospital libraries that offer data services often provide guidance and training on data sharing and reuse, covering topics such as:

  • funder/journal requirements for sharing

  • the benefits of data sharing such as enhanced transparency and reproducibility and the potential to find new collaborators

  • proper documentation to accompany a dataset

  • how to identify appropriate data repositories and evaluate them to determine the most suitable

  • locating existing datasets for reuse

In 2017, we at the University of Pittsburgh, Health Sciences Library System, decided it was time that we “practice what we preach” and over the next two years deposited four datasets from our own research into the repository figshare. Our initial goals were to:

  • understand and document the data deposit workflow in order to assist researchers

  • facilitate requests from colleagues to share our data and survey instruments

  • make unpublished results discoverable

  • track the usage of our data

  • model best practices to researchers and librarian colleagues

Given the new data sharing policy for the Journal of the Medical Library Association that will go into effect October 2019, we believe this last goal is of particular importance.

As the University of Pittsburgh is one of the nine partners of the Data Catalog Collaboration Project (DCCP), in addition to depositing our datasets in figshare, we also added a metadata record for each dataset to the Pitt Data Catalog. Available datasets to date:

These records increase the visibility of our data (one of the mission statements of the DCCP) and provide an additional access point.

This blog post is adapted from the MLA presentation: Ratajeski, M.A. and Iwema, C.L. (2019, May). Practicing What We Preach: Making Our Own Research Data Open Access. Lightning Talk presented at Medical Library Association Annual Conference, Chicago, IL.

DCCP at MLA '19: Check out our Slides, Notes, Posters and More!

It was an eventful week in Chicago for MLA ‘19!

While we wish everyone was able to make it to the conference, we know that isn’t always possible, so we have uploaded all of the slides, posters, and notes related to the DCCP and our work. Below, we have listed a description of each presentation, the slides or poster, and a person to contact if you have any questions.

The DCCP Information Session

Kevin Read presenting at the DCCP Information Session at MLA ‘19


  • Provided information about what it means to join the DCCP, how to implement the Data Catalog, and how different institutions are using the catalog for their specific needs

  • Link to slides

  • Link to notes

  • Contact: Kevin Read, DCCP Project Lead:

Paper presentation: From Conception to Action: Elevating Library Projects through Collaboration between Librarians and Developers

  • Demonstrates how developers and librarians have worked together on the Data Catalog and other library projects, and provides tips on improving developer and librarian collaborations

  • Link to the slides

  • Contact: Ian Lamb, Solutions Developer,

Paper presentation: Developing Workflows to Facilitate the Sharing of Electronic Health Record Data

  • Discusses how NYU created a process to include Electronic Health Record (EHR) data in the NYU Data Catalog. Outlines the workflow and provides example records for EHR data in the NYU Data Catalog

  • Link to the slides

  • Contact: Nicole Contaxis, NYU Data Catalog Coordinator:

Paper presentation: Creating Institution Specific Resources on Data Transfer and Data Sharing

  • Illustrates how NYU supplements their work on the NYU Data Catalog with ongoing projects to help researchers transfer and share their data while still being in compliance with national regulation, funder and publisher requirements, and institutional policy

  • Link to the slides

  • Contact: Nicole Contaxis, NYU Data Catalog Coordinator:

Poster: A Multisite Collaboration to Improve Data Curation and Discovery in Academic Health Sciences Centers


  • Provided information on what the Data Catalog Collaboration is, what our goals are, and ways that the Data Catalog is used at participating institutions

  • Contact: Kevin Read, DCCP Project Lead:

  • Link to the poster

Poster: Outreach Strategies and Researchers’ Motivations for Sharing Data through a Data Catalog

  • Demonstrated why researchers share data through the Data Catalog as well as the outreach strategies employed at different institutions in the DCCP

  • Link to the poster

  • Contact: Melissa A. Ratajeski, Pitt Data Catalog Lead and Coordinator of Data Services at the University of Pittsburgh Health Sciences Library System,

Poster: Using the PubMed Central Data Availability Search Filter and an Institutional Data Catalog to Make Data more Discoverable

  • Illustrates how NYU is using the PubMed Central (PMC) Data Availability Search filter to add new datasets to the NYU Data Catalog. Includes the workflow and an example record

  • Link to the poster

  • Contact: Nicole Contaxis, NYU Data Catalog Coordinator,

Identifying Significant Growth in Data Sharing: Results from the Annual NYU Data Catalog Contributor Survey

The NYU Data Catalog is designed to facilitate data sharing, and with data from our annual surveys in 2018 and 2019, we can now see growth in the number of researchers participating in the NYU Data Catalog and in the number of interactions researchers have around data sharing because of the NYU Data Catalog.

Of the researchers who responded to our 2019 survey (48.2% response rate), 46.3% were contacted at least once about data sharing and the NYU Data Catalog. This is a marked increase over the 27.8% of respondents who reported being contacted in 2018. Furthermore, between 2018 and 2019, the number of contributors to the data catalog increased by 51%.
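The change in the share of contacted researchers can be expressed in two ways: as a difference in percentage points, or as a relative increase over the 2018 figure. The short sketch below is illustrative only. It uses just the two percentages reported in this post (27.8% in 2018 and 46.3% in 2019), and the function and variable names are hypothetical rather than part of the actual survey analysis.

```python
def percentage_point_change(before_pct: float, after_pct: float) -> float:
    """Absolute difference between two percentages, in percentage points."""
    return after_pct - before_pct


def relative_change_pct(before_pct: float, after_pct: float) -> float:
    """Change relative to the starting percentage, expressed as a percent."""
    return (after_pct - before_pct) / before_pct * 100


# Figures reported in the post; the variable names are illustrative.
contacted_2018 = 27.8  # % of respondents contacted about data sharing in 2018
contacted_2019 = 46.3  # % of respondents contacted about data sharing in 2019

points = round(percentage_point_change(contacted_2018, contacted_2019), 1)
relative = round(relative_change_pct(contacted_2018, contacted_2019), 1)

print(f"Increase: {points} percentage points ({relative}% relative to 2018)")
```

With these inputs, the difference is 18.5 percentage points, which is roughly a 66.5% relative increase over the 2018 share. Stating which of the two measures is meant avoids ambiguity when reporting year-over-year survey results.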

Rubella research. Photograph from the National Library of Medicine Digital Collections, UID: 101541114. Available at:


Researchers who were surveyed either serve as local experts on external datasets, like the New York City Community Health Survey, or have contributed datasets that are a product of their original research, like Dr. Scott Sherman’s CHART New York Smoking-Cessation Interventions for Urban Hospital Patients Dataset. The annual NYU Data Catalog Contributor survey gives us a better understanding of how researchers are using the catalog and sharing their data, thus providing a way to measure change in data sharing practices over time.

The annual surveys ask five questions:

  • Have you generated new datasets this year?

  • Are you willing to have the new datasets described in the NYU Data Catalog?

  • Are there any changes or modifications to the datasets already described in the NYU Data Catalog?

  • Please briefly describe those changes to your dataset(s).

  • How many times have you been contacted by people asking about a dataset in the NYU Data Catalog?

In a later blog post, we will discuss other data points and new questions that were added to the survey in 2019 to help us better understand researcher data sharing practices.

Harlem Health Advocacy Partners and a Case Study in Data Re-Use

In the fall of this year, a Research and Data Librarian at the NYU Health Sciences Library, Fred LaPolla, was brought in to help teach an Intensive Research Practicum for Primary Care Residents. Dr. Colleen Gillespie, the Director of the Division of Education Quality in the Institute for Innovations in Medical Education and an Associate Professor in the Department of Medicine, led the practicum and wanted residents to ask a question of a secondary dataset, analyze the data, present the results, and write up a draft of a manuscript in 10 days. Prior to the beginning of the practicum, LaPolla pointed Dr. Gillespie to the NYU Data Catalog, and she was able to contact Dr. Lorna Thorpe about the Harlem Health Advocacy Partners Data Set.

“West 125th Street looking west from Seventh Avenue, Harlem, New York City” From the Schomburg Center for Research in Black Culture, Photographs, and Prints Division, The New York Public Library. 1946.


The Harlem Health Advocacy Partners (HHAP) dataset was collected in five public housing developments in Harlem, New York City, where the chronic disease burden is high. Two rounds of data collection were performed: first, a telephone survey of 1,633 individuals and second, an interventional study of 370 individuals. The variables collected across these two rounds included age, gender, race/ethnicity, employment status, health insurance, self-reported general health, self-reported mental health, level of physical activity, smoker status, BMI, blood pressure, level of social connectedness, and specific health conditions including asthma, diabetes, hypertension, and depression. Previous articles published with this data include “A Place-Based Community Health Worker Program: Feasibility and Early Outcomes, New York City, 2015,” published in the American Journal of Preventive Medicine.

After completing the practicum, the residents worked together with Dr. Gillespie, Dr. Thorpe, and Mr. LaPolla to submit the manuscript for publication as co-authors. This case study in data re-use illustrates how the NYU Data Catalog fits into the data ecosystem, building connections between researchers and helping people locate relevant datasets. It also illustrates how important data re-use can be to young researchers and students, since it gives them access to data without the high cost of collecting or purchasing it themselves.

Data in the News: ProPublica and the U.S. Health and Retirement Study

As the year wound down and we all recovered from the busy holiday season, ProPublica published an article on the ways in which employers push older U.S. workers out of their jobs. The article, “If You’re Over 50, Chances Are The Decision to Leave a Job Won’t be Yours,” by Peter Gosselin uses data from the U.S. Health and Retirement Study (HRS) from the University of Michigan. Gosselin refers to HRS as the “premier source of quantitative information about aging in America,” as it provides longitudinal data about 20,000 people in the United States aged 50 and older.

The NYU Data Catalog includes datasets collected outside of NYU (e.g. by the U.S. Census Bureau or by other universities) in order to help researchers locate datasets that they may not otherwise know about. The HRS is one of the external datasets included in the NYU Data Catalog, and two faculty members act as local experts on the dataset for other researchers at NYU. While not all instances of the Data Catalog include local experts, at NYU we include information on researchers who have already worked on a dataset in order to encourage collaboration at the institution. Local experts are institutional researchers with experience using the dataset who agree to help guide researchers as they decide whether a dataset can answer their questions or provide meaningful information.

What the ProPublica article demonstrates (as well as the many articles in PubMed that feature the dataset) is that a single dataset can be used to investigate a wide variety of questions, if the analysis is done properly. For example, while Gosselin uses the dataset to investigate how U.S. workers are pushed out of their jobs and the financial ramifications of this practice, Virginia Chang, a researcher in the College of Global Public Health at NYU, has used it to investigate the effects of obesity on the survival rates of common acute illnesses.

The Data Catalog was designed to increase cross-disciplinary research and collaboration, and Gosselin’s article illustrates how research data can benefit the public when many people with different areas of expertise have access to it.

Data Catalog Collaboration Project receives CTSA Great Team Science Contest Award for Top Importance

What is team science?

Team science is a collaborative effort to address scientific challenges that leverages the strengths and expertise of professionals trained in different fields. One of the overarching goals of the Clinical and Translational Science Awards (CTSA) given to select institutions is to promote team science by establishing mechanisms through which biomedical researchers can collaborate, training researchers in why team science is important, and developing evaluation measures to assess teamwork in biomedical research contexts.

About the award

Last week, the Data Catalog Collaboration Project (DCCP) learned that it had received an award from the CTSA Great Team Science Contest, which asked CTSA-funded hubs to submit examples of team science successes to be evaluated by a review panel and presented at the fall meeting. Each application was scored in a number of categories, including overall score, top importance, top innovation, and top impact. A total of 170 applications were submitted, and the DCCP received the highest score in the Top Importance category. I was able to present the project at the Fall CTSA Program Meeting, where I discussed the value of the data catalog approach with leaders in biomedical translational research. The people I spoke to were most interested in how the data catalog can help them use a single system to make disparate, hard-to-find research datasets, which are stored in various places across their institution, more discoverable.

Expanding our reach beyond libraries

From our perspective, the most exciting part about receiving this award was that our approach was seen as the most important project by a community that extends well beyond the realm of libraries. That approach consists of having libraries implement local data catalogs, establishing collaborations between librarians and developers to improve data discovery, fostering partnerships with our local institutional research initiatives, and making concerted efforts to reduce the barriers to sharing for the research community. This is a considerable achievement because the other submitted projects were very strong and addressed a diverse range of team science initiatives. The DCCP has long been an advocate of ensuring that institutional research data is discoverable, available, and usable regardless of where it is stored, and this award is an acknowledgement that the broader biomedical research community agrees.

The DCCP has grown to 8 libraries in total working to improve institutional data discovery, and this award can serve as evidence of its value to libraries or broader institutions interested in meeting their data discovery needs. The DCCP members all provide a great service to their institutions and to the other libraries participating in this effort. If you are interested in being a part of this effort, please reach out to us.