Librarians practicing what we preach: Making our Library Research Discoverable through the Pitt Data Catalog

Academic and hospital libraries that offer data services often provide guidance and training on data sharing and reuse, covering topics such as:

  • funder/journal requirements for sharing

  • the benefits of data sharing such as enhanced transparency and reproducibility and the potential to find new collaborators

  • proper documentation to accompany a dataset

  • how to identify appropriate data repositories and evaluate them to determine the most suitable

  • locating existing datasets for reuse

In 2017, we at the University of Pittsburgh, Health Sciences Library System, decided it was time that we “practice what we preach” and over the next two years deposited four datasets from our own research into the repository figshare. Our initial goals were to:

  • understand and document the data deposit workflow in order to assist researchers

  • facilitate requests from colleagues to share our data and survey instruments

  • make unpublished results discoverable

  • track the usage of our data

  • model best practices to researchers and librarian colleagues

Given the new data sharing policy for the Journal of the Medical Library Association that will go into effect October 2019, we believe this last goal is of particular importance.

As the University of Pittsburgh is one of the nine partners of the Data Catalog Collaboration Project (DCCP), in addition to depositing our datasets we also created a metadata record for each dataset in the Pitt Data Catalog. Available datasets to date:

These records increase the visibility of our data (one of the mission statements of the DCCP) and provide an additional access point.

This blog post is adapted from the MLA presentation: Ratajeski, M.A. and Iwema, C.L. (2019, May). Practicing What We Preach: Making Our Own Research Data Open Access. Lightning Talk presented at Medical Library Association Annual Conference, Chicago, IL.

DCCP at MLA '19: Check out our Slides, Notes, Posters and More!

It was an eventful week in Chicago for MLA ’19!

While we wish everyone had been able to make it to the conference, we know that isn’t always possible, so we have uploaded all of the slides, posters, and notes related to the DCCP and our work. Below, we list a description of each presentation, the slides or poster, and a person to contact if you have any questions.

The DCCP Information Session

Kevin Read presenting at the DCCP Information Session at MLA ’19

  • Provided information about what it means to join the DCCP, implementing the Data Catalog, and how different institutions are using the catalog for their specific needs

  • Link to slides

  • Link to notes

  • Contact: Kevin Read, DCCP Project Lead:

Paper presentation: From Conception to Action: Elevating Library Projects through Collaboration between Librarians and Developers

  • Demonstrates how developers and librarians have worked together on the Data Catalog, as well as other library projects and provides tips on how to improve developer and librarian collaborations

  • Link to the slides

  • Contact: Ian Lamb, Solutions Developer,

Paper presentation: Developing Workflows to Facilitate the Sharing of Electronic Health Record Data

  • Discusses how NYU created a process to include Electronic Health Record (EHR) data in the NYU Data Catalog. Outlines the workflow and provides example records for EHR data in the NYU Data Catalog

  • Link to the slides

  • Contact: Nicole Contaxis, NYU Data Catalog Coordinator:

Paper presentation: Creating Institution Specific Resources on Data Transfer and Data Sharing

  • Illustrates how NYU supplements their work on the NYU Data Catalog with ongoing projects to help researchers transfer and share their data while still being in compliance with national regulation, funder and publisher requirements, and institutional policy

  • Link to the slides

  • Contact: Nicole Contaxis, NYU Data Catalog Coordinator:

Poster: A Multisite Collaboration to Improve Data Curation and Discovery in Academic Health Sciences Centers


  • Provided information on what the Data Catalog Collaboration is, what our goals are, and ways that the Data Catalog is used at participating institutions

  • Contact: Kevin Read, DCCP Project Lead:

  • Link to the poster

Poster: Outreach Strategies and Researchers’ Motivations for Sharing Data through a Data Catalog

  • Demonstrated why researchers share data through the Data Catalog as well as the outreach strategies employed at different institutions in the DCCP

  • Link to the poster

  • Contact: Melissa A. Ratajeski, Pitt Data Catalog Lead and Coordinator of Data Services at the University of Pittsburgh Health Sciences Library System,

Poster: Using the PubMed Central Data Availability Search Filter and an Institutional Data Catalog to Make Data more Discoverable

  • Illustrates how NYU is using the PubMed Central (PMC) Data Availability Search filter to add new datasets to the NYU Data Catalog. Includes the workflow and an example record

  • Link to the poster

  • Contact: Nicole Contaxis, NYU Data Catalog Coordinator,

The DCCP at MLA 19: Interested in Data Discovery? Come to the Information Session on the Data Catalog Collaboration Project

For all librarians attending MLA in Chicago this year, the DCCP will be hosting another information session. This session is open for anyone to attend, and geared towards librarians who are:

  • Interested in making their institutional research data discoverable

  • Seeking ways to support researchers as they share and discover data

  • Interested in learning about how librarians from many different institutions collaborate to improve the discovery of their institutional data  

When? Sunday May 5 @ 12pm - 12:55pm

Where? Gold Coast Room (West Tower, Concourse/Bronze Level)

During the session we will help others understand what it takes to get a data catalog up and running with a plan for sustainability going forward, provide examples of the ways the DCCP works to index data at our respective institutions, outline the level of support and priorities offered by each DCCP institution, and describe the successes and challenges faced when working in the realm of data discovery.

We would love for the majority of this session to be a discussion with members of the audience. So whether you are interested in implementing the data catalog locally, are working on your own data discovery efforts and want to share your thoughts, or are a librarian seeking to learn more about data discovery, we hope you will join us!

Learning from the Data Curation Network

Last week, I attended the Data Curation Network (DCN) Workshop at Johns Hopkins University in Baltimore, MD. The DCN is a consortium of libraries that work together to help curate research data in order to make that data more Findable, Accessible, Interoperable, and Reusable (FAIR).

Tweet from Cynthia Hudson-Vitale (Head of Research Informatics and Publishing at Penn State) about the workshop

The objectives of the DCN and the DCCP are closely aligned. With their hands on the data, DCN curators work with researchers to ensure that the data put into repositories is FAIR. The DCCP, on the other hand, works with researchers to make their data discoverable, even if that data is not in a repository. In other words, where the DCN deals with data curation as a whole, including cataloging the data, the DCCP focuses primarily on cataloging only. I attended the workshop not just to learn more about data curation myself but also to see what lessons the DCCP can learn from the DCN.

The DCN created and uses a workflow that they call CURATE. This workflow walks curators through the steps of making a dataset FAIR and provides checklists for each step to act as a guide. Because the DCCP works with data that is not necessarily in a repository, not all aspects of the workflow are applicable to our model. However, these checklists provide an excellent resource as we at the DCCP work to improve our data cataloging and provide guidance to new members.

We at the DCCP look forward to reviewing all the resources created by DCN members and working together to ensure that data is FAIR, no matter where it is stored.

Finding Data To Index: When the Data Availability Statement Leads Nowhere

This blog post is the final part of a series on using the “has data avail” filter on PubMed Central (PMC) to identify a wide range of institutional datasets and what we at NYU learned about our institution’s data sharing practices from this exercise. To learn more about the background of this project and how we pulled the bibliographic data used, please refer to our first post. This post discusses additional findings related to the bibliographic data we pulled from PMC.

Unsavory Researcher Behavior

When investigating Data Availability Statements (DAS), we learned how researchers use repositories, use data that is available through application to a consortium, and make their data available in Supporting Information files. Yet we also found several examples of unsavory researcher behavior. Several authors listed the data as available in non-existent repositories. For example, one researcher stated that his data was available at an institutional data access point that does not exist. Other researchers listed the data as available on their lab websites, yet when librarians examined those websites, no data was available.

Uninformed Researcher Behavior

Additionally, other Data Availability Statements (DAS) seemed to demonstrate a lack of understanding of what constitutes “data” and what should be included in a statement. One statement reads, “No datasets were generated or analyzed during the current study,” even though the researchers took samples and analyzed them in the publication. Other DASs did not list enough information for a researcher to track down the data described. For example, one stated, “NLM has access to all the data and data are available upon request.” With so little information, it seems unlikely that the data could be located and re-used in a meaningful way.

What Librarians Can Do

While it may be easy to assume that all of these researchers are bad actors, it is also possible that the researchers require more guidance in order to write helpful and meaningful DASs. As librarians, we can advocate for better DASs by explaining what the DAS is meant to accomplish: guiding other researchers to the data for re-use or replication. While it could be helpful for librarians to develop templates, data varies immensely across disciplines and projects. Explaining the logic behind the DAS will allow researchers to work out what information is necessary within the boundaries of their project and their domain.

Identifying Significant Growth in Data Sharing: Results from the Annual NYU Data Catalog Contributor Survey

The NYU Data Catalog is designed to facilitate data sharing, and with data from our annual surveys in 2018 and 2019, we can now see growth in the number of researchers participating in the NYU Data Catalog and in the number of interactions researchers have around data sharing because of the NYU Data Catalog.

Of those researchers who responded to our 2019 survey (48.2% response rate), 46.3% were contacted at least once about data sharing and the NYU Data Catalog. This represents a marked increase over the percentage of researchers who reported being contacted in 2018 (27.8%). Furthermore, between 2018 and 2019, there was a 51% increase in the number of contributors to the data catalog.
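As a rough check on the figures above, the change in the share of respondents who were contacted can be expressed both in percentage points and as a relative increase. This is a minimal sketch using only the two percentages reported in this post; the function names are illustrative and are not part of any survey tooling we use.

```python
def percentage_point_change(old_pct, new_pct):
    """Absolute difference between two percentages, in percentage points."""
    return new_pct - old_pct


def relative_change(old_pct, new_pct):
    """Change relative to the starting value, expressed as a percentage."""
    return 100 * (new_pct - old_pct) / old_pct


contacted_2018 = 27.8  # % of respondents contacted, 2018 survey
contacted_2019 = 46.3  # % of respondents contacted, 2019 survey

points = round(percentage_point_change(contacted_2018, contacted_2019), 1)
relative = round(relative_change(contacted_2018, contacted_2019), 1)
print(f"{points} percentage points, or about {relative}% relative growth")
```

The 51% growth in contributors is a separate measure based on head counts, which are not reported here, so it cannot be recomputed from these percentages.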

Rubella research. Photograph from the National Library of Medicine Digital Collections, UID: 101541114. Available at:

Researchers who were surveyed either serve as local experts on external datasets, like the New York City Community Health Survey, or have contributed datasets that are a product of their original research, like Dr. Scott Sherman’s CHART New York Smoking-Cessation Interventions for Urban Hospital Patients Dataset. The annual NYU Data Catalog Contributor survey allows us to gain a better understanding of how researchers are using the catalog and sharing their data, thus providing a way to measure change in data sharing practices over time.

The annual surveys ask five questions:

  • Have you generated new datasets this year?

  • Are you willing to have the new datasets described in the NYU Data Catalog?

  • Are there any changes or modifications to the datasets already described in the NYU Data Catalog?

  • Please briefly describe those changes to your dataset(s).

  • How many times have you been contacted by people asking about a dataset in the NYU Data Catalog?

In a later blog post, we will discuss other data points and new questions that were added to the survey in 2019 to help us better understand researcher data sharing practices.

Finding Data to Index: Data available in a repository

This blog post is part of a series on using the “has data avail” filter on PubMed Central (PMC) to identify a wide range of institutional datasets and what we at NYU learned about our institution’s data sharing practices from this exercise. To learn more about the background of this project and how we pulled the bibliographic data used, please refer to our first post. In this blog post, we will talk about Data Availability Statements that list the data as available through a repository.

Out of the 327 Data Availability Statements we reviewed, 87 listed the data as available in a repository. This means that these publications either deposited the data that the authors collected or that the authors used data that was already available in a repository. Eight of the statements indicated the data is stored in figshare, five in GitHub, four in the Open Science Framework (OSF), four in Protein Data Bank (PDB), four in Dryad, three in Dataverse, and 59 in an NCBI database, like GEO. For example, the data used in the article, “Lymphatic endothelial S1P promotes naive T cell mitochondrial function and survival,” has been deposited as GEO dataset GSE97249.
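The tally above can be double-checked with a few lines of Python. This is only a sketch for verifying the arithmetic; the counts are the ones reported in this post, and the dictionary and function names are our own labels rather than anything from the catalog software.

```python
# Counts of Data Availability Statements by repository, as reported above.
repository_counts = {
    "NCBI databases (e.g., GEO)": 59,
    "figshare": 8,
    "GitHub": 5,
    "Open Science Framework (OSF)": 4,
    "Protein Data Bank (PDB)": 4,
    "Dryad": 4,
    "Dataverse": 3,
}
TOTAL_REVIEWED = 327  # all statements reviewed


def share(count, total):
    """Percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)


total_in_repository = sum(repository_counts.values())
print(f"{total_in_repository} of {TOTAL_REVIEWED} statements "
      f"({share(total_in_repository, TOTAL_REVIEWED)}%) point to a repository")

for name, count in repository_counts.items():
    print(f"{name}: {count} ({share(count, total_in_repository)}% of repository statements)")
```

Run as written, the per-repository counts sum to the 87 statements mentioned above, and NCBI databases account for roughly two thirds of them.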


For certain statements, the data in the repository was only a portion of the data used in the article. For example, in the Data Availability Statement included below, only the behavioral data and the electrophysiological data are available through figshare. The raw electrophysiological data is only available on request to the corresponding author. We included this in our tally of datasets shared in a repository even though it did not include all of the data.


While these datasets ought to be easy to catalog and make more discoverable through the NYU Data Catalog, the team ran into a significant challenge when looking through these 87 data availability statements. As we experienced with data that was made available in the Supporting Information files, it can be difficult to determine whether or not the data deposited in the repository is sufficient for other researchers to use. If, for example, a researcher only deposits aggregate data, is it worthwhile for the team at the NYU Health Sciences Library to catalog?

As with the data made available in the Supporting Information files, we address this concern on a case-by-case basis. We look at the deposited data to ensure that there is more than aggregate data, that additional context like a codebook or readme file is included, and that the data corresponds to what is described in the methods section of the publication. While we do not have the expertise to ensure that the data is accurate and adequate for re-use in all disciplines, we do ensure that the dataset passes our baseline assessment before including it in the NYU Data Catalog.
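The three checks described above could be written down as a small checklist. The sketch below is hypothetical: the class, field names, and example deposit are made up for illustration, and in practice each answer comes from a librarian looking at the deposit by hand, not from an automated test.

```python
from dataclasses import dataclass


@dataclass
class DepositedDataset:
    """Hypothetical summary of what a librarian found in a repository deposit."""
    title: str
    has_non_aggregate_data: bool   # more than summary tables or totals
    has_context_files: bool        # e.g., a codebook or readme
    matches_methods_section: bool  # data corresponds to the published methods


# Field name -> plain-language description of the baseline check.
CHECKS = {
    "has_non_aggregate_data": "contains more than aggregate data",
    "has_context_files": "includes a codebook or readme",
    "matches_methods_section": "corresponds to the methods section",
}


def failed_checks(dataset):
    """Return descriptions of the baseline checks this deposit does not meet."""
    return [label for field, label in CHECKS.items() if not getattr(dataset, field)]


def passes_baseline(dataset):
    """A deposit is a candidate for cataloging only if every check passes."""
    return not failed_checks(dataset)


example = DepositedDataset(
    title="Example deposit (made up for illustration)",
    has_non_aggregate_data=True,
    has_context_files=False,
    matches_methods_section=True,
)
print(passes_baseline(example), failed_checks(example))
```

Recording which check failed, not just a yes or no, gives the team something concrete to raise with the researcher when a deposit falls short.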

Regardless of these challenges, the value of including these datasets in an institutional data catalog is apparent. Cataloging the dataset allows researchers to search for the data itself, as opposed to trying to find the data through the article. It allows researchers to search for data as the primary resource, thereby providing a clearer path to access. Furthermore, researchers may not know to search every data repository when looking for data to re-use; they may not even know of the existence of each of the repositories. By including a description of the dataset, access information, and a link to the repository in the data catalog, we remove this burden from researchers looking to access data and help researchers who deposited data reach more of their colleagues.

Cataloging Software and 3D Models in the Pitt Data Catalog

When the Health Sciences Library System at the University of Pittsburgh launched the Pitt Data Catalog last spring, we wanted to provide researchers with flexible options for advertising and sharing their data. Now that the catalog has grown to describe more than 20 Pitt-created datasets, that flexibility has led our collection development in surprising and exciting directions. We have recently added our first records describing software code and 3D models, all created by Dr. Charles C. Horn.

Dr. Horn is an associate professor of medicine who studies gut-brain communication, particularly via the vagus nerve. His research makes use of several open-source software packages, which he demonstrates in his paper (with David M. Rosenberg) “Neurophysiological analytics for all! Free open-source software tools for documenting, analyzing, visualizing, and sharing using electronic notebooks.” Electrophysiological data used to demonstrate the software tools are available in the publication’s data supplements and on GitHub, where Dr. Horn has also uploaded scripts and a Docker image containing tools to make neurophysiological data analysis easier. Pitt Data Catalog records linking to those software/data packages include:

Dr. Horn has also designed several printable 3D models for experimental apparatuses in electrophysiology. The files shared through the NIH 3D Print Exchange include printable files in a variety of formats, photos, and assembly instructions. The 3D model records in the Pitt Data Catalog are:

From a collections standpoint, expanding our catalog to include software and 3D models is a logical consequence of our mission to collect Pitt-authored data, especially in computational fields where relatively few data products fit the definition of a traditional “dataset.” So far, the DCCP’s metadata schema has proven flexible enough to accommodate these new entity types, but we may pursue some software-specific modifications if the need arises. Shortly after Pitt published these records, NYU added their own first software record, so this may be the beginning of a collaboration-wide trend or a new working group, similar to the DCCP Basic Science Working Group.

Finding Data to Index: Data available through a consortium

This blog post is part of a series on using the “has data avail” filter on PubMed Central (PMC) to identify a wide range of institutional datasets and what we at NYU learned about our institution’s data sharing practices from this exercise. To learn more about the background of this project and how we pulled the bibliographic data used, please refer to our first post. This blog post will talk about Data Availability Statements that list the data as available through request to a consortium or a committee.

In the bibliographic data we pulled from PMC, approximately 2% of the journal articles listed the data as available by request through a consortium or a committee. What this means is that another researcher could request the data from a third party that is charged with managing re-use requests for the data. You can see an example of a data availability statement for such a dataset below:

Example Data Availability Statement

While some of the data used in this example is available through the Gene Expression Omnibus (GEO), additional data from the Bipolar Disorders Working Group of the Psychiatric Genomics Consortium was used. In this situation, the authors applied for access to data from the Psychiatric Genomics Consortium and, through the Data Availability Statement, are instructing readers on where they may obtain the data for re-use. The Psychiatric Genomics Consortium, in turn, provides detailed documentation of the data.

Of the several modes of access described in Data Availability Statements, data that is available through a consortium or committee is one of the rarer options. Many of these datasets are large genomic datasets or datasets generated by city, state, or national entities (e.g., the New York City Department of Education). Because of the size of the datasets and the pre-existing infrastructure for re-use, many of them have already helped researchers generate new research and have resulted in multiple publications. For example, the NYU Data Catalog includes a record for the dataset collected during the Counseling African Americans to Control Hypertension (CAATCH) Trial, which is governed by a Study Oversight Committee. This dataset has already generated at least 6 separate journal articles.

Cataloging these types of datasets is beneficial because they were previously discoverable only through serendipity or through referral by a colleague. When datasets available through a consortium or committee are indexed in the data catalog, the work that these committees have already done to make their data re-usable will be more visible to the public, and researchers will be more likely to re-use these large and valuable datasets.

Harlem Health Advocacy Partners and a Case Study in Data Re-Use

In the fall of this year, a Research and Data Librarian at the NYU Health Sciences Library, Fred LaPolla, was brought in to help teach an Intensive Research Practicum for Primary Care Residents. Dr. Colleen Gillespie, the Director of the Division of Education Quality in the Institute for Innovations in Medical Education and an Associate Professor in the Department of Medicine, led the practicum and wanted residents to ask a question of a secondary dataset, analyze the data, present the results, and write up a draft of a manuscript in 10 days. Prior to the beginning of the practicum, LaPolla pointed Dr. Gillespie to the NYU Data Catalog, and she was able to contact Dr. Lorna Thorpe about the Harlem Health Advocacy Partners Data Set.

“West 125th Street looking west from Seventh Avenue, Harlem, New York City” From the Schomburg Center for Research in Black Culture, Photographs, and Prints Division, The New York Public Library. 1946.

The Harlem Health Advocacy Partners (HHAP) dataset was collected in five public housing developments in Harlem, New York City, where the chronic disease burden is high. Two rounds of data collection were performed: first, a telephone survey of 1,633 individuals and second, an interventional study of 370 individuals. The variables collected through these two rounds included age, gender, race/ethnicity, employment status, health insurance, self-reported general health, self-reported mental health, level of physical activity, smoker status, BMI, blood pressure, level of social connectedness, and specific health conditions including asthma, diabetes, hypertension, and depression. Previous articles published with this data include “A Place-Based Community Health Worker Program: Feasibility and Early Outcomes, New York City, 2015,” published in the American Journal of Preventive Medicine.

After completing the practicum, the residents worked together with Dr. Gillespie, Dr. Thorpe, and Mr. LaPolla to submit the manuscript for publication as co-authors. This case study in data re-use illustrates how the NYU Data Catalog fits into the data ecosystem, bridging connections between researchers and helping people locate relevant datasets. It also illustrates how important data re-use can be to young researchers and students, as it can provide access to data without the high cost of collecting or purchasing it themselves.

Data in the News: ProPublica and the U.S. Health and Retirement Study

As the year winds down and we all recover from the busy holiday season, ProPublica has published an article on the ways in which employers push older U.S. workers out of their jobs. The article, “If You’re Over 50, Chances Are The Decision to Leave a Job Won’t be Yours,” by Peter Gosselin, uses data from the U.S. Health and Retirement Study (HRS) from the University of Michigan. Gosselin refers to HRS as the “premier source of quantitative information about aging in America,” as it provides longitudinal data about 20,000 people in the United States aged 50 and older.

The NYU Data Catalog includes datasets collected outside of NYU (e.g., by the U.S. Census Bureau or by other universities) in order to help researchers locate datasets that they may not otherwise know about. The HRS is one of the external datasets included in the NYU Data Catalog, and two faculty members act as local experts on the dataset for other researchers at NYU. While not all instances of the Data Catalog include local experts, at NYU we include information on researchers who have already worked on a dataset in order to encourage collaboration at the institution. Local experts are institutional researchers with experience using the dataset who agree to help guide researchers as they decide whether a dataset can answer their questions or provide meaningful information.

What the ProPublica article demonstrates (as well as the many articles in PubMed that feature the dataset) is that a single dataset can be used to investigate a wide variety of questions, if the analysis is done properly. For example, while Gosselin uses the dataset to investigate how U.S. workers are pushed out of their jobs and the financial ramifications of this practice, Virginia Chang, a researcher in the College of Global Public Health at NYU, has used it to investigate the effects of obesity on the survival rates of common acute illnesses.

The Data Catalog was designed to increase cross-disciplinary research and collaboration, and Gosselin’s article illustrates how research data can benefit the public when many people with different areas of expertise have access to it.

Finding Data to Index: Data found in Supporting Information files

This blog post is part of a series on using the “has data avail” filter on PubMed Central (PMC) to identify a wide range of institutional datasets and what we at NYU learned about our institution’s data sharing practices from this exercise. Second in this series, this post discusses articles that state all relevant data are included in the Supporting Information files.

Finding Data to Index: Experimenting with PMC

Since the early days of the Data Catalog, we have experimented with different ways to locate institutional datasets suitable for indexing. Recently, with the help of the folks at the National Library of Medicine (NLM), we were able to create a new workflow for locating data. In a series of blog posts, we will be writing about our experiences using the “has data avail” filter on PubMed Central (PMC) to identify a wide range of institutional datasets as well as what we learned about our institution’s data sharing practices from this exercise.

This series will be split into five blog posts:

  1. This introductory post

  2. An examination of data found in supporting information files

  3. An examination of data available through a consortium

  4. An examination of data available through a repository

  5. Additional findings about our researchers and their data

In April 2018, NLM announced new search filters for PubMed and PMC. The “has data avail” filter allows users to narrow their search to journal articles that have data availability statements. Using that filter, we were able to limit our search to journal articles that included data availability statements and had at least one author affiliated with NYU. Our search strategy is listed below:

has data avail[filter] AND ((nyu langone school of medicine[ad]) OR (new york university langone school of medicine[ad]) OR (langone school of medicine[ad]) OR (New york univ School of Medicine[ad]) OR (Nyu School of Medicine[ad]) OR (New York University School of Medicine[ad]) OR (langone medical center[ad]) OR (nyu medical center[ad]) OR (new york university medical center[ad]) OR (new york university langone[ad]) OR (langone health[ad]) OR (NYU Langone[ad]) OR (nyulmc[ad]) OR (nyumc[ad]) OR (NYU Medical School[ad]) OR (New York University Medical School[ad]) OR (hospital for joint disease[ad]) OR (hospital for joint diseases[ad]) OR (harkness center for dance[ad]))
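A search string like the one above is easier to maintain when it is assembled from a list of affiliation variants. The sketch below shows one way this might be done in Python; it is not the script our developer used. It builds the query and a URL for the NCBI E-utilities ESearch endpoint, which we assume here as the retrieval route, and it sends no request. The affiliation list is shortened for readability, and the function names are hypothetical.

```python
from urllib.parse import urlencode

# A few of the affiliation variants from the search strategy above (shortened).
AFFILIATIONS = [
    "nyu langone school of medicine",
    "new york university school of medicine",
    "langone health",
    "hospital for joint diseases",
]


def build_term(affiliations, data_filter="has data avail[filter]"):
    """Combine the filter with OR-ed affiliation clauses, each tagged [ad]."""
    clauses = " OR ".join(f"({name}[ad])" for name in affiliations)
    return f"{data_filter} AND ({clauses})"


def build_esearch_url(term, retmax=20):
    """Build an ESearch URL against the PMC database (nothing is sent)."""
    base = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
    params = {"db": "pmc", "term": term, "retmode": "json", "retmax": retmax}
    return f"{base}?{urlencode(params)}"


term = build_term(AFFILIATIONS)
url = build_esearch_url(term)
print(term)
print(url)
```

Keeping the variants in a list means a newly discovered spelling of the institution's name can be added in one place, and the same query can be re-run later to pick up new articles.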

Once we identified the total number of articles that fit our criteria, our developer (whose other work you can read about here) pulled all 517 results into a spreadsheet for us to review. Together, we worked to identify articles that included viable datasets that could be indexed in the Data Catalog.

Search Results in PMC for articles published by researchers at NYU that also include a data availability statement.

Example of a data availability statement, taken from: Chapman JR, Balasubramanian D, Tam K, Askenazi M, Copin R, Shopsin B, Torres VJ, Ueberheide BM. Using Quantitative Spectrometry to Understand the Influence of Genetics and Nutritional Perturbations on the Virulence Potential of Staphylococcus aureus. Mol Cell Proteomics. 2017 Apr; 16(4 Suppl 1): S15-S28.

The purpose of this exercise was to explore what we could learn about our institution, our researchers, and their data. By examining each data availability statement and investigating the information provided, we were able to categorize data availability statements into four discrete groups:

  • Data available by emailing the author

  • Data available by applying to a consortium

  • Data available through a repository

  • Data available in supporting information files

This approach also allowed us to locate researchers who had not complied, or were no longer complying, with publishers’ open data requirements.
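We sorted statements into the four groups listed above by reading each one. For a larger set, a simple keyword pass could suggest likely groups before a person reviews them. The sketch below is a hypothetical first-pass heuristic, not the method we used: the keyword lists are guesses, the first two example statements are made up, and the third is one quoted elsewhere in this series. A statement can match more than one group, and anything with no match is left for manual review.

```python
# Group name -> phrases that hint at it (illustrative guesses, not a validated list).
CATEGORY_KEYWORDS = [
    ("supporting information files",
     ("supporting information", "supplementary", "within the paper")),
    ("repository",
     ("figshare", "github", "dryad", "dataverse", "open science framework",
      "gene expression omnibus", "accession", "repository")),
    ("consortium", ("consortium", "committee")),
    ("emailing the author", ("upon request", "on request", "corresponding author")),
]


def suggest_categories(statement):
    """Return every group whose keywords appear in the statement (may be empty)."""
    text = statement.lower()
    return [
        category
        for category, keywords in CATEGORY_KEYWORDS
        if any(keyword in text for keyword in keywords)
    ]


examples = [
    "All relevant data are within the paper and its Supporting Information files.",
    "Data are deposited in figshare; raw recordings are available on request "
    "to the corresponding author.",
    "No datasets were generated or analyzed during the current study.",
]
for statement in examples:
    print(suggest_categories(statement) or ["needs manual review"])
```

Returning a list rather than a single label reflects what we saw in practice: some statements describe part of the data as deposited and the rest as available only on request.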

This exploration of the “has data avail” filter demonstrates the heterogeneity of data practices in biomedical research, which in turn demonstrates why the flexibility of the Data Catalog is imperative. By pointing researchers to other resources and not requiring them to upload their data, the Data Catalog can accommodate the wide variety of ways that researchers choose to make their data available.

Data Catalog Collaboration Project receives CTSA Great Team Science Contest Award for Top Importance

What is team science?

Team science is a collaborative effort to address scientific challenges that leverages the strengths and expertise of professionals trained in different fields. One of the overarching goals of the Clinical and Translational Science Awards (CTSA) given to select institutions is to promote team science by establishing mechanisms through which biomedical researchers can collaborate, learn why team science is important, and develop evaluation measures to assess teamwork in biomedical research contexts.

about the award

Last week, the Data Catalog Collaboration Project (DCCP) found out that it had received an award from the CTSA Great Team Science Contest, which asked CTSA-funded hubs to submit examples of team science successes to be evaluated by a review panel and presented at the fall meeting. Each application was scored in a number of categories: overall score, top importance, top innovation, and top impact, among others. 170 applications were submitted, and the DCCP received the highest score for the Top Importance category. I was able to present the topic at the Fall CTSA Program Meeting, where I could discuss the value of the data catalog approach with leaders in biomedical translational research. The people I spoke to were most interested in how the data catalog can help them use a single system to make disparate, hard-to-find research datasets, which are spread out and stored in various places across their institution, more discoverable.

Expanding our reach beyond libraries

From our perspective, the most exciting part about receiving this award was that a community extending well beyond the realm of libraries judged our approach to be the most important project. That approach involves having libraries implement local data catalogs, establishing collaborations between librarians and developers to improve data discovery, fostering partnerships with our local institutional research initiatives, and making concerted efforts to lower the barriers to data sharing for the research community. This is a considerable achievement because the other submitted projects were very strong and addressed a diverse range of team science initiatives. The DCCP has long advocated for ensuring that institutional research data is discoverable, available, and usable regardless of where it is stored, and this award is an acknowledgement that the broader biomedical research community agrees.

The DCCP has grown to 8 libraries in total working to improve institutional data discovery, and this award can serve as evidence of its value to libraries or broader institutions interested in improving data discovery. The DCCP members all provide a great service to their institutions and to the other libraries participating in this effort. If you are interested in being a part of this effort, please reach out to us.

An Unexpected Discovery

Since the launch of the UMB Data Catalog one year ago, the HS/HSL’s DC team has worked to create records reflecting the diversity of studies undertaken by researchers in the University’s schools of medicine, nursing, pharmacy, dentistry, and social work.  As a result, the datasets curated thus far range from foster parent experiences with Maryland’s court process to a collection of pharmaceutical clinical study reports on neuraminidase inhibitors for treating influenza to pre-ART HIV care outcomes in adults in Kenya and Tanzania. Working with our researchers has been a rewarding experience which, on occasion, has led to unexpected discoveries.

Recently, through correspondence with a faculty member in the school of medicine, we learned of the existence of a highly specialized resource. Since 1993, researchers in the SOM’s Division of Endocrinology, Diabetes and Nutrition have been studying the Old Order Amish (OOA) community in Lancaster County, Pennsylvania. The current population of approximately 35,000 individuals is directly descended from a small number of Anabaptist Christian founder families who immigrated to the United States in the late 1700s. Their simple lifestyle and common lineage are ideally suited for epidemiological, genetic, and other health-related investigations. The Amish Complex Genetic Disease Database is the result of over 2 decades of research associated with this distinctive population. Data has been compiled from over 7,000 volunteers who have participated in one or more studies in a variety of investigational areas including diabetes, longevity, whole genome sequencing, blood pressure, and osteoporosis. Funded by a series of grants, work with the Amish over the last 20+ years has resulted in more than 200 papers, the majority of which utilize the information accumulated in the database.

This is a prime example of a unique resource, the significance of which is now highlighted by the metadata in its descriptive record. Through the UMB Data Catalog, the visibility of the Amish Complex Genetic Disease Database has been increased, providing opportunities for data re-use and future collaborations.

Cataloging EHR Data: Experiences at NYU Langone Health

The implementation of Electronic Health Record (EHR) systems has allowed researchers to leverage clinical data for research purposes. At NYU Langone Health, researchers are able to work with administrators to pull data from the EHR system and study the patient population of NYU Langone Health as well as the health care services offered here. To assist in this process, the NYU Health Sciences Library began cataloging this data in the NYU Data Catalog.

Making Connections: Google Dataset Search, the Data Catalog, and Linked Open Data

Guest post by Ian Lamb, Senior Developer at NYU Health Sciences Library

Early in the development of the data catalog at NYU, one of our team members floated the idea of publishing the dataset records in a Linked Open Data (LOD) format. As the developer on the project, the implementation of this idea would fall to me, so I did what every good developer does in this situation: pretend to know what’s going on, and then Google it when the meeting’s over.

At the time, LOD had not seen much uptake in the tech world, which would explain my ignorance, but libraries, museums, and other institutions were interested in the idea. The phrase refers to data that is “linked” to other data, in a machine-readable way, so that machines can begin to understand what our webpages are really about. For instance, if we have a dataset that was published by the US Department of Energy, we probably already state that in a human-readable way, i.e. in a sentence on the page. But it takes a great deal of effort for a computer to figure out what that means, requiring the use of natural language processing and other artificial intelligence algorithms. LOD allows us to explain this relationship to a computer in a much more direct fashion.

Using a standardized data interchange format (such as JSON-LD), along with a formal schema, we can explicitly link this dataset to the Department of Energy. And when this dataset becomes linked to the US Department of Energy, it also becomes linked to a lot of other information – where the Department of Energy headquarters is, who runs it, other types of research it may produce, where its funding comes from, etc. A machine can follow these relationships and start to form a picture of what the Department of Energy really is, and how our dataset fits into the big picture.
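
To make the Department of Energy example concrete, here is a minimal sketch in Python of what such an explicit link might look like. It assumes the schema.org vocabulary (the source of property names such as isPartOf and encodingFormat that appear later in this post); the dataset name is a hypothetical placeholder, and the choice of the publisher property and the GovernmentOrganization type is an illustrative assumption, not a description of what the catalog actually emits.

```python
import json

# Hypothetical dataset record. The dataset name is a placeholder, and the
# use of "publisher" / "GovernmentOrganization" is an assumption made for
# illustration; only the Department of Energy example comes from the post.
dataset = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Hypothetical energy dataset",
    "publisher": {
        "@type": "GovernmentOrganization",
        "name": "US Department of Energy",
        "url": "https://www.energy.gov",
    },
}


def to_jsonld(record):
    """Serialize a record to a JSON-LD string suitable for embedding in a page."""
    return json.dumps(record, indent=2)


print(to_jsonld(dataset))
```

The point of the nested publisher object is that the relationship is stated as data rather than as a sentence, so a machine does not need natural language processing to find it.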

This network of structured, linked data underpins the Semantic Web – a proposed new evolution of the web in which the whole internet would be imbued with meaning in this way. The Semantic Web was first proposed in 2001 in an article by Tim Berners-Lee and colleagues, and the phrase Linked Open Data followed a few years later. But while these concepts have been around for decades, there are still relatively few companies or institutions that publish LOD, and even fewer that utilize it. The Wikipedia article on the Semantic Web has a whole section explaining some reasons for this. Indeed, the most mainstream use of JSON-LD up to this point has been to improve search engine rankings and alter the appearance of a search result in Google.

So imagine my surprise recently, when Google quietly rolled out a search engine for datasets that relies almost entirely on data they’ve harvested from repositories and data catalogs using JSON-LD. Apparently Google has not given up on linked data. They’re just taking their time with it.

Dataset Search, as it is humbly called, is still in beta, and there do appear to be some quirks in the metadata it displays. Take the Asian American Partnership in Research and Empowerment dataset, for example. Here is the record as it appears on our data catalog, and here it is in Google’s search. To figure out why the search result appears so sparse, I fed it into Google’s Structured Data Testing Tool, which shows us the dataset as Google sees it based on our linked data output. (The JSON-LD itself is embedded in the page source, and you can see it by scrolling the left-hand frame down to line 39.)
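
If you would rather inspect the embedded JSON-LD yourself than scroll through a testing tool, a rough sketch like the following can pull it out of a page's source. It assumes the JSON-LD sits in a script tag with type "application/ld+json", which is the usual embedding convention; the sample page and the names JsonLdExtractor and extract_jsonld are hypothetical and exist only for this illustration.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collects the parsed contents of every <script type="application/ld+json"> tag."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._chunks = []
        self.records = []

    def handle_starttag(self, tag, attrs):
        # Start buffering when a JSON-LD script tag opens.
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._chunks = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        # Parse the buffered text once the script tag closes.
        if tag == "script" and self._in_jsonld:
            self.records.append(json.loads("".join(self._chunks)))
            self._in_jsonld = False


def extract_jsonld(html):
    """Return a list of JSON-LD objects found in an HTML string."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.records


# Hypothetical page source, standing in for a real catalog record page.
sample_page = """<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Dataset", "name": "Hypothetical dataset"}
</script>
</head><body><h1>Hypothetical dataset</h1></body></html>"""

records = extract_jsonld(sample_page)
```

This only uses the Python standard library; a real page would be fetched first and passed in as a string.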

The tool shows that Google is receiving lots of metadata about the dataset itself, including authors, keywords, associated citations, file formats, and more. It also shows the various relationships we’re specifying for it: the dataset record isPartOf a data catalog, the provider of which is the NYU Health Sciences Library, which has a parentOrganization of the NYU School of Medicine, and the url for that entity is…etc. But out of all this data, only the authors and the description currently appear on Google’s search result. There is a link to some citations there, but they’re not the ones we’ve specified – they appear to simply be searching for the name of the dataset in Google Scholar. Similarly, Google lists the dataset as provided by New York University, which, while correct in a sense, is not the entity we’ve specified as the provider or even the parent organization of the provider.
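
The chain of relationships described above can be sketched as a nested structure that a machine walks one link at a time. The property names (isPartOf, provider, parentOrganization) and the entity names come from this post; the "@type" values and the helper relationship_chain are assumptions added for illustration, and the real record contains many more fields.

```python
# Simplified, partly hypothetical version of the record: the "@type" values
# are assumed, and most of the real metadata is omitted.
record = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Asian American Partnership in Research and Empowerment",
    "isPartOf": {
        "@type": "DataCatalog",
        "name": "NYU Data Catalog",
        "provider": {
            "@type": "Organization",
            "name": "NYU Health Sciences Library",
            "parentOrganization": {
                "@type": "Organization",
                "name": "NYU School of Medicine",
            },
        },
    },
}


def relationship_chain(node, path=("isPartOf", "provider", "parentOrganization")):
    """Follow each property in `path` in turn and collect the entity names."""
    names = [node["name"]]
    for prop in path:
        node = node.get(prop)
        if node is None:
            break  # the chain ends early if a link is missing
        names.append(node["name"])
    return names


chain = relationship_chain(record)
```

Walking the chain this way yields the dataset, then the catalog, then the library, then the school, which is the same path a crawler would have to follow to decide who the provider is.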

Some of this discrepancy is because we’ve chosen slightly different field names than Google has. For example, we are listing the file format as an encodingFormat attached to the main dataset entity, but using another Google tool, I’ve found that they expect any encodingFormats to be listed under a distribution, which would look like this:

{
  "name": "Asian American Partnership in Research and Empowerment",
  "description": "Project AsPIRE (Asian American Partnership in Research and Empowerment) was a community-based…",
  "distribution": {
    "@type": "DataDownload",
    "encodingFormat": "SPSS"
  }
}


Other fields, such as Keywords, don’t appear to be supported at all. At this stage we can’t know if they’re going to be adding any additional fields or how much of this is going to change, but we have identified a few small alterations we can make to our JSON-LD output that should improve our results.
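
As an illustration of the kind of small alteration involved, the encodingFormat change described above could be sketched as a transformation over the record before it is serialized. This is a hedged sketch, not the catalog's actual code: the function name move_encoding_format is hypothetical, and it assumes the record is held as a plain Python dictionary.

```python
import copy


def move_encoding_format(record):
    """Return a copy of `record` with any top-level encodingFormat values
    nested under `distribution` as DataDownload entries."""
    updated = copy.deepcopy(record)
    formats = updated.pop("encodingFormat", None)
    if formats is None:
        return updated  # nothing to move
    if isinstance(formats, str):
        formats = [formats]
    downloads = [{"@type": "DataDownload", "encodingFormat": f} for f in formats]
    existing = updated.get("distribution", [])
    if isinstance(existing, dict):
        existing = [existing]
    combined = existing + downloads
    # Emit a single object when there is one entry, otherwise a list.
    updated["distribution"] = combined[0] if len(combined) == 1 else combined
    return updated


# Hypothetical "before" record with the format on the main dataset entity.
before = {
    "@type": "Dataset",
    "name": "Asian American Partnership in Research and Empowerment",
    "encodingFormat": "SPSS",
}
after = move_encoding_format(before)
```

Working on a deep copy keeps the original record untouched, which makes it easy to compare the two outputs in a validation tool.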

So there are some kinks to be worked out for sure, and it remains to be seen how much Google is going to invest in this effort and how aggressively they plan to publicize it. But it does seem to represent a significant move into this space. And if it does pick up steam, there is a very substantial upside for any institution using our catalog code: instant inclusion in a worldwide, highly visible dataset search engine.