Workflows

DCCP at MLA '19: Check out our Slides, Notes, Posters and More!

It was an eventful week in Chicago for MLA ’19!

While we wish everyone was able to make it to the conference, we know that isn’t always possible, so we have uploaded all of the slides, posters, and notes related to the DCCP and our work. Below, we have listed a description of each presentation, the slides or poster, and a person to contact if you have any questions.

The DCCP Information Session

Kevin Read presenting at the DCCP Information Session at MLA ’19

  • Provided information about what it means to join the DCCP, implementing the Data Catalog, and how different institutions are using the catalog for their specific needs

  • Link to slides

  • Link to notes

  • Contact: Kevin Read, DCCP Project Lead: kevin.read@med.nyu.edu

Paper presentation: From Conception to Action: Elevating Library Projects through Collaboration between Librarians and Developers

  • Demonstrates how developers and librarians have worked together on the Data Catalog and other library projects, and provides tips on how to improve collaboration between developers and librarians

  • Link to the slides

  • Contact: Ian Lamb, Solutions Developer, ian.lamb@nyulangone.org

Paper presentation: Developing Workflows to Facilitate the Sharing of Electronic Health Record Data

  • Discusses how NYU created a process to include Electronic Health Record (EHR) data in the NYU Data Catalog. Outlines the workflow and provides example records for EHR data in the NYU Data Catalog

  • Link to the slides

  • Contact: Nicole Contaxis, NYU Data Catalog Coordinator: nicole.contaxis@nyulangone.org

Paper presentation: Creating Institution Specific Resources on Data Transfer and Data Sharing

  • Illustrates how NYU supplements its work on the NYU Data Catalog with ongoing projects to help researchers transfer and share their data while remaining in compliance with national regulations, funder and publisher requirements, and institutional policy

  • Link to the slides

  • Contact: Nicole Contaxis, NYU Data Catalog Coordinator: nicole.contaxis@nyulangone.org

Poster: A Multisite Collaboration to Improve Data Curation and Discovery in Academic Health Sciences Centers

  • Provided information on what the Data Catalog Collaboration is, what our goals are, and ways that the Data Catalog is used at participating institutions

  • Contact: Kevin Read, DCCP Project Lead: Kevin.Read@med.nyu.edu

  • Link to the poster

Poster: Outreach Strategies and Researchers’ Motivations for Sharing Data through a Data Catalog

  • Demonstrated why researchers share data through the Data Catalog as well as the outreach strategies employed at different institutions in the DCCP

  • Link to the poster

  • Contact: Melissa A. Ratajeski, Pitt Data Catalog Lead and Coordinator of Data Services at the University of Pittsburgh Health Sciences Library System, mar@pitt.edu

Poster: Using the PubMed Central Data Availability Search Filter and an Institutional Data Catalog to Make Data more Discoverable

  • Illustrates how NYU is using the PubMed Central (PMC) Data Availability Search filter to add new datasets to the NYU Data Catalog. Includes the workflow and an example record

  • Link to the poster

  • Contact: Nicole Contaxis, NYU Data Catalog Coordinator, nicole.contaxis@nyulangone.org

Finding Data To Index: When the Data Availability Statement Leads Nowhere

This blog post is the final part of a series on using the “has data avail” filter on PubMed Central (PMC) to identify a wide range of institutional datasets and what we at NYU learned about our institution’s data sharing practices from this exercise. To learn more about the background of this project and how we pulled the bibliographic data used, please refer to our first post. This post discusses additional findings from the bibliographic data we pulled from PMC.

Unsavory Researcher Behavior

When investigating Data Availability Statements (DAS), we learned how researchers use repositories, use data that is available through application to a consortium, and make their data available in Supporting Information files. Yet we also found several examples of unsavory researcher behavior. Several authors listed the data as available in non-existent repositories. For example, one researcher stated that the data was available at an institutional data access point that does not exist. Other researchers listed the data as available on their lab websites, yet when librarians examined those websites, no data was available.

Uninformed Researcher Behavior

Additionally, other Data Availability Statements (DAS) seemed to demonstrate a lack of understanding of what constitutes “data” and what should be included in a statement. One statement reads, “No datasets were generated or analyzed during the current study,” even though the researchers took samples and analyzed them in the publication. Other statements did not list enough information for a researcher to track down the data described. For example, one stated, “NLM has access to all the data and data are available upon request.” With so little information, it seems unlikely that the data could be located and re-used in a meaningful way.

What Librarians Can Do

While it may be easy to assume that all of these researchers are bad actors, it is also possible that the researchers require more guidance in order to write helpful and meaningful statements. As librarians, we can advocate for better statements by explaining what the DAS is meant to accomplish: guiding other researchers to the data for re-use or replication. While it could be helpful for librarians to develop templates, data varies immensely across disciplines and projects. Explaining the logic of the DAS will allow researchers to work out what information is necessary within the boundaries of their project and their domain.

Finding Data to Index: Data available through a consortium

This blog post is part of a series on using the “has data avail” filter on PubMed Central (PMC) to identify a wide range of institutional datasets and what we at NYU learned about our institution’s data sharing practices from this exercise. To learn more about the background of this project and how we pulled the bibliographic data used, please refer to our first post. This blog post will talk about Data Availability Statements that list the data as available through request to a consortium or a committee.

In the bibliographic data we pulled from PMC, approximately 2% of the journal articles listed the data as available by request through a consortium or a committee. What this means is that another researcher could request the data from a third party that is charged with managing re-use requests for the data. You can see an example of a data availability statement for such a dataset below:

Example Data Availability Statement

While some of the data used in this example is available through the Gene Expression Omnibus (GEO), the authors also used data from the Bipolar Disorders Working Group of the Psychiatric Genomics Consortium. In this situation, the authors applied for access to data from the Psychiatric Genomics Consortium and, through the Data Availability Statement, are instructing readers on where they may obtain the data for re-use. The Psychiatric Genomics Consortium, in turn, provides detailed documentation of the data.

Of the several modes of access described in Data Availability Statements, access through a consortium or committee is one of the rarer options. Many of these datasets are large genomic datasets or datasets generated by city, state, or national entities (e.g., the New York City Department of Education). Because of the size of these datasets and the pre-existing infrastructure for re-use, many of them have already helped researchers generate new research and have resulted in multiple publications. For example, the NYU Data Catalog includes a record for the dataset collected during the Counseling African Americans to Control Hypertension (CAATCH) Trial, which is governed by a Study Oversight Committee. This dataset has already generated at least 6 separate journal articles.

The benefit of cataloging these types of datasets is that they were previously only discoverable through serendipity, or through referral by a colleague. When datasets available through a consortium or committee are indexed in the data catalog, the work that these committees have already done to make their data re-usable will be more visible to the public, and researchers will be more likely to re-use these large and valuable datasets.

Finding Data to Index: Data found in Supporting Information files

This blog post is part of a series on using the “has data avail” filter on PubMed Central (PMC) to identify a wide range of institutional datasets and what we at NYU learned about our institution’s data sharing practices from this exercise. Second in this series, this post discusses articles that state all relevant data are included in the Supporting Information files.

Finding Data to Index: Experimenting with PMC

Since the early days of the Data Catalog, we have experimented with different ways to locate institutional datasets suitable for indexing. Recently, with the help of the folks at the National Library of Medicine (NLM), we were able to create a new workflow for locating data. In a series of blog posts, we will be writing about our experiences using the “has data avail” filter on PubMed Central (PMC) to identify a wide range of institutional datasets as well as what we learned about our institution’s data sharing practices from this exercise.

This series will be split into five blog posts:

  1. This introductory post

  2. An examination of data found in supporting information files

  3. An examination of data available through a consortium

  4. An examination of data available through a repository

  5. Additional findings about our researchers and their data

In April 2018, NLM announced new search filters for PubMed and PMC. The “has data avail” filter allows users to narrow their search to journal articles that have data availability statements. Using that filter, we were able to limit our search to journal articles that included data availability statements and had at least one author affiliated with NYU. Our search strategy is listed below:

has data avail[filter] AND ((nyu langone school of medicine[ad]) OR (new york university langone school of medicine[ad]) OR (langone school of medicine[ad]) OR (New york univ School of Medicine[ad]) OR (Nyu School of Medicine[ad]) OR (New York University School of Medicine[ad]) OR (langone medical center[ad]) OR (nyu medical center[ad]) OR (new york university medical center[ad]) OR (new york university langone[ad]) OR (langone health[ad]) OR (NYU Langone[ad]) OR (nyulmc[ad]) OR (nyumc[ad]) OR (NYU Medical School[ad]) OR (New York University Medical School[ad]) OR (hospital for joint disease[ad]) OR (hospital for joint diseases[ad]) OR (harkness center for dance[ad]))

Once we identified the total number of articles that fit our criteria, our developer (whose other work you can read about here) pulled all 517 results into a spreadsheet for us to review. Together, we worked to identify articles that included viable datasets that could be indexed in the Data Catalog.
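For readers who want to script a similar search, here is a minimal sketch. It is not the script our developer used: it assumes NCBI's public E-utilities `esearch` endpoint queried with `db=pmc`, it includes only a few of the affiliation variants from the search strategy above, and the function names `build_query` and `build_esearch_url` are hypothetical. It only builds the request URL; fetching the results and writing them to a spreadsheet is left out.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# A few of the affiliation variants from the search strategy above.
# The list is abbreviated here for illustration.
AFFILIATIONS = [
    "nyu langone school of medicine",
    "new york university school of medicine",
    "langone health",
    "hospital for joint diseases",
]


def build_query(affiliations):
    """Combine the data availability filter with OR-ed affiliation terms."""
    clauses = " OR ".join(f"({name}[ad])" for name in affiliations)
    return f"has data avail[filter] AND ({clauses})"


def build_esearch_url(query, retmax=1000):
    """Return an esearch URL asking PMC for matching record IDs as JSON."""
    params = {"db": "pmc", "term": query, "retmax": retmax, "retmode": "json"}
    return f"{ESEARCH_URL}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_esearch_url(build_query(AFFILIATIONS)))
```

Keeping the affiliation variants in a list makes it easier to add or remove name forms than editing one long query string by hand.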

Search Results in PMC for articles published by researchers at NYU that also include a data availability statement.

Example of a data availability statement, taken from: Chapman JR, Balasubramanian D, Tam K, Askenazi M, Copin R, Shopsin B, Torres VJ, Ueberheide BM. Using Quantitative Spectrometry to Understand the Influence of Genetics and Nutritional Perturbations on the Virulence Potential of Staphylococcus aureus. Mol Cell Proteomics. 2017 Apr; 16(4 Suppl 1): S15-S28.

The purpose of this exercise was to explore what we could learn about our institution, our researchers, and their data. By examining each data availability statement and investigating the information provided, we were able to categorize data availability statements into four discrete groups:

  • Data available by emailing the author

  • Data available by applying to a consortium

  • Data available through a repository

  • Data available in supporting information files

This approach also allowed us to locate researchers who were not, or were no longer, complying with publishers’ open data requirements.
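We sorted the statements into these groups by reading each one, but a rough first pass could be automated before manual review. The sketch below is an illustration only, not the method we used: the keyword patterns are assumptions rather than a validated coding scheme, and the function name `categorize` is hypothetical. Anything that matches no pattern is flagged for a person to read.

```python
import re

# Illustrative keyword patterns for each group, checked in order.
# Consortium is checked before "email the author" so that a request
# directed to a committee is not mistaken for a request to the author.
CATEGORY_PATTERNS = [
    ("supporting information files",
     r"supporting information|supplementary|within the paper"),
    ("repository",
     r"repository|accession|deposited|doi\.org"),
    ("consortium",
     r"consortium|committee"),
    ("email the author",
     r"corresponding author|upon request|on request|contact"),
]


def categorize(statement):
    """Return the first group whose pattern matches, else flag for review."""
    text = statement.lower()
    for category, pattern in CATEGORY_PATTERNS:
        if re.search(pattern, text):
            return category
    return "needs manual review"


if __name__ == "__main__":
    example = "Data are available from the corresponding author upon request."
    print(categorize(example))
```

A keyword pass like this cannot tell whether a named repository or website actually holds the data, so it would complement, not replace, checking each statement by hand.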

This exploration of the “has data avail” filter demonstrates the heterogeneity of data practices in biomedical research, which in turn demonstrates why the flexibility of the Data Catalog is imperative. By pointing researchers to other resources and not requiring them to upload their data, the Data Catalog can accommodate the wide variety of ways that researchers choose to make their data available.

Cataloging EHR Data: Experiences at NYU Langone Health

The implementation of Electronic Health Record (EHR) systems has allowed researchers to leverage clinical data for research purposes. At NYU Langone Health, researchers are able to work with administrators to pull data from the EHR system and study the patient population of NYU Langone Health as well as the health care services offered here. To assist in this process, the NYU Health Sciences Library began cataloging this data in the NYU Data Catalog.