NIAID Now | October 28, 2021
Reflections on a Year of COVID-19 Data Sharing
NIAID-supported research can help fuel further discovery when data are shared quickly in discoverable repositories, following community standards for metadata. Data sharing enables more rapid and open scrutiny of research results and outcomes and allows data across studies to be easily combined and analyzed.
In 2020, NIAID encouraged its grantees to rapidly share COVID-19 research results. Across disciplines, the Research Data Alliance (RDA) has published recommendations and guidelines for COVID-19 data sharing. Broad adoption of these guidelines led to an unprecedented volume of data being shared, and we have seen additional practices that augment the immense potential of the data:
- Federal agencies like NIH, public consortia, and private entities provide open-access COVID-19 data and computational resources that are freely available to researchers, with no restrictions on re-use. The Registry of Research Data Repositories (re3data) portal provides researchers with a comprehensive catalog of open data repositories for consideration.
- Rapid dissemination of research results prior to peer-reviewed publication, in data repositories or even on preprint servers such as bioRxiv, medRxiv, and arXiv, has been transformative in enabling transparent and collaborative research.
- Access to clinical data through secure repositories such as NIAID’s clinical trial data portal and the NIH National COVID Cohort Collaborative (N3C) data enclave has been critical in providing evidence for COVID-19 treatment guidelines.
- Significant use of social media and discussion forums, such as Virological, a discussion forum for virus molecular evolution and epidemiology, or the SARS-CoV-2 SPHERES Slack channel for discussing genomic epidemiology, can alert the community about data releases and encourage collaboration on SARS-CoV-2 research to maximize the reach of the data.
Following community standards when depositing data (e.g., a community-defined vocabulary for the metadata) is pivotal to addressing scientific and public health questions and to maximizing the impact of SARS-CoV-2 data. Collaborative efforts across scientific domains are ongoing to define minimal as well as optimal metadata and their automated capture. For example, the Public Health Alliance for Genomic Epidemiology (PHA4GE) is working toward a standard for pathogen genomic sequences, the National COVID Cohort Collaborative (N3C) standardizes clinical and electronic medical records data into a re-usable format using an OMOP common data model, and the NCATS OpenData portal shares data and standardized approaches for SARS-CoV-2 assay and animal model data.
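As an illustration of what checking a deposit against a minimal metadata set can look like, here is a small Python sketch. The field names in `REQUIRED_FIELDS` and the sample record are hypothetical and are not taken from PHA4GE, N3C, or any other standard mentioned above; a real community standard defines its own required fields and vocabulary.

```python
# Hypothetical minimal metadata fields, for illustration only.
# A real community standard defines its own list and vocabulary.
REQUIRED_FIELDS = ("sample_id", "collection_date", "geo_location", "host")


def missing_fields(record, required=REQUIRED_FIELDS):
    """Return the required fields that are absent or empty in a metadata record."""
    return [
        field
        for field in required
        if not str(record.get(field, "") or "").strip()
    ]


# Made-up record: geo_location is absent and host is empty.
record = {"sample_id": "S-001", "collection_date": "2021-03-14", "host": ""}
print(missing_fields(record))
```

A check like this could run before submission, so that incomplete records are flagged while the submitter can still supply the missing values.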
Now, over a year into the COVID-19 pandemic, researchers from around the world have contributed over 2.5 million SARS-CoV-2 genomic sequences, 1,371 SARS-CoV-2 protein structures, 315 reagents to the NIAID BEI Resources catalog, 7.3 billion rows of clinical data in the N3C database, and over 125,000 papers about the novel SARS-CoV-2 virus and the pandemic. Despite these great advances, more work remains to take full advantage of the troves of research data available. For example, data sharing is still slow for many data types; genomic sequences from U.S. infections are shared on average 28 days after sample collection. Similarly, too many data are released only as figures in publications or preprints, in non-standardized formats or without metadata, and therefore require significant manual curation and harmonization prior to re-use. Sites like outbreak.info have made great strides in automating data extraction and harmonization, but these advances are not yet applicable to all data types.
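The sharing lag cited above is the time between sample collection and public release. A minimal Python sketch of that arithmetic follows. The date pairs are invented for illustration and are not real submission data, and the function name `sharing_lag_days` is hypothetical.

```python
from datetime import date
from statistics import mean


def sharing_lag_days(collected, shared):
    """Days between sample collection and public release (ISO date strings)."""
    return (date.fromisoformat(shared) - date.fromisoformat(collected)).days


# Invented (collection date, release date) pairs, for illustration only.
records = [
    ("2021-01-04", "2021-01-18"),
    ("2021-01-10", "2021-02-14"),
    ("2021-02-01", "2021-03-01"),
]

lags = [sharing_lag_days(collected, shared) for collected, shared in records]
print(lags, round(mean(lags), 1))
```

Computing this only requires that both dates be recorded in the metadata, which is one reason standardized collection and release dates matter for tracking how quickly data are shared.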
Continued implementation of best practices in data management and sharing will enable even faster public health decisions and accelerate the development of diagnostics, therapeutics, and vaccines in response to emerging health threats.