NIAID has made a significant investment in genomic-related activities that provide comprehensive genomic, functional genomics, bioinformatics, structural genomics, proteomics and integrated "omics" data sets, resources and reagents to the scientific community for basic and applied research in infectious diseases. This wealth of genomics and other data sets, as well as the availability of the human genome, provides a valuable and critical resource for the scientific community.
This document provides general guiding principles and specific guidelines for preparing and establishing consistent data release plans across NIAID/DMID Omics Centers, including the Genomic Centers for Infectious Diseases (GCID), the Systems Biology for Infectious Diseases Program, and the Structural Genomics and Functional Genomics Centers, as well as other NIAID-funded large-scale centers and projects. These guiding principles and guidelines are consistent with the NIH data sharing guidelines and the NIH Genomic Data Sharing Policy, but underline the expectation that both human and non-human data be released on a timeline that is consistent with NIAID’s dual mandate to support basic and clinical research and to respond to public health emergencies. These guidelines are also consistent with contemporary principles, such as the F.A.I.R. (Findable, Accessible, Interoperable, and Reusable) standards for data release.
NIAID acknowledges that projects among the centers are diverse, and therefore, considers it of the highest importance to develop flexible and reasonable guidelines that achieve rapid data release and yet are sensitive to the aims of the centers and their individual projects. Continued discussions of the data release guidelines will be an ongoing function of the centers, Scientific Working Groups, scientific community, and NIAID. Modifications to NIAID Data Release Guidelines and data release plans may be needed because of these discussions and other data release guidelines developed by the National Institutes of Health (NIH).
Rapid and unrestricted sharing of data and research resources is essential for advancing research on human health and infectious diseases. The utility of data and resources to the scientific community depends largely on how quickly the data are deposited into public databases and on whether the data are easy to find, accessible, and reusable by others. NIAID is committed to rapid release of experimental data, including genomic and other large-scale data types, and recognizes that clinical data and other metadata associated with the genomic, omics, and other data are valuable research resources. For these reasons, NIAID endorses rapid release of all these data sets and anticipates that data generated will be made freely available via deposition into publicly accessible and searchable international databases, such as GenBank at the National Center for Biotechnology Information (NCBI), and into NIAID-funded databases such as the DMID Bioinformatics Resource Centers (BRC) or other databases designated and approved by NIAID.
In turn, users of any released data are expected to act responsibly to recognize the scientific contribution of the data generators/producers by following fair use of unpublished data and normal standards of scientific etiquette. Such guidelines can be found in Sharing Data from Large-scale Biological Research Projects: A System of Tripartite Responsibility, and the Toronto Data Release Workshop (Nature 461, 168-170 (10 September 2009) | doi: 10.1038/461168a; Published online 9 September 2009).
Data Sharing and Release Plans
Projects designated by NIAID to rapidly share data for public access should develop their data sharing and release plans based on the guidelines outlined in this document. Investigators are encouraged to discuss their plans for data and resource sharing with NIAID Program Officers. Plans will be reviewed and approved by NIAID. For projects generating large-scale genomic data, the data sharing and release plan should contain the information required for the Genomic Data Sharing Plan as described in the NIH Genomic Data Sharing (GDS) Policy. Importantly, this NIAID guidance makes rapid data release timelines more explicit, especially for non-human genomic data, and these timelines supersede those stipulated in the NIH GDS Policy.
Specific Guidelines for Data Types
Sequence Data including Genome, Transcriptome, Microbiome, Epigenome, Metagenomics
All raw genome or metagenome data generated using sequencing approaches should be submitted as rapidly as possible, and no later than 45 calendar days after quality control, to the Sequence Read Archive (SRA) or, as appropriate, to dbGaP at the National Center for Biotechnology Information (NCBI)/National Library of Medicine/NIH. These data should also include information on sequencing platforms, libraries, primers, templates, and vectors, and quality values for each sequence, as appropriate. This applies to the full range of next-generation sequencing applications, including, for example, RNAseq, ChIPseq, TnSeq, and SNP profiling, among many others.
Full or partial genome and metagenome assemblies and their annotations should be submitted to GenBank, either as individual samples or for defined cohorts of samples, as rapidly as possible and no later than 45 calendar days after being generated and validated, followed by release to other web sites, as approved by NIAID.
It is expected that GenBank records for genome assemblies and annotation contain language to acknowledge the funding source and the joint ownership of the GenBank records by the NIAID-funded GCID and the Bioinformatics Resource Centers or other database reviewed and approved by NIAID. NIAID recommends that the following language be added to the COMMENT field of GenBank records: “This work was supported by the National Institute of Allergy and Infectious Diseases (NIAID), Genomic Centers for Infectious Diseases (GCID) program. This record is co-owned by the NIAID-funded GCID and the Bioinformatics Resource Centers,” or another center as designated by NIAID.
In unusual cases and as agreed upon by NIAID, the Institute will consider minimal delay in release of sequence or assembly data to NCBI. Delayed data release for sequence data should be discussed and justified in the data sharing and release plan submitted to NIAID and would require prior NIAID approval.
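As an illustration of the 45-calendar-day limit described above, the following is a minimal sketch of how a center might compute its latest submission date. The function names and the constant are hypothetical, and the sketch assumes the clock starts on the date quality control (or generation and validation) is completed; it is not an official NIAID tool.

```python
from datetime import date, timedelta

# The 45-calendar-day window comes from the guideline text; the names below
# are hypothetical and chosen only for this example.
SUBMISSION_WINDOW_DAYS = 45


def submission_deadline(qc_completed: date,
                        window_days: int = SUBMISSION_WINDOW_DAYS) -> date:
    """Return the latest submission date, counting calendar days from QC."""
    return qc_completed + timedelta(days=window_days)


def days_remaining(qc_completed: date, today: date) -> int:
    """Return days left before the deadline (negative if it has passed)."""
    return (submission_deadline(qc_completed) - today).days
```

For example, under these assumptions a data set that passed quality control on 15 January 2024 would need to be submitted by 29 February 2024.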
Clinical Data and Other Metadata
NIAID expects that relevant metadata (clinical data or any other type of data, such as antibiotic resistance) that are essential for the biological interpretation of genome sequence data and other omics and experimental data sets will be submitted and made publicly available through the appropriate NIAID Bioinformatics Resource Center, or another database designated by NIAID such as the NCBI dbGaP, at the same time as the experimental data. It is expected that a metadata and/or clinical data release plan will be defined prior to the initiation of data generation and will be agreed upon by NIAID. The plan will include 1) a list of metadata to be released, 2) the database(s) they will be released to, and 3) timelines of data release.
In unusual cases and as agreed upon by NIAID, release of the metadata can be delayed. In this case, it is expected that the metadata will still be submitted to an NIAID Bioinformatics Resource Center, or another database designated by NIAID such as the NCBI dbGaP, at the same time that the genomics, omics, or other generated data are submitted for public access to NCBI or to a database designated by NIAID. These metadata or clinical data may be embargoed at the NIAID Bioinformatics Resource Center, or other database designated by NIAID such as the NCBI dbGaP, for up to nine months or until publication, whichever comes first, as agreed upon by NIAID.
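The "nine months or publication, whichever comes first" rule can be sketched as a small date calculation. This is only an illustration: the function names are hypothetical, and it assumes that "nine months" means calendar months counted from the submission date, with the day clamped to the end of shorter months. The guidelines do not define this detail, so the interpretation should be confirmed with NIAID.

```python
import calendar
from datetime import date
from typing import Optional

# Embargo length taken from the guideline text; names are hypothetical.
EMBARGO_MONTHS = 9


def add_months(start: date, months: int) -> date:
    """Add calendar months, clamping the day to the target month's length."""
    total = start.month - 1 + months
    year = start.year + total // 12
    month = total % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(start.day, last_day))


def embargo_end(submitted: date, published: Optional[date] = None) -> date:
    """Return the date the embargo lifts: nine months after submission,
    or the publication date if that comes first."""
    limit = add_months(submitted, EMBARGO_MONTHS)
    if published is None:
        return limit
    return min(limit, published)
```

Under these assumptions, metadata submitted on 10 January 2024 would be released no later than 10 October 2024, or earlier if the associated publication appears before that date.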
Release of Patient/Donor Identifying Data
NIAID has sought advice on human subjects' privacy protection issues related to releasing human clinical data and established an external Working Group consisting of scientists with expertise in clinical research data management and infectious diseases. Points to consider when sharing and releasing clinical metadata were developed and are described here to assist in reviewing and identifying clinical or other metadata fields that may potentially identify human subjects. In the Genomic Data Sharing Plan, investigators should address all the requirements for sharing human genomic and clinical data in agreement with the NIH Genomic Data Sharing Policy, including relevant human subjects’ protection issues and the inclusion of an Institutional Certification for sharing such data.
The rights and privacy of human subjects who participate in clinical research studies shall be protected at all times. Clinical metadata, genomic, or other data sets, or any subset of the clinical and other metadata, that may potentially identify human subjects should be carefully reviewed and identified prior to sharing and releasing any clinical metadata to openly accessible public databases. The eighteen data elements defined by the Health Insurance Portability and Accountability Act of 1996 (HIPAA) safe harbor standard must be considered in this review. It is recognized that even with a careful and comprehensive review of the clinical metadata fields, there may be a risk of re-identification. Public release of clinical metadata should follow the guidance on release of metadata above. In some cases, potentially identifying data may be deposited in a controlled-access database designated by NIAID, such as dbGaP.
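A first-pass screen of metadata field names can help direct the manual review described above. The sketch below is illustrative only: the field names are hypothetical, the mapping covers just a few of the HIPAA safe harbor identifier categories, and a match or non-match does not replace a careful human review of the actual data values.

```python
from typing import Dict, Iterable

# Hypothetical, deliberately incomplete mapping from normalized field names
# to HIPAA safe harbor identifier categories. A real review must consider
# all eighteen categories and the data values, not only the column names.
SAFE_HARBOR_HINTS: Dict[str, str] = {
    "patient_name": "names",
    "zip_code": "geographic subdivisions smaller than a state",
    "birth_date": "dates (other than year) related to an individual",
    "admission_date": "dates (other than year) related to an individual",
    "phone": "telephone numbers",
    "email": "email addresses",
    "mrn": "medical record numbers",
}


def flag_fields(field_names: Iterable[str]) -> Dict[str, str]:
    """Return fields whose normalized names match a known identifier hint."""
    flagged = {}
    for field in field_names:
        key = field.strip().lower().replace(" ", "_")
        if key in SAFE_HARBOR_HINTS:
            flagged[field] = SAFE_HARBOR_HINTS[key]
    return flagged
```

Fields flagged this way would be candidates for removal, generalization (for example, keeping only the year of a date), or deposition in a controlled-access database, subject to the review process agreed upon with NIAID.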
All NIAID-funded studies involving human subjects should explicitly seek consent for future research use of samples and broad sharing of participant data. Participants who do not consent to future use or broad data sharing may still participate in the primary study, if consistent with study design. Whenever possible, studies should seek broad consent for general research use of the samples and consent should not limit the types of users who may access the data.
Single nucleotide polymorphisms (SNPs) for human genomic data should be submitted as rapidly as possible to NCBI dbSNP and no later than 45 days after completion of standard quality control practices. Non-identifying clinical and other metadata should follow the release guidelines above.
Genome Wide Association Studies Data (GWAS)
Data generated from human genomic or human genome-wide association studies should be submitted as rapidly as possible to NIH dbGaP following the NIH Genomic Data Sharing Policy. It is anticipated that the data will be deposited into dbGaP within six months of data generation or at the time of publication, whichever comes first. Per NIH policy, access to the data in this controlled-access database will be granted for up to one year to investigators who submit a request, subject to a 12-month publication embargo.
Release of Other Data
Other data types not specifically addressed above, including expression data, immunological data, proteomic data, and other omics data, including unpublished data, are expected to be rapidly deposited into publicly accessible databases, including the appropriate NIAID/DMID BRC or another site designated by NIAID. These data are expected to be released within nine months of generation and validation or upon publication, whichever comes first.
Data resulting from processing and analysis (e.g. metagenomic relative abundances) should be made available to the public within nine months of generation and validation or upon acceptance of a manuscript for publication, whichever comes first. This includes data analysis performed by the Center or research program with limited or no data generation. The data release plan should discuss public accessibility of the analysis data and the site where such data will be housed (e.g. bioRxiv.org). It is anticipated that the appropriate NIAID/DMID BRC will house these data sets, and this should be discussed in the Data Sharing and Release Plans.
Sharing of Reagents and Other Resources
Investigators are encouraged to consult with NIAID Program Officers to determine which unique reagents, such as microbial strains or clones, should be deposited at the BEI Resources Repository or other approved public repositories. Resources and reagents to be shared should be released rapidly and no later than the time of publication to promote the principles expressed above. Details on sharing should be documented in the Resource Sharing plan.
For cohorts of strains that are sequenced by GCID or other NIAID-funded projects, it is anticipated that only key representative strains will be deposited at BEI and that the collaborator providing the strains to the GCID or NIAID-funded projects will contact and submit deposition forms to BEI prior to sequencing of the strain. It is expected that the following points be carefully considered prior to depositing a strain into BEI:
- Is the strain or a representative strain available and accessible in other public repositories?
- Are there strains that represent key lineages that can be selected and deposited?
The strategy and criteria for selecting strains for deposition into BEI must be outlined in the project plan. However, the centers should also ensure other ways to share additional strains with the community, if needed.
Sharing of Software, Models, and Other Resources
Software, such as data analysis tools, data modeling algorithms, systems biology models, and database schema and specifications, should be made available as Open Source, to guarantee the right of others to read, redistribute, modify, and freely use the software. Release of these resources should follow community standards including public accessibility through central repositories (e.g. GitHub, SourceForge) and should be released within nine months of validation and no later than the time of publication.