NIAID has made a significant investment in genomic-related activities that provide comprehensive genomic, functional genomics, bioinformatics, structural genomics, proteomics and integrated "omics" data sets, resources and reagents to the scientific community for basic and applied research in infectious diseases. This wealth of genomics and other data sets, as well as the availability of the human genome, provides a valuable and critical resource for the scientific community.
This document serves to provide general guiding principles and specific guidelines to prepare and establish consistent data release plans across NIAID/DMID Omics Centers, including Genomic Sequencing Centers for Infectious Diseases (GSCID) and other NIAID-funded large-scale centers and projects. NIAID acknowledges the variety of projects among the centers, but at the same time, considers it of the highest importance to develop guidelines that are flexible enough to achieve rapid data release and to be sensitive to the aims of the centers and their individual projects. Continued discussions of the data release guidelines will be an ongoing function of the centers, Scientific Working Groups, scientific community, and NIAID. Modifications to NIAID Data Release Guidelines and data release plans may be needed as a result of these discussions and other data release guidelines developed by the National Institutes of Health (NIH).
Rapid and unrestricted sharing of data and research resources is essential for advancing research on human health and infectious diseases. The utility of the generated data to the scientific community depends largely on how quickly these data can be deposited into public databases and made accessible to the scientific community. NIAID is committed to rapid release of genomic and other data types and, in addition, recognizes that clinical data and other metadata associated with the genomic, omics, and other data are valuable research resources. For these reasons, NIAID endorses rapid release of all these data sets, and it is anticipated that data generated will be made freely available via deposition into publicly accessible and searchable international databases such as GenBank at the National Center for Biotechnology Information (NCBI), and into NIAID-funded databases such as a DMID Bioinformatics Resource Center or other databases designated and approved by NIAID.
Users of any released data are expected to act responsibly and to recognize the scientific contribution of the data generators/producers by following normal standards of scientific etiquette and fair use of unpublished data. Such guidelines can be found in Sharing Data from Large-scale Biological Research Projects: A System of Tripartite Responsibility, and in the report from the Toronto Data Release Workshop (Nature 461, 168-170 (10 September 2009); doi: 10.1038/461168a; published online 9 September 2009).
Data Sharing and Release Plans
Data sharing and release plans should be based on guidelines outlined in this document for projects designated by NIAID to rapidly share data for public access. Plans will be reviewed and approved by NIAID.
All raw genome data and next-generation sequencing data should be submitted as rapidly as possible to either the Trace Archive or, as appropriate, to the Short Read Archive at the National Center for Biotechnology Information (NCBI)/National Library of Medicine/NIH. These data should also include information on templates, vectors, and quality values for each sequence, as appropriate. This includes RNA-seq transcriptomics data obtained from next-generation sequencing.
Full and partial genome and metagenome assemblies and their annotations should be deposited in appropriate databases at NCBI after verification by the center or data generator. Assuming no specific errors are detected during the validation process, final assemblies and final annotations should be submitted to GenBank for individual samples or for defined cohorts of samples as rapidly as possible and no later than 45 calendar days after being generated, followed by release to other web sites, as approved by NIAID.
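The 45-calendar-day window can be checked mechanically. The following is a minimal sketch, not an NIAID tool: the function names are hypothetical, and it assumes that the day an assembly or annotation is finalized counts as day 0 and that the 45th day itself is still on time. The guideline does not spell out either convention, so a center should confirm them with NIAID.

```python
from datetime import date, timedelta

# Window taken from the guideline text: 45 calendar days.
GENBANK_SUBMISSION_WINDOW_DAYS = 45


def genbank_submission_deadline(generated_on: date) -> date:
    """Return the last calendar day on which a submission is still on time.

    Assumption: the generation date is day 0 of the window.
    """
    return generated_on + timedelta(days=GENBANK_SUBMISSION_WINDOW_DAYS)


def is_submission_on_time(generated_on: date, submitted_on: date) -> bool:
    """True if the submission date falls on or before the deadline."""
    return submitted_on <= genbank_submission_deadline(generated_on)
```

For example, under these assumptions an assembly finalized on 1 January 2024 would need to reach GenBank by 15 February 2024.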
It is expected that GenBank records for genome assemblies and annotation contain language to acknowledge the funding source and the joint ownership of the GenBank records by the NIAID-funded GSCID and the Bioinformatics Resource Centers or another database reviewed and approved by NIAID. NIAID recommends that the following language be added to the COMMENT field of GenBank records: “This work was supported by the National Institute of Allergy and Infectious Diseases (NIAID), Genome Sequencing Centers for Infectious Diseases (GSCID) program. This record is co-owned by the NIAID-funded GSCID and the Bioinformatics Resource Centers,” or another center as designated by NIAID.
In unusual cases and as agreed upon by NIAID, the Institute will consider a minimal delay in the release of sequence data to NCBI. Any delayed release of sequence data should be discussed and justified in the data sharing and release plan submitted to NIAID and would require NIAID approval.
Clinical Data and Other Metadata
NIAID also expects that relevant metadata (clinical data or any other type of data) that are essential for the biological interpretation of genome sequence data and other omics and experimental data sets will be made available to the scientific community as rapidly as possible, and at the same time that the genomic, omics, or other generated data types are submitted. These metadata should be released through a publicly accessible database such as the appropriate NIAID Bioinformatics Resource Center, or other databases designated by NIAID such as the NCBI dbGaP. It is expected that a data release plan for metadata will be defined prior to the initiation of data generation and will be agreed upon by NIAID. The plan will include 1) a list of metadata to be released, 2) the database(s) they will be released to, and 3) timelines of data release.
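The three required components of a metadata release plan lend themselves to a simple completeness check. The sketch below is purely illustrative: the class, its field names, and the rule that every listed metadata field needs its own planned release date are assumptions made for the example, not an NIAID-defined format.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List


@dataclass
class MetadataReleasePlan:
    """Hypothetical container for the three plan components named in the text."""

    metadata_fields: List[str]      # 1) list of metadata to be released
    target_databases: List[str]     # 2) database(s) they will be released to
    release_dates: Dict[str, date]  # 3) timelines: field name -> planned release date

    def missing_components(self) -> List[str]:
        """Return the plan components that are absent or incomplete."""
        missing = []
        if not self.metadata_fields:
            missing.append("list of metadata")
        if not self.target_databases:
            missing.append("target database(s)")
        # Assumption: each listed field should have its own planned release date.
        unscheduled = [f for f in self.metadata_fields if f not in self.release_dates]
        if not self.release_dates or unscheduled:
            missing.append("release timelines")
        return missing
```

A plan that lists fields and databases but leaves a field without a date would report "release timelines" as missing, which is the kind of gap worth resolving before the plan is submitted for NIAID agreement.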
In unusual cases and as agreed upon by NIAID, release of the metadata can be delayed. It is expected in this case that the metadata will be submitted to an NIAID Bioinformatics Resource Center, or other databases designated by NIAID such as the NCBI dbGaP, at the same time that the genomics, omics, or other generated data types are submitted for public access to NCBI or a database designated by NIAID. These metadata will be embargoed at an NIAID Bioinformatics Resource Center, or other databases designated by NIAID such as the NCBI dbGaP, for up to nine months or until publication, whichever comes first and as agreed upon by NIAID.
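The "nine months or publication, whichever comes first" rule amounts to taking the earlier of two dates. The following sketch shows that arithmetic under stated assumptions: the function names are hypothetical, the embargo clock is assumed to start on the submission date, and a nine-month offset that lands on a nonexistent day (for example 31 May plus nine months) is clamped to the last day of the target month. None of these conventions is stated in the guideline.

```python
import calendar
from datetime import date
from typing import Optional

# Maximum embargo length taken from the guideline text: nine months.
MAX_EMBARGO_MONTHS = 9


def add_months(start: date, months: int) -> date:
    """Add calendar months, clamping to the last valid day of the target month."""
    month_index = start.month - 1 + months
    year = start.year + month_index // 12
    month = month_index % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(start.day, last_day))


def embargo_end(submitted_on: date, published_on: Optional[date] = None) -> date:
    """Return the earlier of the nine-month limit and the publication date.

    Assumption: the embargo starts on the submission date.
    """
    latest_release = add_months(submitted_on, MAX_EMBARGO_MONTHS)
    if published_on is None:
        return latest_release
    return min(latest_release, published_on)
```

With these assumptions, metadata submitted on 15 January 2024 would be released no later than 15 October 2024, or earlier if the associated publication appears before that date.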
Release of Patient/Donor Identifying Data
NIAID has sought advice on human subjects' privacy protection issues related to releasing human clinical data and established an external Working Group consisting of scientists with expertise in clinical research data management and infectious diseases. Points to consider when sharing and releasing clinical metadata were developed and are described here to assist in reviewing and identifying clinical or other metadata fields that may potentially identify human subjects.
The rights and privacy of human subjects who participate in clinical research studies shall be protected at all times. Clinical metadata, genomic, or other data sets, or any subset of the clinical and other metadata, that may potentially identify the human subjects associated with samples shall not be released in openly accessible public databases. In some cases, potentially identifying data may be deposited in a controlled-access database as designated by NIAID.
Clinical metadata and other fields that could uniquely identify an individual should be carefully reviewed and identified prior to sharing and releasing any clinical metadata to openly accessible public databases. The eighteen data elements defined by the Health Insurance Portability and Accountability Act of 1996 (HIPAA) safe harbor standard must be considered in this review. It is recognized that even with a careful and comprehensive review of the clinical metadata fields, there may be a risk of re-identification.
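A first-pass screen of metadata field names can help direct the manual review described above. The sketch below is an illustration only: the keyword hints are hypothetical, they cover just a few of the eighteen safe harbor categories, and a name-based match neither proves nor rules out that a field is identifying. It is meant to flag fields for human review, not to replace that review.

```python
# Hypothetical keyword hints for a few of the 18 HIPAA safe harbor categories.
# Deliberately incomplete; intended only to flag fields for manual review.
IDENTIFIER_HINTS = {
    "names": ("name",),
    "geographic subdivisions smaller than a state": ("address", "city", "zip", "county"),
    "dates more specific than year": ("date", "dob", "birth"),
    "telephone numbers": ("phone",),
    "email addresses": ("email",),
    "medical record numbers": ("mrn", "medical_record"),
}


def flag_fields_for_review(field_names):
    """Map each field whose name matches a hint to the categories it may fall under."""
    flagged = {}
    for field_name in field_names:
        lowered = field_name.lower()
        matches = [
            category
            for category, hints in IDENTIFIER_HINTS.items()
            if any(hint in lowered for hint in hints)
        ]
        if matches:
            flagged[field_name] = matches
    return flagged
```

Substring matching is intentionally broad, so it will also flag harmless fields whose names happen to contain a hint; erring toward over-flagging seems the safer default given the re-identification risk noted above.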
Single nucleotide polymorphisms (SNPs) should be submitted as rapidly as possible to NCBI dbSNP and no later than 45 days after completion of standard quality control practices. Non-identifying clinical and other metadata should follow the release guidelines above.
Genome-Wide Association Studies (GWAS) Data
Data generated from human genome-wide association studies should be submitted as rapidly as possible to NIH dbGaP following the NIH policy on GWAS data deposition. It is anticipated that the data will be deposited into dbGaP within six months of data generation. Per NIH policy, the data will then be available through this controlled-access database only to investigators who submit an access request, with a publication embargo of up to 12 months.
Release of Other Data
Other data types not specifically addressed above, including expression data, immunological data, proteomic data, and other omics data, whether published or unpublished, are expected to be rapidly deposited into a publicly accessible website(s), including the appropriate NIAID BRC or another site designated by NIAID.
In some cases, NIAID will consider a minimal delay in the release of other data types, of up to nine months or until publication, whichever comes first. Any delayed release of these other data types should be discussed in the data release plan submitted to NIAID and would require NIAID approval.
Analyses performed should be made available to the public upon acceptance of a manuscript for publication or within one year of generation, whichever comes first. This includes data analysis performed by the Center or research program with limited or no data generation. The data release plan should discuss public accessibility of the analysis data and the site where such data will be housed. It is anticipated that the appropriate NIAID Bioinformatics Resource Centers will house these data sets; this should be discussed in the data release plans.
Sharing of Reagents and Other Resources
Reagents, such as microbial strains to be sequenced or clones, should be deposited at the BEI Resources Repository or other approved public repositories. Other resources and reagents to be shared should be released rapidly to promote the principles expressed above, and their release should be documented in the data, reagent, and resources release plan.
For strains that are sequenced by GSCID or other NIAID-funded projects, it is anticipated that the strain will be deposited at BEI and that the collaborator providing the strain to the GSCID or NIAID-funded project will contact BEI and submit deposition forms prior to sequencing of the strain. It is expected that the following points be carefully considered prior to depositing a strain into BEI:
- Is the strain or a representative strain available and accessible in other public repositories?
- Can strains be selected and deposited that represent key lineages of the strains to be sequenced?
The strategy and criteria for selecting strains for deposition into BEI must be outlined in the project plan.
Software, such as data analysis tools, data modeling algorithms, and database schema and specifications, should be made available as Open Source, to guarantee the right of others to read, redistribute, modify, and freely use the software.