FAIR Data Principles at NIH and NIAID

The FAIR data principles are a set of guidelines aimed at improving the Findability, Accessibility, Interoperability, and Reusability of digital assets. By adopting these principles, NIH and NIAID are paving the way for a more open and innovative research ecosystem that propels scientific discovery to improve public health.

Using the NIAID Data Ecosystem Discovery Portal to Search Across Data Repositories

Data Science Dispatch

NIAID has developed a platform to help researchers find data related to infectious and immune-mediated disease (IID) across multiple data repositories. The NIAID Data Ecosystem Discovery Portal is a centralized hub cataloging millions of datasets from over 50 sources.

Researchers can use the Discovery Portal to find data, resources, and computational tools from different repositories. This can save them time otherwise spent combing through multiple sources and help them find datasets they weren’t aware of previously.

The Discovery Portal includes resources from IID and generalist repositories. Representative resources include NIAID-sponsored repositories such as AccessClinicalData@NIAID, ImmPort, and VDJServer, as well as repositories funded outside of NIAID but relevant to IID research. Resources in the Discovery Portal include a diverse array of data types spanning multiple domains of IID research, including -omics data, clinical data, epidemiological data, pathogen-host interaction data, flow cytometry, imaging, and other experimental data.

The Discovery Portal supports NIAID objectives of maximizing the impact of scientific data, reducing duplication of efforts in research, and promoting data reuse, data transparency and compliance with data-sharing policies. The portal aligns with many of the principles of findable, accessible, interoperable, and reusable (FAIR) data practices by making data easier to find and access.

Using metadata to drive discovery

The NIAID Data Ecosystem Discovery Portal does not contain data itself. Instead, it contains detailed information about IID datasets and resources drawn from metadata. Users can then access the resources through external links.

The portal uses metadata to support several key features:

  • Search and Discovery: Users can rapidly search millions of datasets across both IID and generalist repositories using the Search or Advanced Search options. Metadata categories such as funding source, repository, and conditions of access help filter search results and identify relevant research data.
  • Metadata Compatibility: Each individual dataset in the Discovery Portal has a “metadata compatibility score,” which displays the specific metadata elements collected for that resource. The Discovery Portal also has metadata compatibility visualizations that capture the breadth of metadata at the repository level. This information can help researchers and data contributors quickly understand a repository’s metadata structure, aiding in decisions about where to deposit or retrieve resources.
  • Downloadable Metadata: The portal has buttons that allow users to download metadata to perform meta-analyses.
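
As a rough illustration of how metadata categories can narrow a search, the Python sketch below filters a small in-memory list of records. The records, the field names (`funding_source`, `repository`, `access`), their values, and the `filter_records` helper are all hypothetical; they do not reflect the Discovery Portal's actual schema or interface.

```python
# Hypothetical metadata records; field names and values are illustrative only.
records = [
    {"name": "Dataset A", "repository": "ImmPort", "funding_source": "NIAID", "access": "open"},
    {"name": "Dataset B", "repository": "VDJServer", "funding_source": "NIAID", "access": "controlled"},
    {"name": "Dataset C", "repository": "Generalist Repo", "funding_source": "Other", "access": "open"},
]


def filter_records(records, **criteria):
    """Return records whose metadata matches every given field/value pair."""
    return [
        record for record in records
        if all(record.get(field) == value for field, value in criteria.items())
    ]


open_niaid = filter_records(records, funding_source="NIAID", access="open")
print([record["name"] for record in open_niaid])
```

Each additional keyword argument acts like another filter on the search page: a record is kept only if it matches all of them.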

The Discovery Portal is working to fill missing or incomplete metadata fields (such as Pathogen Species, Health Condition, and Host Species) by augmenting and standardizing metadata fields to provide more of this necessary information for users.
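
One simple way to picture metadata standardization is mapping free-text values onto a single preferred term. The sketch below does this for a host species field. The synonym table and the `standardize_host_species` function are assumptions made for illustration, not the method the Discovery Portal uses.

```python
# Hypothetical synonym table; a real pipeline would draw on a curated ontology.
HOST_SPECIES_SYNONYMS = {
    "human": "Homo sapiens",
    "homo sapiens": "Homo sapiens",
    "h. sapiens": "Homo sapiens",
    "mouse": "Mus musculus",
    "mus musculus": "Mus musculus",
}


def standardize_host_species(value):
    """Map a free-text host species to a standard name, or None if unknown or missing."""
    if not value or not value.strip():
        return None
    return HOST_SPECIES_SYNONYMS.get(value.strip().lower())


for raw in ["Human", " mus musculus ", "", "zebrafish"]:
    print(repr(raw), "->", standardize_host_species(raw))
```

Values that cannot be mapped come back as `None`, which makes the remaining gaps easy to count and review by hand.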

New Program Collection tool and other features

One of the new features of the NIAID Data Ecosystem Discovery Portal is the “Program Collection” filter. Program Collections are groups of datasets contributed by specialized NIAID research programs and initiatives. The Discovery Portal displays the Program Collection filter on the search page, and current efforts are focused on expanding Program Collection data.

The Program Collection filter allows researchers to discover high-quality, program-specific data relevant to their area of interest and find collections that align with the broader objectives of NIAID’s strategic research efforts. The feature also amplifies the scientific contributions of participating networks and increases the likelihood of researchers using these datasets. 

The Sources page of the Discovery Portal can also help researchers and data providers make informed decisions about where to deposit their data.

The Discovery Portal is now connected to National Center for Biotechnology Information (NCBI) databases through NCBI LinkOut. When NCBI database content is linked to data described in the Portal, a link to the related Portal entry can be found on the NCBI page.

Learn more by visiting the Discovery Portal, reviewing the Getting Started page, and exploring the Knowledge Center.

Understanding Metadata: A Key to Data Sharing and Reuse

Metadata plays a crucial role in sharing and reusing scientific data. Understanding what metadata is and how it is used can accelerate your research and increase the visibility of your work. It can also help to advance the field of infectious and immune-mediated disease (IID) research.

What is metadata?

Metadata is data about data. It provides additional information to help people understand the data, such as its origin, structure, and context. 

For example, for a genome sequence, the data is the actual sequence of nucleotides. The metadata includes the author of the data, the date the data was collected, the measurement techniques used, the health condition that is the focus of the dataset (like asthma or autoimmune diseases), and more. You can see another example of data versus metadata in the video on the right (data management and sharing webinar from the National Institute of Diabetes and Digestive and Kidney Diseases, 4:22-6:28).

Examples of common metadata elements that describe IID research data are available at the NIAID Data Ecosystem’s list of common fundamental and recommended metadata elements.
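
The distinction between data and metadata can be sketched in a few lines of Python. The sequence fragment and every metadata value below are made-up placeholders, and the field names are illustrative rather than taken from any particular standard.

```python
import json

# The data: the nucleotide sequence itself (a made-up fragment).
sequence = "ATGCGTACCTGA"

# The metadata: information describing the data. All values are placeholders.
metadata = {
    "author": "Example Researcher",
    "date_collected": "2024-01-15",
    "measurement_technique": "whole genome sequencing",
    "health_condition": "asthma",
}

# Metadata can be shared and read on its own, without the sequence.
print(json.dumps(metadata, indent=2))
print("Sequence length:", len(sequence))
```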

Why is metadata important?

When you share scientific data, metadata provides the context that allows others to understand, trust, reproduce, or reuse data. This is particularly important in studies or secondary analyses where data is integrated from multiple sources; comprehensive metadata enables a scientist to combine data from different sources.

Using metadata effectively can also help your data get discovered, reused, and cited—thereby maximizing the value and impact of your research.

Collecting rich metadata during research

Effective metadata use starts with collecting rich metadata throughout the research process. “Rich” metadata is detailed and structured, making it easier for people to quickly learn about your data. 

Using standardized formats and schemas makes it clear which metadata components are present and where they can be found. Using common terminologies, ontologies, and data formats takes this a step further by defining specific metadata elements for both people and computers. Machine-readable metadata lets users explore and work with data through code, so they can quickly survey many data files.

Some common examples of collecting metadata in a structured way include defining standardized date and time formats and using ORCID IDs for authors to ensure precise identification.
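
Both practices can be checked in code. The sketch below writes a date in ISO 8601 format (one widely used standardized format, chosen here as an assumption) and validates the structure and check character of an ORCID iD, which is calculated with the ISO 7064 MOD 11-2 algorithm. The function names are illustrative.

```python
import re
from datetime import date


def orcid_check_digit(base_digits):
    """Compute the ORCID check character (ISO 7064 MOD 11-2) for the first 15 digits."""
    total = 0
    for digit in base_digits:
        total = (total + int(digit)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid):
    """Check the 0000-0000-0000-000X layout and the final check character."""
    if not re.fullmatch(r"[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{3}[0-9X]", orcid):
        return False
    compact = orcid.replace("-", "")
    return orcid_check_digit(compact[:15]) == compact[15]


# ISO 8601 dates (YYYY-MM-DD) are unambiguous and sort correctly as text.
collected_on = date(2024, 3, 5).isoformat()
print(collected_on)

# 0000-0002-1825-0097 is the sample ORCID iD used in ORCID's documentation.
print(is_valid_orcid("0000-0002-1825-0097"))
print(is_valid_orcid("0000-0002-1825-0098"))
```

A passing check only shows that an identifier is well formed. It does not confirm that the iD belongs to the person named in the metadata.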

Biomedical researchers can follow some basic steps to ensure that they are collecting comprehensive and standardized metadata. 

1. Determine necessary metadata content and formats

Collecting data in the format you intend to share it in is more efficient than reformatting everything at the end. Here are some questions to help you determine data and metadata formats:

  • Who will use these data, and how will they use them? What information do they need to understand the data?
  • Many research areas have standardized metadata formats that researchers can follow. What metadata standards or schemas do other researchers in your field use? Would using these standards and schemas help researchers understand and reuse these data?
  • Does the target repository or scientific journal have any specific metadata or formatting requirements? If the repository where you plan to share your data has specific guidance, follow that guidance from the start of your research.

2. Create metadata throughout the data lifecycle

Before data collection, collect protocol documentation and set up systems for data and metadata collection. These systems can collect information using the standards, formats, vocabularies, and ontologies selected, and will save you time when preparing data and metadata for publication.

During the data collection phase, document anything that fits into the target metadata fields. These may include the dates data was collected, variables measured, the units of measurement, the instruments used, and the conditions under which the data was collected.

After data collection, add any remaining metadata elements from your plan. These elements may focus more on describing data processing steps, versioning, authors, or related topics. 
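
The three phases above can be pictured as one metadata record that grows over time. In this minimal sketch, the field names and values are placeholders, not a recommended schema.

```python
import json

# Before data collection: set up the record with protocol-level details.
metadata = {
    "protocol": "Example protocol v1",  # placeholder
    "date_format": "ISO 8601",
}

# During data collection: record what was measured and how.
metadata.update({
    "dates_collected": ["2024-02-01", "2024-02-08"],
    "variables": [{"name": "body_temperature", "unit": "degC"}],
    "instrument": "Example thermometer",
})

# After data collection: describe processing, versioning, and authorship.
metadata.update({
    "processing_steps": ["removed duplicate readings"],
    "version": "1.0",
    "authors": ["Example Researcher"],
})

print(json.dumps(metadata, indent=2))
```

Keeping the record in a structured format such as JSON from the start means little reformatting is needed when it is time to share.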

3. Prepare to share data and metadata 

Verify that the metadata meets the requirements of the place where you plan to share your data. Before sharing, add any elements that are typically finalized late in the data lifecycle, such as associated publications, a license for reuse, or a data author list.
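
A short script can flag gaps before submission. The sketch below assumes a hypothetical list of required fields; in practice, the list would come from the guidance of the repository or journal you are targeting.

```python
# Hypothetical requirements; substitute the fields your target repository asks for.
REQUIRED_FIELDS = ["title", "authors", "license", "date_collected"]


def missing_fields(metadata, required=REQUIRED_FIELDS):
    """Return the required fields that are absent or empty in the metadata record."""
    return [field for field in required if not metadata.get(field)]


draft = {
    "title": "Example dataset",
    "authors": ["Example Researcher"],
    "license": "",  # not yet decided
    "date_collected": "2024-02-01",
}

print(missing_fields(draft))
```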

Throughout the process, you can seek guidance from your program officer or the repositories where you intend to share data to ensure that metadata is collected and shared effectively.

Sharing data and metadata

The NIH Data Management and Sharing Policy encourages sharing metadata that describes or supports your scientific data. NIH recommends data management and sharing practices consistent with the FAIR (Findable, Accessible, Interoperable, and Reusable) data principles, and it strongly encourages the use of established data repositories for preserving and sharing data.

In some instances, the full scientific data cannot be shared easily. This may be due to large file sizes — particularly with imaging-related research — or data privacy regulations. However, even if the actual scientific data cannot be shared, sharing metadata is still valuable. This practice ensures that there is a public record of the data's existence and provides important background information that can be used by other researchers.

Metadata is also a powerful tool for finding scientific data in repositories. Researchers can use metadata to search for datasets that match specific criteria. One tool that can help researchers find relevant data is the NIAID Data Ecosystem Discovery Portal, which uses the metadata that repositories hold about their data to search across more than 50 IID repositories and data sources.

Learn more about developing a data management and sharing plan and compliance with relevant NIH data sharing policies by reviewing the Data Policy and Guidance page.

Facilitating Data Harmonization Across an International HIV Program

Scientists supported by NIAID are helping an international research consortium harmonize HIV data from around the world. 

The International Epidemiology Databases to Evaluate AIDS (IeDEA) collects observational data representing over 2.2 million people living with and at risk for HIV. This international research consortium, established by NIH in 2006, collects data from clinical centers and research groups in seven geographic regions — which include 44 countries across five continents.

IeDEA networks combine de-identified health data from regional databases in multiple parts of the world for approved multiregional analyses. This helps answer HIV research questions that individual studies cannot address. 

However, harmonizing datasets from different regions presents many challenges. Datasets from each region may be in different formats or languages, and regions are subject to different data-sharing regulations.

When researchers request data for multiregional studies, data managers are tasked with selecting data that match the study’s inclusion and exclusion criteria and mapping the requested data to the IeDEA Data Exchange Standard (DES). This process has historically required significant effort, which can result in delays in sending the data and challenges with subsequent analysis of the standardized data.

Developing the Harmonist Toolkit

To make it easier to harmonize data from multiple regions, NIAID-supported informatics specialists developed the Harmonist Data Toolkit. The Harmonist Toolkit is a web-based application that checks for data quality and DES conformance, displays possible errors for data managers to address, and generates data reports. Once a dataset meets the requisite criteria, the Harmonist Toolkit can submit the dataset to the requesting researcher.
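
To give a sense of what a conformance check involves, the Python sketch below tests rows of a table against a small specification and lists possible errors for review. It is not the Harmonist Toolkit's code (the Toolkit is an R/Shiny application), and the column names, rules, and allowed codes are hypothetical rather than taken from the IeDEA DES.

```python
from datetime import date

# Hypothetical specification for one table; not the actual IeDEA DES.
SPEC = {
    "patient_id": {"required": True},
    "birth_date": {"required": True, "type": "date"},
    "sex": {"required": False, "allowed": {"1", "2", "9"}},
}


def is_iso_date(text):
    """Return True if text parses as an ISO 8601 calendar date."""
    try:
        date.fromisoformat(text)
        return True
    except ValueError:
        return False


def check_rows(rows, spec=SPEC):
    """Return (row_number, column, problem) tuples for a data manager to review."""
    errors = []
    for number, row in enumerate(rows, start=1):
        for column, rule in spec.items():
            value = row.get(column, "")
            if value == "":
                if rule.get("required"):
                    errors.append((number, column, "missing required value"))
                continue
            if rule.get("type") == "date" and not is_iso_date(value):
                errors.append((number, column, "not an ISO 8601 date"))
            if "allowed" in rule and value not in rule["allowed"]:
                errors.append((number, column, "value not in allowed codes"))
    return errors


rows = [
    {"patient_id": "A001", "birth_date": "1980-05-17", "sex": "1"},
    {"patient_id": "", "birth_date": "17/05/1980", "sex": "F"},
]
for error in check_rows(rows):
    print(error)
```

Reporting every problem with its row and column, instead of stopping at the first one, mirrors the idea of showing data managers a list of issues they can address before a dataset is submitted.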

Researchers worked with IeDEA’s regional data managers to develop and implement the Harmonist Toolkit, which launched in 2019. After a year of using the Toolkit, data managers and researchers reported that the Toolkit improved the quality of datasets, generated useful reports, and simplified the task of linking datasets to the DES. High data quality improves trust in study results, leading to greater impact on patient care and health policy.

The Harmonist Toolkit is built using the R/Shiny framework, but as a web-based application it does not require coding knowledge for data managers to use. It can be hosted on a cloud-based server or locally on a laptop or desktop computer. 

Stephany Duda, Ph.D., an associate professor of biomedical informatics at the Vanderbilt University School of Medicine, is the primary investigator for the Harmonist project. She said that a key to building the Harmonist Toolkit was continually involving data managers at the different regional sites. 

“Anything that I design should make their lives easier and should make it easier for them to develop datasets that are standardized and adhere to best practices,” Dr. Duda said.

In 2023, Dr. Duda, along with lead author Dr. Judith Lewis and other members of the Harmonist team, published a paper in the Journal of Biomedical Informatics describing the results of the project.

“The datasets are never perfect. That’s just the nature of clinical observational research data,” Dr. Duda said. “But we have seen, as reported in the paper, a substantial downward trend in the number of errors that we’ve detected in these datasets.”

Applying the Harmonist framework to other international research

The Harmonist team is continuing to improve the IeDEA Harmonist Data Toolkit — while also looking into adapting it for use beyond the IeDEA consortium. 

The project is currently supporting harmonization efforts for the Regional Prospective Observational Research for Tuberculosis (RePORT) International consortium, which studies tuberculosis (TB) in the context of HIV and is supported by NIAID. The team is also collaborating with other consortia to create a generalized version of the Toolkit code. 

“There's so much work that still needs to be done to make international research more accessible for everybody,” Dr. Duda said. “It’s rewarding to have this opportunity to build resources that support other researchers.”

Harmonist is supported by NIAID’s Division of AIDS. Learn more by visiting RePORTER.

IeDEA is supported by NIAID as well as the Eunice Kennedy Shriver National Institute of Child Health and Human Development, the National Cancer Institute, the National Institute of Mental Health, the National Institute on Drug Abuse, the National Heart, Lung, and Blood Institute, the National Institute on Alcohol Abuse and Alcoholism, the National Institute of Diabetes and Digestive and Kidney Diseases, the Fogarty International Center, and the National Library of Medicine. Learn more about NIAID’s support for IeDEA.

Women’s Health

Women face unique health problems related to many NIAID mission areas, specifically HIV/AIDS, sexually transmitted infections, and autoimmune disorders. Many infectious and autoimmune diseases affect female populations disproportionately. For example, genital herpes from herpes simplex virus 2 is nearly twice as common among women as among men. Likewise, women account for more cases of chlamydia, lupus, and scleroderma than do men.

Even diseases that strike men and women in nearly equal numbers may have unique consequences or complications for women. For instance, women with HIV are at higher risk of severe cases of gynecological problems, such as chlamydia or bacterial vaginosis, than are non-infected women. Women also risk passing some of these diseases to children during pregnancy or breastfeeding.

The National Institutes of Health (NIH) created the women’s health research category in 1994 for annual budgeting purposes. In 2019, the category was updated to include the following categories:

  • Studies with only female participants
  • Diseases or health conditions unique to women
  • Diseases or conditions that predominantly affect women or girls
  • Research with an overall goal of examining women’s health outcomes, trajectories, risk factors, diagnosis or treatment strategies, or health differences between women and men
  • Career development, training, and meeting grants related to fostering the women’s health research workforce

Related Public Health and Government Information

To learn about risk factors for diseases that specifically affect women and current prevention and treatment strategies, visit the MedlinePlus Women’s Health site.
