Data Science Publications

The NIAID Office of Data Science and Emerging Technologies (ODSET) highlights publications that feature innovative uses of data science and bioinformatics in infectious, immune-mediated, and allergic disease research.

Explore NIAID data science publications on PubMed:

If you would like to feature a publication on this page, please contact Data Science. Publications should feature research related to infectious, immunologic, and allergic diseases; include data science or a related discipline; and cite NIAID funding in the manuscript. Please include in your email:

  • The title of your published article.
  • A link to the article.
  • A 50-60 word description of the article. 

156 Results

Multi-Omics Analysis of Human Blood Cells Reveals Unique Features of Age-Associated Type 2 CD8 Memory T Cells

February 1, 2026
Aging Cell

Using a multi-omics approach, the authors identify age-related transcriptional and epigenetic changes in CD8 T cells and show that the epigenetic changes are associated with health conditions such as asthma and type 2 diabetes.

Multicohort assessment of plasma metabolic signatures of tuberculosis disease in children: a retrospective cross-sectional study

January 23, 2026
Scientific Reports

To supplement existing diagnostic tests, which have suboptimal accuracy or rely on difficult-to-collect samples, the authors analyzed blood plasma metabolomics in children with or without tuberculosis (TB). They present a nine-metabolite biomarker signature that shows moderate accuracy in identifying TB in children.

Genomic risk prediction of type 2 diabetes in people living with and without HIV

January 22, 2026
Scientific Reports

Using data from NIH- and NIAID-funded studies, the authors measured the accuracy of polygenic risk score models in predicting type 2 diabetes among groups with different ancestry and HIV status. They found that models incorporating multiple traits outperformed those considering a single trait, and that model performance was similar among those with and without HIV.

PURE-seq integrates FACS and PIP-seq for single-cell genomics of ultra-rare cells

January 21, 2026
Nature Communications

Pan and colleagues describe a novel sequencing method, PURE-seq, which enables enrichment and transcriptomic characterization of very rare cells at the single-cell level. They use this workflow to produce single-cell gene expression profiles of circulating tumor cells collected from patient blood.

The ratio of circulatory levels of sphingolipids to steroids predicts asthma exacerbations

January 19, 2026
Nature Communications

Using metabolomics and electronic medical records from three large asthma cohorts, the authors develop a predictive model of future asthma exacerbations. They show that ratios of sphingolipids to steroids in the blood predict future asthma exacerbations more accurately than current clinical measures.

Multi-omics analysis of a pig-to-human decedent kidney xenotransplant

January 16, 2026
Nature

Despite efforts to improve molecular compatibility, organ transplants from other species, such as pigs, can still trigger immune reactions that result in transplant failure. Here the authors use multi-omics profiling to characterize the immune response to xenotransplantation and identify potential targets for improving transplant success.

CAMP: a modular metagenomics analysis system for integrated multistep data exploration

January 16, 2026
NAR Genomics and Bioinformatics

The authors present CAMP, Core Analysis Modular Pipeline, a modular workflow for performing metagenomic analyses. The pipeline consists of modular components enabling flexibility and analysis of intermediate files, along with semi-automated visualization of results.

TCR2HLA: Calibrated inference of HLA genotypes from TCR repertoires enables identification of immunologically relevant metaclonotypes

January 16, 2026
PLoS Computational Biology

The authors present an open-source tool, TCR2HLA, that infers human leukocyte antigen (HLA) genotype from T cell receptor (TCR) sequences. The authors use TCR2HLA to identify TCRs associated with inferred HLA genotype and SARS-CoV-2 exposure status.

SEA CDM: Study-Experiment-Assay Common Data Model and Databases for Cross-Domain Data Integration and Analysis

January 14, 2026
Scientific Data

To foster sharing and integration of various biomedical data types across experiments, the authors developed an ontology-supported Study-Experiment-Assay (SEA) common data model (CDM). They further present the Ontology-based SEA Network (OSEAN) relational database and knowledge graph and show how a large number of studies from various sources can be represented and utilized.

Metabolomic profiling reveals the potential of fatty acids as regulators of exhausted CD8 T cells during chronic viral infection

January 6, 2026
PNAS

This study aimed to characterize the metabolic environment involved in the CD8 T cell exhaustion that occurs during chronic infections. The authors found that levels of fatty acids increased early in chronic infections and that administration of fatty acids late in chronic infections favored stem-like CD8 T cells.

Omics in Nonsteroidal Anti-Inflammatory Drugs-Exacerbated Respiratory Disease: Current Evidence From the Upper and Lower Airways

January 3, 2026
Allergy

This paper reviews studies applying omics technologies to nonsteroidal anti-inflammatory drug (NSAID)-exacerbated respiratory disease (N-ERD) in either the upper or lower respiratory tract. The authors propose that future work use multi-omics techniques, experimental standardization, and characterization of both respiratory tracts in the same patients.

EXPLANA: a user-friendly workflow for EXPLoratory ANAlysis and feature selection in cross-sectional and longitudinal microbiome studies

January 2, 2026
Bioinformatics

Fouquier and co-authors have developed a feature selection workflow using machine learning methods to identify meaningful variables associated with specified outcomes from longitudinal microbiome data. The tool, available on GitHub, supports both categorical and numerical data and generates an interactive report of the results.

The NIAID Discovery Portal: a unified search engine for infectious and immune-mediated disease datasets

December 31, 2025
mSystems

Datasets from infectious and immune-mediated disease (IID) studies are often stored across various repositories, each with different metadata schemas and search capabilities. The NIAID Data Ecosystem Discovery Portal aims to provide users with a centralized location to easily find and access IID datasets using intuitive searches and filters.

MTHFR allele and one-carbon metabolic profile predict severity of COVID-19

December 23, 2025
PNAS

Using samples from the IMmunoPhenotyping Assessments in a COVID-19 Cohort (IMPACC) study, the authors found that changes in one-carbon metabolism were predictive of disease severity. Further, the authors show that the genetic status of a key gene involved in methionine synthesis, together with early alterations in one-carbon metabolism, was predictive of both disease severity and risk of developing long COVID.

Quantifying viral pandemic potential from experimental transmission studies

December 17, 2025
PLoS Computational Biology

Current methods of estimating pandemic risk from viruses identified in animals are limited, due in part to the high cost and low resolution of animal transmission experiments. Somsen et al. developed a model to assess transmission and epidemiological components of pandemic risk based on viral titer data from infected animals.

A resource to empirically establish drug exposure records directly from untargeted metabolomics data

December 9, 2025
Nature Communications

To aid untargeted metabolomic studies, which can enable direct assessment of drug exposure from samples, the authors developed the Global Natural Product Social Molecular Networking (GNPS) Drug Library. This resource contains tandem mass spectrometry reference spectra for drugs and corresponding metabolites along with standardized metadata about the drugs, including therapeutic use and mechanism of action.

Assessing AI’s cognitive abilities for scientific discovery in the field of systems vaccinology

December 5, 2025
Science Immunology

Using immunological case studies, the authors assessed the ability of five large language models (LLMs) to accurately synthesize biological literature, formulate hypotheses, propose experiments to test the hypotheses, and provide broader significance to results. While the LLMs could accurately collect and synthesize existing information, they struggled to develop novel hypotheses and experiments.

Timely vaccine strain selection and genomic surveillance improve evolutionary forecast accuracy of seasonal influenza A/H3N2

December 4, 2025
eLife

Improvements in vaccine development time and in the lag from sample collection to sequencing results, both observed around the SARS-CoV-2 pandemic, may bring similar improvements to influenza vaccine development timelines. Here, the authors show that realistic decreases in forecasting time produce more accurate predictions of future viral sequences and that shorter sequencing turnarounds produce more accurate estimates of current clade frequencies.

HLAtools, Searching Shared HLA Amino Acid Residue Prevalence, and the Global Frequency Browsers: New Computational Resources for Working With HLA Data and Visualizing Global Patterns of HLA Variation

December 3, 2025
International Journal of Immunogenetics

Genomic diversity at the HLA region is known to play an important role in human disease, with over 41,000 known alleles found across the world. Here the authors present novel open-source tools for querying, analyzing, and visualizing HLA variant distributions across populations.

Peanut allergy oral immunotherapy drives single-cell multi-omic changes in peanut-reactive T cells associated with sustained unresponsiveness

December 3, 2025
Nature Immunology

To better understand how oral immunotherapy can establish continued unresponsiveness to peanut allergens, researchers analyzed single-cell multi-omics data from the POISED clinical trial cohort. They identified numerous changes among T cells that correlated with a continued lack of sensitivity to peanut allergens after stopping oral immunotherapy.

DeepRNA-Reg: a deep-learning based approach for comparative analysis of CLIP experiments

December 3, 2025
RNA Biology

To support analysis of crosslinking immunoprecipitation (CLIP) data, the authors designed an algorithm that uses deep learning to predict differentially enriched binding sites between datasets. The authors show that their algorithm produces more accurate predictions across a variety of settings, including microRNA regulation of T helper 2 cells.

HIV Pharmacology Data Repository: Setting the New Information-Sharing Standard for Clinical and Preclinical Pharmacokinetic Studies

December 3, 2025
Clinical Pharmacology and Therapeutics

The authors propose minimal information standards for pharmacokinetic data to better enable data sharing and reuse. Using these standards, the authors integrated data from existing studies into a new public database, the HIV Pharmacology Data Repository.

STREAMS guidelines: standards for technical reporting in environmental and host-associated microbiome studies

December 1, 2025
Nature Microbiology

Synthesizing input from over 200 researchers, the authors provide detailed guidelines for researchers reporting on environmental and non-human host-associated microbiome studies. The guidelines aim to promote FAIR data principles and will be maintained and updated.

Virus taxonomy: the database of the International Committee on Taxonomy of Viruses

November 26, 2025
Nucleic Acids Research

The International Committee on Taxonomy of Viruses (ICTV) develops and maintains viral taxonomy as well as a public database for data access and analysis. This report describes the recent improvements made to the ICTV resources, including new tools for taxonomic analysis and visualization.

Inferring asymptomatic carriers of antimicrobial-resistant organisms in hospitals using genomic, microbiological and patient mobility data

November 19, 2025
Nature Communications

Asymptomatic carriers of antimicrobial-resistant organisms can spread these pathogens within the healthcare system, but because they lack symptoms they have been hard to identify or predict. Here, researchers integrated multiple data types, including genomics and patient behavior, into a model that is better able to predict asymptomatic carriers.