Data Science Publications

The NIAID Office of Data Science and Emerging Technologies (ODSET) highlights publications that feature innovative uses of data science and bioinformatics in infectious, immune-mediated and allergic disease research.

Explore NIAID data science publications on PubMed:

If you would like to feature a publication on this page, please contact datascience@niaid.nih.gov. Publications should feature research related to infectious, immunologic, and allergic diseases; include data science or a related discipline; and cite NIAID funding in the manuscript. Please include in your email:

  • The title of your published article.
  • A link to the article.
  • A 50-60 word description of the article. 

73 Results

Genomic analysis of progenitors in viral infection implicates glucocorticoids as suppressors of plasmacytoid dendritic cell generation

April 28, 2025
Proc Natl Acad Sci U S A

Using multi-omics analysis, the authors show that viral infection pushes dendritic cell progenitors away from plasmacytoid dendritic cell development and that this shift appears to be influenced by glucocorticoids. Virally infected mice lacking adrenal glands had significantly lower levels of plasmacytoid dendritic cells than did infected mice with adrenal glands and the corresponding glucocorticoids.

EvoWeaver: large-scale prediction of gene functional associations from coevolutionary signals

April 24, 2025
Nat Commun

This paper describes EvoWeaver, a new tool for quantifying the degree of shared evolution between genes with the goal of predicting protein function using genomic sequences. The tool combines and builds upon existing coevolutionary analysis approaches to enable use with large data sets and is available in the SynExtend R package.

Identification of microbial species and proteins associated with colorectal cancer by reanalyzing CPTAC proteomic datasets

April 22, 2025
Sci Rep

This study reanalyzed CPTAC colorectal cancer proteomics data using a custom metaproteomics pipeline to identify microbial species and proteins differentially abundant in tumor and normal tissues. It shows that microbial features, particularly from multiplexed data, can support tumor classification and may complement human biomarkers in colorectal cancer research.

A spectral framework to map QTLs affecting joint differential networks of gene co-expression

April 17, 2025
PLoS Comput Biol

The authors developed a statistical framework called spectral network QTL (snQTL) to identify genetic loci that influence gene co-expression networks using spectral tensor decomposition. Applying this method to three-spined stickleback data, they identified genetic regions that affect global co-expression patterns, revealing associations that would be missed by traditional eQTL methods and offering new insights into complex genotype and phenotype relationships.

Bacterial pathogen deploys the iminosugar glycosyrin to manipulate plant glycobiology

April 17, 2025
Science

This article describes glycosyrin, an iminosugar produced by Pseudomonas syringae that inhibits plant β-galactosidases to suppress immune responses. The authors uncovered its structure, biosynthetic pathway, and impact on host glycobiology, highlighting its role as a conserved bacterial virulence factor.

Partially characterized topology guides reliable anchor-free scRNA-integration

April 4, 2025
Communications Biology

The tool, scCRAFT, enables reliable single-cell RNA-seq integration by preserving high-confidence within-batch cell-to-cell topology through a dual-resolution triplet loss.

Why the growth of arboviral diseases necessitates a new generation of global risk maps and future projections

April 4, 2025
PLOS Computational Biology

Authors describe how current approaches to mapping arboviral diseases have become unnecessarily siloed, ignoring the strengths and weaknesses of different data types and methods. This places limits on data and model output comparability. Authors propose a new generation of risk mapping models that jointly infer risk from multiple data types.

Quantitative characterization of tissue states using multiomics and ecological spatial analysis

April 1, 2025
Nature Genetics

Multiomics and ecological spatial analysis (MESA) calculates ecodiversity-inspired metrics in spatially resolved omics integrated with single-cell data, enabling the quantitative comparison of tissue states across a range of conditions.

Putting computational models of immunity to the test—An invited challenge to predict B.pertussis vaccination responses

March 31, 2025
PLOS Computational Biology

Systems vaccinology studies have been used to build computational models that predict individual vaccine responses and identify the factors contributing to differences in outcome. Comparing such models is challenging due to variability in study designs. To address this, authors established a community resource to compare models predicting B. pertussis booster responses and generate experimental data for the explicit purpose of model evaluation.

Automatic detection and extraction of key resources from tables in biomedical papers

March 20, 2025
BioData Mining

Authors introduce four end-to-end table extraction pipelines to extract and faithfully reconstruct key resource tables from biomedical papers in PDF format. The pipelines employ machine learning approaches for key resource table page identification, “Table Transformer” models for table detection, and table structure recognition. Authors also introduce a character-level generative pre-trained transformer (GPT) language model for scientific tables pre-trained on over 11 million scientific tables. 

kir-mapper: A Toolkit for Killer-Cell Immunoglobulin-Like Receptor (KIR) Genotyping From Short-Read Second-Generation Sequencing Data

March 17, 2025
HLA Immune Response Genetics

Authors present kir-mapper, a toolkit to analyse killer cell immunoglobulin-like receptor (KIR) genes from short-read sequencing, focusing on detecting KIR alleles, copy number variation, as well as SNPs and InDels in the context of the hg38 reference genome. kir-mapper can be used with whole-genome sequencing (WGS), whole-exome sequencing (WES) and sequencing data generated after probe-based capture methods.

TRain: T-cell receptor automated immunoinformatics

March 6, 2025
BMC Bioinformatics

Authors introduced an open-source Python tool that streamlines going from full T-cell receptor (TCR) sequence information to predicted 3D structures of TCR-peptide-major histocompatibility complex (TCR-pMHC) complexes, using well-established tools. Analyzing these predicted complexes can provide deeper insights into the binding properties of TCRs and can help shed light on one of the key steps in adaptive immune responses.

TamL is a Key Player of the Outer Membrane Homeostasis in Bacteroidota

March 5, 2025
J Mol Biol

The authors examined the role of the Translocation and Assembly Module (TAM) subunit TamL in Bacteroidota species. TamL, along with its interacting subunit TamB, was found to be essential in two species, including Capnocytophaga canimorsus, a human pathogen responsible for infections following a dog or cat bite or lick.

Revisiting the Plasmodium falciparum druggable genome using predicted structures and data mining

March 4, 2025
npj Drug Discovery

Leveraging recent advances in protein structure prediction, authors systematically assessed the Plasmodium falciparum genome, with review eventually yielding 27 high-priority antimalarial target candidates. This study also provides a genome-wide data resource for P. falciparum and implements a generalizable framework for systematically evaluating and prioritizing novel pathogenic disease targets.

Precise mycobacterial species and subspecies identification using the PEP-TORCH peptidome algorithm

March 4, 2025
EMBO Molecular Medicine

This study introduces the PEP-TORCH Peptidome Algorithm, an innovative LC-MS/MS-based approach for the accurate, rapid, and comprehensive identification of mycobacterial species and subspecies, including co-infections, directly from liquid culture samples.

Viral genomic features predict Orthopoxvirus reservoir hosts

February 26, 2025
Communications Biology

Authors applied machine learning models incorporating both host ecological and viral genomic features to predict likely reservoirs of orthopoxviruses (OPVs). Authors demonstrated that incorporating viral genomic features in addition to host ecological traits enhanced the accuracy of potential OPV host predictions, highlighting the importance of host-virus molecular interactions in predicting potential host species. Authors also identified geographic hotspots rich in potential OPV hosts.

VaxBot-HPV: a GPT-based chatbot for answering HPV vaccine-related questions

February 19, 2025
JAMIA Open

The human papillomavirus (HPV) vaccine is an effective measure to prevent and control the diseases caused by HPV. However, widespread misinformation and vaccine hesitancy remain significant barriers to its uptake. This study focuses on the development of VaxBot-HPV, a chatbot aimed at improving health literacy and promoting vaccination uptake by providing information and answering questions about the HPV vaccine.

A comparative study of antibiotic resistance patterns in Mycobacterium tuberculosis

February 11, 2025
Scientific Reports

This study leverages the Bacterial and Viral Bioinformatics Resource Center (BV-BRC) to analyze over 27,000 Mycobacterium tuberculosis (MTB) genomes, providing a comprehensive, large-scale overview of antimicrobial resistance (AMR) prevalence and resistance patterns. Authors used MTB++, which is the newest and most comprehensive AI-based MTB drug resistance profiler tool, to predict the resistance profile of each of the 27,000 MTB isolates and then used feature analysis to identify key genes associated with resistance.

Conditional similarity triplets enable covariate-informed representations of single-cell data

February 9, 2025
BMC Bioinformatics

Authors introduce a novel approach for incorporating measured covariates in optimizing model parameters to ultimately specify per-sample encodings that accurately reflect both immune signatures and additional clinical information.

Challenging a paradigm: Staggered versus single-pulse mass dog vaccination strategy for rabies elimination

February 7, 2025
PLOS Computational Biology

Authors constructed a stochastic metapopulation model to examine how the timing of pulsed vaccination campaigns across patches can affect metapopulation dynamics. They explored general metapopulation dynamics for pulsed vaccinations, parameterized the model for canine rabies in Arequipa, Peru, and simulated how the timing of the planned vaccination campaign, staggered over 6 months versus a single yearly pulse, affected the prospects for regional rabies elimination.

Systematic collection, annotation, and pattern analysis of viral vaccines in the VIOLIN vaccine knowledgebase

February 7, 2025
Frontiers in Cellular and Infection Microbiology

To better understand and design viral vaccines, it is critical to systematically collect, annotate, and analyse various viral vaccines and identify enriched patterns from these viral vaccines. Authors systematically collected experimentally verified viral vaccines from the literature, manually annotated them, and stored the information in the VIOLIN vaccine database. Enriched patterns were identified from systematic analysis of the viral vaccines and vaccine antigens.

zol and fai: large-scale targeted detection and evolutionary investigation of gene clusters

February 5, 2025
Nucleic Acids Research

Many universally and conditionally important genes are genomically aggregated within clusters. Here, the authors introduce fai and zol, which together enable large-scale comparative analysis of different types of gene clusters and mobile genetic elements, such as biosynthetic gene clusters (BGCs) or viruses. Fundamentally, they overcome a current bottleneck in reliably performing comprehensive orthology inference at large scale across broad taxonomic contexts and thousands of genomes.

Multi-strain phage induced clearance of bacterial infections

February 4, 2025
PLOS Computational Biology

Authors combine theory and computational models of in vivo phage therapy to study the efficacy of a phage cocktail, composed of two complementary phages motivated by the example of Pseudomonas aeruginosa facing two phages that exploit different surface receptors, LUZ19v and PAK_P1.

Leveraging public AI tools to explore systems biology resources in mathematical modeling

February 4, 2025
npj Systems Biology and Applications

Authors investigated the usage of public Artificial Intelligence (AI) tools in exploring systems biology resources in mathematical modeling. They tested public AI’s understanding of mathematics in models, related systems biology data, and the complexity of model structures.

RSero: A user-friendly R package to reconstruct pathogen circulation history from seroprevalence studies

February 3, 2025
PLOS Computational Biology

The authors introduce an R package, Rsero, that implements a series of serocatalytic models and estimates the force of infection (FOI, i.e., the rate at which susceptible individuals become infected) from age-stratified seroprevalence data using Bayesian methods. The package also contains a series of features to perform model comparison and visualise model fit.