Data Science Publications

The NIAID Office of Data Science and Emerging Technologies (ODSET) highlights publications that feature innovative uses of data science and bioinformatics in infectious, immune-mediated, and allergic disease research.

Explore NIAID data science publications on PubMed:

If you would like to feature a publication on this page, please contact datascience@niaid.nih.gov. Publications should feature research related to infectious, immunologic, and allergic diseases; include data science or a related discipline; and cite NIAID funding in the manuscript. Please include in your email:

  • The title of your published article.
  • A link to the article.
  • A 50-60 word description of the article. 

Applications of Machine Learning on Electronic Health Record Data to Combat Antibiotic Resistance

July 12, 2024
The Journal of Infectious Diseases

Advancements in computing and the accessibility of machine learning (ML) frameworks enable researchers to easily train predictive models using electronic health record data. The authors provide a primer on ML and approaches to address common challenges, and review the use of electronic health record data to construct ML models for predicting pathogen carriage or infection, optimizing empiric therapy, and aiding antimicrobial stewardship tasks.

Biomedical Data Repository Concepts and Management Principles

June 13, 2024
Nature

This paper explores the pivotal role of data repositories in biomedical research and open science, emphasizing their importance in managing, preserving, and sharing research data. Its objective is to familiarize readers with the functions of data repositories, set expectations for their services, and provide an overview of methods to evaluate their capabilities.

Electronic health record signatures identify undiagnosed patients with common variable immunodeficiency disease

May 1, 2024
Science Translational Medicine

Common variable immunodeficiency disease (CVID) is an inborn error of immunity characterized by antibody deficiency and impaired B cell responses. This condition can be difficult to diagnose on account of its heterogeneous presentation. Johnson et al. developed and validated a machine learning model designed to parse patient electronic health record data and rank individuals according to their likelihood of having CVID. Their retrospective analysis suggested that the method could help diagnose many individuals earlier than standard clinical methods.

Lack of association between classical HLA genes and asymptomatic SARS-CoV-2 infection

April 26, 2024
Human Genetics and Genomics Advances

This study shows that HLA alleles, including HLA-B*15:01, are not associated with asymptomatic SARS-CoV-2 infection in two independent cohorts. These results refute previous reports. HLA alleles do not strongly influence the phenotypic outcome during the acute phase of SARS-CoV-2 infection.

Many purported pseudogenes in bacterial genomes are bona fide genes

April 15, 2024
BMC Genomics

Microbial genomes are composed largely of protein-coding sequences, yet some genomes contain many pseudogenes caused by frameshifts or internal stop codons. These pseudogenes are believed to result from gene degradation during evolution but could also be technical artifacts of genome sequencing or assembly. Using a combination of observational and experimental data, we show that many putative pseudogenes are attributable to errors that are incorporated into genomes during assembly. Collectively, these results demonstrate that many pseudogenes in microbial genome assemblies are actually genes.

Accurately clustering biological sequences in linear time by relatedness sorting

April 8, 2024
Nature Communications

Accurately clustering biological sequences is an increasingly important task but is challenging for large datasets. This study introduces a new approach called ‘relatedness sorting’ to accurately cluster sequences with linear-time scalability.

A compendium of multi-omics data illuminating host responses to lethal human virus infections

April 2, 2024
Nature

Human infections caused by viral pathogens trigger a complex gamut of host responses. In this study, the research team presented experimental methods and multi-omics data capture approaches representing the global host response to infection generated from 45 individual experiments involving human viruses from the Orthomyxoviridae, Filoviridae, Flaviviridae, and Coronaviridae families.

Advancing the scale of synthetic biology via cross-species transfer of cellular functions enabled by iModulon engraftment

March 15, 2024
Nature

iModulons, obtained through big data analysis of transcriptome compendia, describe sets of co-expressed genes that constitute independent cellular functions, suggesting that multigenic traits can be captured and transferred. Here the researchers demonstrate that this is possible through cross-species transfer of cellular functions from Pseudomonas species into E. coli.

Viral afterlife: SARS-CoV-2 as a reservoir of immunomimetic peptides that reassemble into proinflammatory supramolecular complexes

February 2, 2024
PNAS

This study shows evidence that viral peptide fragments from SARS-CoV-2, but not harmless coronavirus homologs, can “reassemble” with dsRNA into a form of proinflammatory nanocrystalline condensed matter, resulting in cooperative, multivalent immune recognition and grossly amplified inflammatory responses.

The Potential Epidemiologic, Clinical, and Economic Value of a Universal Coronavirus Vaccine: A Modelling Study

January 11, 2024
The Lancet

The researchers use a computational model representing the United States (U.S.) population, the spread of SARS-CoV-2, and the clinical and economic outcomes of COVID-19, such as hospitalisations, deaths, quality-adjusted life years (QALYs) lost, productivity losses, direct medical costs, and total societal costs. With it, they explore the impact of a universal coronavirus vaccine under different circumstances.

Novel machine-learning analysis of SARS-CoV-2 infection in a subclinical nonhuman primate model using radiomics and blood biomarkers

November 10, 2023
Nature

This study uses radiomics (derived from computed tomography images) and blood biomarkers to predict SARS-CoV-2 infection in a nonhuman primate (NHP) model with inapparent clinical disease. The researchers built machine-learning models using baseline-normalized radiomic and blood sample analysis data from SARS-CoV-2-exposed and control crab-eating macaques.

Genetically diverse mouse models of SARS-CoV-2 infection reproduce clinical variation in type I interferon and cytokine responses in COVID-19

July 25, 2023
Nature

Dynamics of type I interferon (IFN) following infection with SARS-CoV-2 are critical in determining disease severity in humans but have been difficult to model in mice. Here, infection of genetically diverse mice reveals how delayed or immediate IFN signaling coordinates antiviral immunity.

Tracking B cell responses to the SARS-CoV-2 mRNA-1273 vaccine

July 25, 2023
Cell Reports

Using multiomic single-cell analyses, the authors show a coordinated trajectory involving plasmablasts and activated and resting memory B cells in response to primary SARS-CoV-2 mRNA vaccination. Spike-specific BCR repertoire analysis shows incremental affinity maturation across the 6-month study period and reveals evidence of convergence among study participants and other cohorts.

Co-expression of Foxp3 and Helios facilitates the identification of human T regulatory cells in health and disease

June 7, 2023
Frontiers in Immunology

In vivo studies in humanized mice and in patients demonstrate that Foxp3 expression is not upregulated in human CD4+ T conventional cells when activated under a variety of inflammatory conditions. Thus, Foxp3 expression alone can be used as a marker for bona fide T regulatory cells in vivo. The combination of Foxp3 and Helios should be mandatory for quantification of Treg that have been expanded in vitro for use in cellular biotherapy or for production of CAR-Treg.

Variable Selection for High-Dimensional Nodal Attributes in Social Networks with Degree Heterogeneity

April 13, 2023
Journal of the American Statistical Association

Researchers considered a class of network models, in which the connection probability depends on ultrahigh-dimensional nodal covariates (homophily) and node-specific popularity (degree heterogeneity). A Bayesian method is proposed to select nodal features in both dense and sparse networks under a mild assumption on popularity parameters. The proposed approach is implemented via Gibbs sampling.

Developing a standardized but extendable framework to increase the findability of infectious disease datasets

February 23, 2023
Scientific Data

Biomedical datasets are increasing in size, are stored in many repositories, and face challenges in FAIRness (findability, accessibility, interoperability, reusability). To improve the FAIRness of their datasets and computational tools, authors representing infectious disease researchers from 15 centers evaluated metadata standards across established biomedical data repositories, created a reusable metadata schema based on Schema.org, and catalogued nearly 400 datasets and computational tools.

The Immune Signatures data resource, a compendium of systems vaccinology datasets

October 20, 2022
Nature

The NIH/NIAID Human Immunology Project Consortium (HIPC) has leveraged systems immunology approaches to identify molecular signatures associated with the immunogenicity of many vaccines. To support comparative analyses across different vaccines, the authors created the Immune Signatures Data Resource, a compendium of standardized systems vaccinology datasets.