Precise mycobacterial species and subspecies identification using the PEP-TORCH peptidome algorithm

This study introduces the PEP-TORCH Peptidome Algorithm, an innovative LC-MS/MS-based approach for the accurate, rapid, and comprehensive identification of mycobacterial species and subspecies, including co-infections, directly from liquid culture samples.

Link Type

Data-Sci-Publications

Publish or Event Date

Tue, 03/04/2025 - 12:00pm

Link URL

https://pmc.ncbi.nlm.nih.gov/articles/PMC11982334/

Content Coordinator

Lisa Mayer

Content Manager

Reed Shabman

Publication Source

EMBO Molecular Medicine

kir-mapper: A Toolkit for Killer-Cell Immunoglobulin-Like Receptor (KIR) Genotyping From Short-Read Second-Generation Sequencing Data

Authors present kir-mapper, a toolkit to analyse killer cell immunoglobulin-like receptor (KIR) genes from short-read sequencing, focusing on detecting KIR alleles, copy number variation, as well as SNPs and InDels in the context of the hg38 reference genome. kir-mapper can be used with whole-genome sequencing (WGS), whole-exome sequencing (WES) and sequencing data generated after probe-based capture methods.

Link Type

Data-Sci-Publications

Publish or Event Date

Mon, 03/17/2025 - 12:00pm

Link URL

https://doi.org/10.1111/tan.70092

Content Coordinator

Lisa Mayer

Content Manager

Reed Shabman

Publication Source

HLA Immune Response Genetics

Automatic detection and extraction of key resources from tables in biomedical papers

Authors introduce four end-to-end table extraction pipelines to extract and faithfully reconstruct key resource tables from biomedical papers in PDF format. The pipelines employ machine learning approaches for key resource table page identification, “Table Transformer” models for table detection, and table structure recognition. Authors also introduce a character-level generative pre-trained transformer (GPT) language model for scientific tables pre-trained on over 11 million scientific tables.

Link Type

Data-Sci-Publications

Publish or Event Date

Thu, 03/20/2025 - 12:00pm

Link URL

https://pmc.ncbi.nlm.nih.gov/articles/PMC11924859/

Content Coordinator

Lisa Mayer

Content Manager

Reed Shabman

Publication Source

BioData Mining

Putting computational models of immunity to the test—An invited challenge to predict B.pertussis vaccination responses

Systems vaccinology studies have been used to build computational models that predict individual vaccine responses and identify the factors contributing to differences in outcome. Comparing such models is challenging due to variability in study designs. To address this, authors established a community resource to compare models predicting B. pertussis booster responses and generate experimental data for the explicit purpose of model evaluation.

Link Type

Data-Sci-Publications

Publish or Event Date

Mon, 03/31/2025 - 12:00pm

Link URL

https://pmc.ncbi.nlm.nih.gov/articles/PMC11978014/

Content Coordinator

Lisa Mayer

Content Manager

Reed Shabman

Publication Source

PLOS Computational Biology

Quantitative characterization of tissue states using multiomics and ecological spatial analysis

Multiomics and ecological spatial analysis (MESA) calculates ecodiversity-inspired metrics in spatially resolved omics integrated with single-cell data, enabling the quantitative comparison of tissue states across a range of conditions.

Link Type

Data-Sci-Publications

Publish or Event Date

Tue, 04/01/2025 - 12:00pm

Link URL

https://pmc.ncbi.nlm.nih.gov/articles/PMC11985343/

Content Coordinator

Lisa Mayer

Content Manager

Reed Shabman

Publication Source

Nature Genetics

Why the growth of arboviral diseases necessitates a new generation of global risk maps and future projections

Authors describe how current approaches to mapping arboviral diseases have become unnecessarily siloed, ignoring the strengths and weaknesses of different data types and methods. This places limits on data and model output comparability. Authors propose a new generation of risk mapping models that jointly infer risk from multiple data types.

Link Type

Data-Sci-Publications

Publish or Event Date

Fri, 04/04/2025 - 12:00pm

Link URL

https://pmc.ncbi.nlm.nih.gov/articles/PMC11970912/

Content Coordinator

Lisa Mayer

Content Manager

Reed Shabman

Publication Source

PLOS Computational Biology

Partially characterized topology guides reliable anchor-free scRNA-integration

The tool, scCRAFT, enables reliable single-cell RNA-seq integration by preserving high-confidence within-batch cell-to-cell topology through a dual-resolution triplet loss.

Link Type

Data-Sci-Publications

Publish or Event Date

Fri, 04/04/2025 - 12:00pm

Link URL

https://pmc.ncbi.nlm.nih.gov/articles/PMC11971424/

Content Coordinator

Lisa Mayer

Content Manager

Reed Shabman

Publication Source

Communications Biology

Using the NIAID Data Ecosystem Discovery Portal to Search Across Data Repositories

Data Science Dispatch | April 14, 2025

NIAID has developed a platform to help researchers find data related to infectious and immune-mediated disease (IID) across multiple data repositories. The NIAID Data Ecosystem Discovery Portal is a centralized hub cataloging millions of datasets from over 50 sources.

Researchers can use the Discovery Portal to find data, resources, and computational tools from different repositories. This can save them time otherwise spent combing through multiple sources and help them find datasets they weren’t aware of previously.

The Discovery Portal includes resources from IID and generalist repositories. Representative resources include NIAID-sponsored repositories such as AccessClinicalData@NIAID, ImmPort, and VDJServer, as well as repositories funded outside of NIAID but relevant to IID research. Resources in the Discovery Portal include a diverse array of data types spanning multiple domains of IID research, including -omics data, clinical data, epidemiological data, pathogen-host interaction data, flow cytometry, imaging, and other experimental data.

The Discovery Portal supports NIAID objectives of maximizing the impact of scientific data, reducing duplication of efforts in research, and promoting data reuse, data transparency and compliance with data-sharing policies. The portal aligns with many of the principles of findable, accessible, interoperable, and reusable (FAIR) data practices by making data easier to find and access.

Using metadata to drive discovery

The NIAID Data Ecosystem Discovery Portal does not contain data itself. Instead, it contains detailed information about IID datasets and resources drawn from metadata. Users can then access the resources through external links.

The portal uses metadata to support several key features:

Search and Discovery: Users can rapidly search millions of datasets across both IID and generalist repositories using the Search or Advanced Search options. Metadata categories such as funding source, repository, and conditions of access help filter search results and identify relevant research data.
Metadata Compatibility: Each individual dataset in the Discovery Portal has a “metadata compatibility score,” which displays specific metadata elements collected for a given resource. Additionally, the Discovery Portal has metadata compatibility visualizations which capture the breadth of metadata at the repository level. This information can help researchers and data contributors quickly understand a repository’s metadata structure, aiding in decisions about where to deposit or retrieve resources.
Downloadable Metadata: The portal has buttons that allow users to download metadata to perform meta-analyses.
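
As a rough illustration of how metadata fields can drive filtering and a completeness-style score, the sketch below works on a small in-memory list of records. The record contents, the `EXPECTED_FIELDS` list, and the scoring rule (the fraction of expected fields that are filled) are assumptions made for this example only. They are not the Discovery Portal's actual schema, API, or metadata compatibility score.

```python
# Hypothetical metadata fields, chosen only for this illustration.
EXPECTED_FIELDS = [
    "name", "repository", "funding_source",
    "conditions_of_access", "pathogen_species", "host_species",
]

# Hypothetical records; these are not real portal entries.
RECORDS = [
    {"name": "Example flow cytometry study", "repository": "ImmPort",
     "funding_source": "NIAID", "conditions_of_access": "registered",
     "pathogen_species": None, "host_species": "Homo sapiens"},
    {"name": "Example repertoire dataset", "repository": "VDJServer",
     "funding_source": "NIAID", "conditions_of_access": "open",
     "pathogen_species": "", "host_species": None},
]


def filter_records(records, **criteria):
    """Return the records whose metadata match every field=value pair."""
    return [
        r for r in records
        if all(r.get(key) == value for key, value in criteria.items())
    ]


def completeness(record, expected=EXPECTED_FIELDS):
    """Return the fraction of expected fields that hold a non-empty value."""
    filled = sum(1 for f in expected if record.get(f) not in (None, ""))
    return filled / len(expected)


if __name__ == "__main__":
    for rec in filter_records(RECORDS, funding_source="NIAID"):
        print(rec["name"], round(completeness(rec), 2))
```

Filtering on several fields at once, for example `filter_records(RECORDS, repository="ImmPort", conditions_of_access="registered")`, narrows the results in the same way that combining filters does in a search interface.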

Work is under way to fill missing or incomplete metadata fields in the Discovery Portal (such as Pathogen Species, Health Condition, and Host Species) by augmenting and standardizing metadata, so that more of this information is available to users.

New Program Collection tool and other features

One of the new features of the NIAID Data Ecosystem Discovery Portal is the “Program Collection” filter. Program Collections are groups of datasets contributed by specialized NIAID research programs and initiatives. The Discovery Portal displays the Program Collection filter on the search page, and current efforts are focused on expanding Program Collection data.

The Program Collection filter allows researchers to discover high-quality, program-specific data relevant to their area of interest and find collections that align with the broader objectives of NIAID’s strategic research efforts. The feature also amplifies the scientific contributions of participating networks and increases the likelihood of researchers using these datasets.

The Sources page of the Discovery Portal can also help researchers and data providers make informed decisions about which repository to deposit their data in.

The Discovery Portal is now connected to National Center for Biotechnology Information (NCBI) databases through NCBI LinkOut. When NCBI database content is linked to data described in the Portal, a link to the related Portal entry can be found on the NCBI page.

Learn more by visiting the Discovery Portal, reviewing the Getting Started page, and exploring the Knowledge Center.

Understanding Metadata: A Key to Data Sharing and Reuse

Data Science Dispatch | March 14, 2025

Metadata plays a crucial role in sharing and reusing scientific data. Understanding what metadata is and how it is used can accelerate your research and increase the visibility of your work. It can also help to advance the field of infectious and immune-mediated disease (IID) research.

What is metadata?

Metadata is data about data. It provides additional information to help people understand the data, such as its origin, structure, and context.

For example, for a genome sequence, the data is the actual sequence of nucleotides. The metadata includes the author of the data, the date the data was collected, the measurement techniques used, the health condition that is the focus of the dataset (like asthma or autoimmune diseases), and more. Another example of data versus metadata appears in a data management and sharing webinar from the National Institute of Diabetes and Digestive and Kidney Diseases (4:22-6:28).
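
The distinction between data and metadata can be sketched in a few lines of Python. Everything in this example is made up for illustration: the short sequence, the author name, the date, and the field names are placeholders, not a recommended schema.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """Keeps the data and the metadata that describes it side by side."""
    data: str                                     # the nucleotide sequence itself
    metadata: dict = field(default_factory=dict)  # information about the data


# Hypothetical example values, for illustration only.
example = Dataset(
    data="ATGCGTACGTTAGC",
    metadata={
        "author": "A. Researcher",
        "date_collected": "2025-01-15",
        "measurement_technique": "short-read sequencing",
        "health_condition": "asthma",
    },
)


def gc_content(sequence: str) -> float:
    """Return the fraction of G and C bases; this operates on the data only."""
    return sum(base in "GC" for base in sequence) / len(sequence)


if __name__ == "__main__":
    print("GC content (computed from data):", gc_content(example.data))
    print("Health condition (read from metadata):",
          example.metadata["health_condition"])
```

An analysis such as `gc_content` runs on the data, while questions like who collected the sequence, when, and for which condition are answered from the metadata alone.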

Examples of common metadata elements that describe IID research data are available at the NIAID Data Ecosystem’s list of common fundamental and recommended metadata elements.

Why is metadata important?

When you share scientific data, metadata provides the context that allows others to understand, trust, reproduce, or reuse data. This is particularly important in studies or secondary analyses where data is integrated from multiple sources; comprehensive metadata enables a scientist to combine data from different sources.

Using metadata effectively can also help your data get discovered, reused, and cited—thereby maximizing the value and impact of your research.

Collecting rich metadata during research

Effective metadata use starts with collecting rich metadata throughout the research process. “Rich” metadata is detailed and structured, making it easier for people to quickly learn about your data.

Using standardized formats and schemas makes it clear which metadata components are present and where they can be found. Common terminologies, ontologies, and data formats take this a step further by defining specific metadata elements for both people and computers. Machine-readable metadata lets users inspect and work with data programmatically, so they can quickly learn about many data files at once.

Some common examples of collecting metadata in a structured way include defining standardized date and time formats and using ORCID IDs for authors to ensure precise identification.
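
A minimal sketch of what checking these two kinds of structured values might look like is shown below. It assumes dates are recorded in ISO 8601 form (YYYY-MM-DD) and validates ORCID iDs with the ISO 7064 MOD 11-2 check digit that ORCID specifies. The function names are illustrative, and the iD in the usage lines is the sample iD that ORCID uses in its own documentation.

```python
import re
from datetime import date

# Four hyphen-separated groups of four characters; the last may be "X".
ORCID_PATTERN = re.compile(r"^\d{4}-\d{4}-\d{4}-\d{3}[\dX]$")


def orcid_check_digit(base_digits: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check character for 15 base digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid: str) -> bool:
    """Check both the format and the check digit of an ORCID iD."""
    if not ORCID_PATTERN.match(orcid):
        return False
    digits = orcid.replace("-", "")
    return orcid_check_digit(digits[:15]) == digits[15]


def is_iso_date(text: str) -> bool:
    """Return True if the text parses as an ISO 8601 calendar date."""
    try:
        date.fromisoformat(text)
        return True
    except ValueError:
        return False


if __name__ == "__main__":
    print(is_valid_orcid("0000-0002-1825-0097"))  # ORCID's documented sample iD
    print(is_iso_date("2025-03-14"))
    print(is_iso_date("03/14/2025"))
```

Running checks like these when metadata is first entered, rather than at submission time, catches mistyped identifiers and ambiguous date formats while they are still easy to correct.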

Biomedical researchers can follow some basic steps to ensure that they are collecting comprehensive and standardized metadata.

1. Determine necessary metadata content and formats

Collecting data in the format you intend to share it in is more efficient than reformatting everything at the end. Here are some questions to help you determine data and metadata formats:

Who will use these data and how will they use them? What information do they need to understand the data?
Many research areas have standardized metadata formats that researchers can follow. What metadata standards or schemas do other researchers in your field use? Would using these standards and schemas help researchers understand and reuse these data?
Does the target repository or scientific journal have any specific metadata or formatting requirements? If the repository where you plan to share your data has specific guidance, follow that guidance from the start of your research.

2. Create metadata throughout the data lifecycle

Before data collection, collect protocol documentation and set up systems for data and metadata collection. These systems can collect information using the standards, formats, vocabularies, and ontologies selected, and will save you time when preparing data and metadata for publication.

During the data collection phase, document anything that fits into the target metadata fields. These may include the dates data was collected, variables measured, the units of measurement, the instruments used, and the conditions under which the data was collected.

After data collection, add any remaining metadata elements from your plan. These elements may focus more on describing data processing steps, versioning, authors, or related topics.

3. Prepare to share data and metadata

Verify that your metadata meets the requirements of the repository or journal where you plan to share your data. Before sharing, add any elements that are often finalized late in the data lifecycle, such as associated publications, the license for reuse, or the data author list.
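
One way to make this verification step routine is a small pre-submission check that lists which required fields are still empty. The sketch below is illustrative only: `REQUIRED_FIELDS` is a hypothetical list, and a real repository or journal will publish its own requirements, which should replace it.

```python
# Hypothetical required fields; substitute the target repository's actual list.
REQUIRED_FIELDS = [
    "title", "authors", "date_collected",
    "measurement_technique", "license", "associated_publications",
]


def missing_fields(metadata: dict, required=REQUIRED_FIELDS) -> list:
    """Return the required fields that are absent or empty in the metadata."""
    return [name for name in required if not metadata.get(name)]


if __name__ == "__main__":
    # Hypothetical draft record with late-lifecycle fields not yet filled in.
    draft = {
        "title": "Example dataset",
        "authors": ["A. Researcher"],
        "date_collected": "2025-01-15",
        "measurement_technique": "flow cytometry",
        "license": "",
    }
    print("Still missing:", missing_fields(draft))
```

In this draft the license and associated publications are reported as missing, which matches the kind of elements that tend to be settled shortly before sharing.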

Throughout the process, you can seek guidance from your program officer or the repositories where you intend to share data to ensure that metadata is collected and shared effectively.

Sharing data and metadata

The NIH Data Management and Sharing Policy encourages sharing metadata that describes or supports your scientific data. NIH recommends data management and sharing practices consistent with the FAIR (Findable, Accessible, Interoperable, and Reusable) data principles, and it strongly encourages the use of established data repositories for preserving and sharing data.

In some instances, the full scientific data cannot be shared easily. This may be due to large file sizes — particularly with imaging-related research — or data privacy regulations. However, even if the actual scientific data cannot be shared, sharing metadata is still valuable. This practice ensures that there is a public record of the data's existence and provides important background information that can be used by other researchers.

Metadata is also a powerful tool for finding scientific data in repositories. Researchers can use metadata to search for datasets that match specific criteria. One tool that can help researchers find relevant data is the NIAID Data Ecosystem Discovery Portal, which uses the metadata that repositories hold about their datasets to search across more than 50 repositories and data sources relevant to IID research.

Learn more about developing a data management and sharing plan and compliance with relevant NIH data sharing policies by reviewing the Data Policy and Guidance page.

Powerful Sequencing Tool Helps Identify Infectious Diseases in Mali

NIAID Now | February 27, 2025

Two scientists are shown at the University of Sciences, Techniques, and Technologies of Bamako, Mali, performing testing on plasma specimens from patients.

An advanced diagnostic tool used in an observational clinical study in Bamako, Mali, helped identify viral infections in hospital patients that would normally have taken many traditional tests to detect. Scientists, led by the National Institute of Allergy and Infectious Diseases (NIAID), designed the study to help physicians identify the causes of unexplained fever in patients and to bring awareness to new technology in a resource-limited region.

Because malaria is the most common fever-causing illness in rural sub-Saharan Africa, most medical workers in the region presume patients with a fever have malaria. But recent NIAID work has identified dengue, Zika and chikungunya viruses – like malaria, all spread by mosquitos – in some Malian residents.

The observational study of 108 patients, published recently in The American Journal of Tropical Medicine and Hygiene, added the advanced diagnostic test, known as VirCapSeq-VERT, to traditional testing methods to identify cases of measles, SARS-CoV-2, HIV, and other viral diseases in patients. Surprisingly, more than 40% of patients were found to have more than one infection.

VirCapSeq-VERT is the virome capture-sequencing platform for vertebrate viruses, a powerful DNA sequencing technique capable of finding all viruses known to infect humans and animals in specimens, such as plasma. VirCapSeq-VERT uses special probes that capture all virus DNA and RNA in a specimen, even if the researcher does not know which specific virus to look for. Scientists then sequence the captured DNA and RNA to identify viruses present to solve the mystery of which viral infection(s) a patient has.

In the study, the researchers suggest that combining VirCapSeq-VERT with traditional diagnostic tests could greatly assist physicians “in settings with large disease burdens or high rates of coinfections and may lead to better outcomes for patients.”

Scientists from NIAID’s Division of Clinical Research collaborated on the project from July 2020 to October 2022 with colleagues from the University of Sciences, Techniques, and Technologies of Bamako, Mali, and Columbia University.

Reference: A Koné, et al. Adding Virome Capture Metagenomic Sequencing to Conventional Laboratory Testing Increases Unknown Fever Etiology Determination in Bamako, Mali. The American Journal of Tropical Medicine and Hygiene (2024). DOI: https://doi.org/10.4269/ajtmh.24-0449

Contact Information

Contact the NIAID Media Team.

301-402-1663
niaidnews@niaid.nih.gov
