NIAID Data Science Previous Seminars

2021 Seminars

Paul A. Harris, Ph.D., FACMI

Director, Office of Research Informatics
Professor, Department of Biomedical Informatics, Biostatistics, and Biomedical Engineering  
Vanderbilt University Medical Center

August 6, 2021

Abstract

REDCap is a secure data management platform developed and disseminated at no cost by Vanderbilt University Medical Center. Its streamlined process for rapidly creating and designing projects offers a vast array of tools that can be tailored to virtually any data collection strategy. REDCap provides automated export procedures for seamless data downloads to Excel and common statistical packages (SPSS, SAS, Stata, R), as well as a built-in project calendar, a scheduling module, ad hoc reporting tools, and advanced features such as branching logic, file uploading, and calculated fields. In addition, for clinical data, REDCap supports interoperability through the Fast Healthcare Interoperability Resources (FHIR) specification. The REDCap Consortium includes over 5,000 academic, non-profit, and government organizations in 141 countries. This talk will describe the origins of REDCap and the REDCap Consortium, highlight new and upcoming functionality for the platform, and provide innovative use cases supporting translational research and research operations in immune-mediated and infectious diseases.

Joao Xavier, Ph.D.

Faculty Member
Sloan Kettering Institute for Cancer Research

July 2, 2021

Abstract

The impact of the gut microbiota on human health is affected by several factors, including its composition, drug administration, therapeutic interventions and underlying diseases. Unfortunately, many publicly available human microbiota datasets were collected to study the impact of single variables; they typically consist of outpatients in cross-sectional studies, have small sample sizes and/or lack the metadata needed to account for confounders. These limitations can complicate reusing the data for questions outside their original focus. We compiled a comprehensive longitudinal patient dataset that overcomes those limitations: a collection of fecal microbiota compositions (>10,000 microbiota samples from >1,000 patients) and a rich description of the “hospitalome” experienced by the hosts, i.e., their drug exposures and other metadata, from patients with cancer who were hospitalized to receive allogeneic hematopoietic cell transplantation (allo-HCT) at a large cancer center in the United States. The data were published as a descriptor in the journal Scientific Data together with five examples of how to apply these data to address clinical and scientific questions on host-associated microbial communities.

Rangan Sreenivas Sukumar, Ph.D.

Distinguished Technologist
Hewlett Packard Enterprise (HPE) 

June 4, 2021

Abstract

In March of 2020, the “Force for Good” pledge of intellectual property to fight COVID-19 brought HPE products, resources and expertise to bear on the problem of drug/vaccine discovery. Several scientists engaged in collaborations with HPE volunteers to accelerate efforts towards a drug/vaccine. This talk documents the spirit and outcome of such a collaboration between domain and data science, as an example of how artificial intelligence (AI), when applied with explainable context, becomes augmented intelligence: one that empowers human experts to excel at their best by doing what computers do best. More specifically, we will demonstrate AI augmenting experts on hypothesis generation tasks by connecting and reasoning with a curated knowledge universe of medical facts and data. We explain the construction of a knowledge graph from 13 open datasets such as PubChem, UniProt, ChEMBL, RCSB, ClinicalTrials.gov, etc. (30 TB in size with 150 billion medical facts/properties) and present the power of a massively parallel-processing database for interactive and exploratory discovery from multi-modal data (protein sequences, knowledge facts, and tables). On this knowledge graph we will show the ability to search for the “what-is”, “what-if”, “what-else” and the “what-could-be” using reasoning algorithms. We will show results from queries capable of comparing protein sequences (~4 million comparisons per query in under a minute), and explain how one scientist during one of our hackathons was able to look for proteins in COVID-19 (and newer variants) that are shared with other sequenced viruses, bacteria and fungi, search for previously studied protein activity in other organisms, and further extrapolate that knowledge to known protein-ligand activity from clinical trials data.
This curiosity established a workflow for drug repurposing using our knowledge graph that serendipitously discovered a connection between tetanus and COVID-19, posing the question: “Is tetanus vaccination contributing to reduced severity of COVID-19 infection?” We will conclude this talk with a live demo, encouraging domain and data scientists to pose questions beyond COVID-19 on this massive knowledge graph and to engage with our team for further collaboration.

Nina Fefferman, Ph.D.

Director of NIMBioS (National Institute for Mathematical and Biological Synthesis)
Associate Director of the One Health Initiative at University of Tennessee
Professor of Ecology and Evolutionary Biology at University of Tennessee, Knoxville
Professor of Mathematics at University of Tennessee, Knoxville

May 7, 2021

Abstract

Novel and emerging threats, such as this past year's COVID-19 pandemic, provide an immediate call to action for researchers across the full spectrum from basic and applied biomedical, natural, and physical sciences, mathematics, and engineering, to social sciences, communication, and policy. However, working to decrease knowledge gaps does not always translate equally into increasing efficacy in threat readiness and response. These differences can be due either to the scale of impact enabled by increased understanding or to the timing with which those insights (and therefore benefits) can be achieved. In this talk, we will discuss perspectives and formal techniques for identifying and prioritizing research to address time-scale appropriate gaps in understanding that can meaningfully alter the success of efforts to manage novel threats in real time, whether in gradual anticipatory preparation or in rapid response.

Tonia Korves, Ph.D.

Lead Data Scientist
Data and Human-Centered Solutions Innovation Center
MITRE Corporation

April 2, 2021

Abstract

As COVID-19 research rapidly escalated last year, we quickly built a platform to help biomedical experts track published research about potential therapeutics and vaccines. The platform includes a natural language processing pipeline that identifies scientific documents about SARS-CoV-2 and other viruses, particular drugs, and vaccine types, sorted by stages of research, and a dashboard called the COVID-19 Therapeutic Information Browser, available at covidtib.c19hcc.org. The comprehensive data from this platform enables us to characterize COVID-19 drug research over time and at scale, and potentially draw lessons that can inform future decisions. In this talk, we will present our natural language processing methods, the dashboard, and an analysis of trends in published COVID-19 drug research and clinical trials over the past year. We will also discuss other uses for this data, outstanding challenges, and other potential applications of this approach.

Patrick D. Schloss, Ph.D.

Frederick G. Novy Collegiate Professor of Microbiome Research
Department of Microbiology & Immunology
University of Michigan Medical School

March 5, 2021

Abstract

The “reproducibility crisis” in science affects microbiology as much as any other area of inquiry, and microbiologists have long struggled to make their research reproducible. Schloss recently delineated a framework for identifying and overcoming threats to reproducibility, replicability, robustness, and generalizability of microbiome research that is broadly applicable to other areas of microbiology. There are many reasons why a researcher is unable to reproduce a previous result, and even if a result is reproducible, it may not be correct. Furthermore, failures to reproduce previous results have much to teach us about the scientific process and microbial life itself. To help safeguard against threats to reproducibility, Schloss developed the Riffomonas Reproducible Research tutorial series. This is a collection of tutorials that focuses on the improvement of reproducible data analysis for those doing microbial ecology research. Although the materials focus on issues in microbial ecology, the principles are broadly applicable. Each tutorial presents broad concepts and how they are related to reproducibility as well as applied practice using specific tools that are designed to foster reproducibility. Instead of seeing signs of a crisis in others’ work, we need to appreciate the technical and social difficulties that limit reproducibility in the work of others as well as our own. 

John Tsang, Ph.D.

Chief, Multiscale Systems Biology Section, Laboratory of Immune System Biology, NIAID 
Co-Director, NIH Center for Human Immunology (CHI)

February 5, 2021

Abstract

We develop and integrate multi-omics and data science approaches to monitor the human immune system before and after perturbations to uncover predictors and potential determinants of responsiveness and outcomes. Here I cover two areas of our systems immunology investigations involving immune responses to vaccination and COVID-19 in humans. First, I discuss baseline blood transcriptional signatures predictive of antibody responses to both influenza and yellow fever vaccinations in healthy subjects. These same signatures, evaluated at clinical quiescence, are also correlated with disease activity in patients with systemic lupus erythematosus with plasmablast-associated flares. CITE-seq multi-modal profiling of surface proteins and transcriptomes of single peripheral immune cells from healthy high and low influenza vaccination responders revealed that our signatures reflect the extent of activation in a plasmacytoid dendritic cell–type I IFN–T/B lymphocyte network. Our findings raise the prospect that modulating such immune baseline states may improve vaccine responsiveness and mitigate undesirable autoimmune disease activity. Next, we turn to COVID-19, which exhibits extensive patient-to-patient heterogeneity. To link immune response variation to disease severity and outcome over time, we longitudinally assessed circulating proteins as well as 188 surface protein markers, transcriptome, and T-cell receptor sequence simultaneously in single peripheral immune cells from patients. Conditional-independence network analysis revealed primary correlates of disease severity, including gene expression signatures of apoptosis in plasmacytoid dendritic cells and attenuated inflammation but increased fatty acid metabolism in CD56dim CD16hi NK cells linked positively to circulating levels of IL-15.
While cellular inflammation was depressed in severe patients early after hospitalization, it became elevated by days 17-23 post symptom onset, suggestive of a late wave of inflammatory responses. Furthermore, circulating protein trajectories at this time diverged between, and were predictive of, recovery versus fatal outcomes. Our findings stress the importance of timing in the analysis, clinical monitoring, and therapeutic intervention of COVID-19.

Lucila Ohno-Machado, M.D., Ph.D., M.B.A.

Professor of Medicine 
Chair, Department of Biomedical Informatics  
Associate Dean for Informatics and Technology 
University of California San Diego 

January 8, 2021

Abstract

Data sharing is essential for the acceleration of science, but privacy concerns need to be addressed before clinical data can be properly shared for research. I will briefly introduce the main issues in clinical data sharing, as perceived by researchers and patients, and describe how a combination of privacy technology (i.e., methods that make it difficult to identify a specific patient whose data are going to be shared) and policy can help strike a balance between data utility for researchers and privacy protection for the patient and healthcare institutions.

2020 Seminars

Sanchita Bhattacharya

Bioinformatics Project Leader
Bakar Computational Health Sciences Institute (BCHSI)
University of California, San Francisco (UCSF)

Zicheng Hu, Ph.D.

Research Scientist 
Bakar Computational Health Sciences Institute (BCHSI)
University of California, San Francisco (UCSF)

December 5, 2020

Abstract

In the field of clinical research, we are just beginning to explore repurposing open-access datasets to build a knowledge base, gain insight into novel discoveries, and generate data-driven hypotheses that were not originally formulated in the published studies. This presentation will showcase the significant efforts in the meta-analysis of open-access immunological studies and secondary analysis of clinical trial data from the NIAID-DAIT-funded ImmPort database. We will also present a case study on analyzing cytometry data using deep learning models, recently published in PNAS.

Bernhard Palsson, Ph.D.

Distinguished Galletti Professor of Bioengineering, Department of Bioengineering, UC San Diego
Professor of Pediatrics, UC San Diego School of Medicine

November 6, 2020

Abstract

The need to integrate knowledge types into big data analytics, generally referred to as explanatory-artificial-intelligence (x-AI), is growing. This talk will describe progress with three approaches to such knowledge enrichment: 1) the use of Independent Component Analysis (ICA) to define independently modulated sets of genes in bacterial transcriptomes, 2) the use of pangenome analysis for the thousands of bacterial genome sequences being generated, and 3) the use of machine learning methods for the analysis of antimicrobial resistance. The first case illustrates the principle of ‘getting answers to questions not asked,’ the second case illuminates ‘what is learned with scale,’ and the third case shows how mechanisms are built into genome-wide association studies (GWAS) using flux balance analysis (FBA).

Evan Floden, Ph.D.

CEO & Co-founder
Seqera Labs
Barcelona, Spain

October 3, 2020

Abstract

Nextflow is a popular open-source framework for the development and deployment of FAIR data analysis pipelines. It simplifies the use of multi-scale containers and allows seamless integration with batch schedulers as well as built-in support for cloud computing services such as AWS Batch, Google Cloud Life Sciences and Kubernetes. We will discuss the key design elements of Nextflow and compare and contrast these with other approaches such as CWL and Snakemake. I will introduce an important new revision to the Nextflow syntax, termed DSL2, which represents a major shift with the ability to modularize and reuse workflow components. Finally, I will touch on the community building aspects through the work of nf-core and the future of Nextflow in the context of Nextflow Tower.

Dave Clements

Galaxy Community Manager
Johns Hopkins University

Steven Weaver

Senior Programmer Analyst
Temple University 

September 4, 2020

Abstract

Galaxy is an open web-based platform for data integration and analysis in the life sciences. Galaxy makes sophisticated bioinformatics analysis accessible to bench researchers without requiring them to learn Linux system administration or command line interfaces. Every tool and tool setting is automatically recorded by Galaxy, making analyses reproducible by default. Analyses can also be shared with colleagues and with the public, enabling others to re-use and reproduce analysis pipelines. In the first part of this webinar, we will introduce Galaxy and its supporting ecosystem and community. This will include the many ways Galaxy is available to researchers, and a brief overview of the Galaxy user interface.

In the second part, we will walk through an application of Galaxy to SARS-CoV-2 research. We developed and published public reproducible Galaxy workflows for processing raw deep sequencing read data and calling intra-host genomic variants, as well as processing GISAID full-genome data in a comparative evolutionary framework. The goal of our analysis is to make use of all readily available sources of information to create a frequently updated list of sites in the SARS-CoV-2 genome that may be subject to positive or negative selection. High-ranking sites on the list, especially those that are consistently detected over time or accumulate additional evidence in their favor with more data, could be taken as a set of candidates for functional impact or other downstream analyses. We search for evidence of selection at three different evolutionary levels: intra-host (next generation sequencing (NGS) data), between SARS-CoV-2 isolates (assembled genome data), and among beta-coronavirus isolates that are closely related to SARS-CoV-2 (assembled genome data). In this webinar, we will review the comparative analysis dashboard that can be used to identify which sites may have a functional impact or could be used for further downstream analysis, as well as how Galaxy can be used to implement the pipeline on researchers' datasets.

Participants will learn how Galaxy is available, the basics of using Galaxy for data analysis, and how it can be applied in immunology, using SARS-CoV-2 research as an example domain.

Melissa Haendel, Ph.D.

Director of the Center for Data to Health, Oregon Health and Science University
Director of Translational Data Science, Oregon State University

May 15, 2020

Abstract

The COVID-19 global emergency raises many difficult care and healthcare management questions. Who is infectious? Who should be tested? Who may need hospital care and at what level? What are the key risk factors? What are the best prognostic indicators? Which drugs are the most viable candidates for patients? What are best practices for ethical resource allocation? How can we efficiently and effectively assemble a cohort for a trial? How can we rapidly deploy clinical decision support tools when new knowledge is available every day? In this time of global crisis, we can rise to work together to address these and many other important questions.

The National Center for Data to Health (CD2H) and NCATS are coordinating the creation of a centralized, secure portal for hosting COVID-19 clinical data, called the National COVID Cohort Collaborative (N3C). This initiative is a partnership among NIH institutes, HHS, VA, FDA, the CTSA program, the distributed clinical data networks (PCORnet, OHDSI, ACT/i2b2, and TriNetX), and other clinical institutions. The N3C will create a national limited dataset of COVID-19 patients and controls, consisting of all available EHR data related to these patients, transformed into a common analytic model. The cloud-based, FedRAMP-certified collaborative portal will enable development of machine learning and other informatics tools that require a large patient-level dataset, and will be overseen by a data access committee.

This global pandemic presents a unique opportunity to bring together top informatics experts from around the country to address our collective challenges. The N3C resource will offer a valuable and complementary contribution, not only for addressing the COVID-19 crisis, but also for transforming how we perform global clinical research as a nation. We believe this portal will provide additional assets needed to rapidly develop the analytics that clinical centers and physicians need now.

Peter Karp, Ph.D.

Director, Bioinformatics Research Group, Artificial Intelligence Center
SRI International

May 1, 2020

Abstract

BioCyc.org [1] is an extensive web portal containing 17,000 microbial genomes and associated metabolic pathways. BioCyc databases are created through a process that combines computational inferences with imported and curated data from multiple sources. The first step in the creation of BioCyc databases is to run prediction algorithms for metabolic pathways, operons, Pfam domains, and orthologs. We next run programs that import data from related databases (such as UniProt) including regulatory network data, protein features, subcellular locations, and Gene Ontology assignments.

Curated databases next receive intensive review and updating by a Ph.D. biologist, which includes reviewing the computationally predicted metabolic pathways, entering new gene functions and metabolic pathways from the experimental literature, and defining protein complexes. The resulting databases are high-quality reference sources for the latest gene and pathway information. Overall, the BioCyc databases have been curated from 95,000 publications.

The BioCyc website provides extensive bioinformatics tools for searching and analyzing these databases, and leveraging them for analysis of omics datasets. Genome-related tools include a genome browser, sequence searching and alignment, and extraction of sequence regions. Pathway-related tools include pathway diagrams, a tool for navigating zoomable organism-specific metabolic map diagrams, and a tool for searching for metabolic routes that transform a starting metabolite into a product metabolite. Regulation tools depict operons and regulatory sites, as well as showing full organism regulatory networks. Comparative analysis tools enable comparisons of genome organization, of orthologs, and of pathway complements. Omics data analysis tools support enrichment analysis and painting of transcriptomics and metabolomics data onto individual pathway diagrams and onto zoomable metabolic map diagrams. A new Omics Dashboard tool enables interactive exploration of omics datasets through a hierarchy of cellular systems. SmartTables enable users to construct tables of genes, metabolites, or pathways, and to perform analysis such as transforming a set of pathways to all genes within the pathway set.

[1] P.D. Karp et al., "The BioCyc collection of microbial genomes and metabolic pathways," Briefings in Bioinformatics, 2017.

Emma Hodcroft, Ph.D.

Biozentrum
University of Basel, Basel, Switzerland

James Hadfield, Ph.D.

Fred Hutchinson Cancer Research Center

April 3, 2020

Abstract

The emergence of SARS-CoV-2 in China has driven an enormous global effort to contribute and share genomic data in order to inform local authorities and the international community about key aspects of the outbreak. Analyses of these data have played an important role in tracking the epidemiology and evolution of the virus in real time. Nextstrain is an open science initiative to harness the scientific and public health potential of pathogen genome data, and has previously provided key insight into outbreaks of Ebola and Zika, and the longer-term spread of pathogens such as influenza and enterovirus. This initiative provides a continually updated view of publicly available data alongside powerful analytic and visualization tools for use by the community.

Drs. Hodcroft and Hadfield, along with other members of the Nextstrain team, have been maintaining an up-to-date analysis of SARS-CoV-2 at nextstrain.org/ncov since January 20, 2020. This talk will provide an overview of Nextstrain and how it embodies ‘FAIR’ principles (Findable, Accessible, Interoperable, Reusable), and will outline the insights Nextstrain has provided about the COVID-19 outbreak via genomic data sharing from around the world.

Tim Read, Ph.D.

Professor of Infectious Diseases with Secondary Appointment in Human Genetics 
Emory University

March 6, 2020

Abstract

Technology innovations in genomics that reduce sequencing time and cost have created new opportunities for biological research. Since the mid-2000s, large-scale sequencing of bacterial genomes using Illumina technology has become a standard for pathogen epidemiology studies, resulting in very large data sets for some species. Genome data has been generated faster than it can be conveniently analyzed and integrated with results of classical experimental approaches to microbiology. We became interested in the task of analyzing tens of thousands of genomes of the pathogenic bacterium Staphylococcus aureus in the public domain. We created a workflow called the Staphopia Analysis Pipeline (StAP), using Nextflow software, to automate processing (e.g., QC, genome assembly, annotation, genotyping) using open source bioinformatic tools and databases. The pipeline was encapsulated in a Docker container to allow it to be deployed across software platforms. We collaborated with the Cancer Genomics Cloud (CGC) and used their Seven Bridges-based platform to process >40,000 genomes in a 10-day period in November 2017. A public instance of StAP was also created at CGC to allow anonymous users to run StAP on their own data. In order to share the results of our analysis with other researchers, we created the Staphopia database, with public APIs for data download (>350 endpoints) and an R package to enhance data analysis. We have been using Staphopia both as a resource to generate hypotheses (“top-down approaches”) and to understand how results from lab studies relate to the species as a whole (“bottom-up”). An example of the former is looking at the co-occurrence of resistance to mupirocin and fluoroquinolones with methicillin resistance (MRSA). An example of the bottom-up approach has been mapping the distribution of SNPs found to be associated with intermediate vancomycin resistance selected in the laboratory across different subtypes of S. aureus.
We have recently created a new series of pipelines called Bactopia, built on the experience gained from Staphopia but generalizable to any bacterial species. Bactopia consists of a dataset setup step (Bactopia Datasets), in which a series of customizable datasets is created for the species of interest. The Bactopia Analysis Pipeline performs analyses based on the datasets downloaded and outputs the processed data to a structured directory format. We have created a series of Bactopia Tools that perform specific post-processing on some or all of the genomes processed. These include pan-genome analysis, computing average nucleotide identity between samples, extracting and profiling the 16S genes, and taxonomic classification via GTDB. We performed a Bactopia demonstration project on 1,664 public Lactobacillus genomes in SRA in December 2019.

Contact Information

For more information about these seminars, such as a copy of the slides, please contact Steve Tsang.
