Past seminars

Ryan Morin, October 13 2021

Recent advances and future directions in lymphoma research

Lymphomas represent a highly heterogeneous collection of cancers that arise from immune cells at various stages of development. Through molecular characterization of RNA and DNA, many of the pathways, genes, and mutations relevant for the common lymphomas have been identified. This has helped identify prognostic biomarkers and has stimulated the development of new targeted therapeutics. Patterns of molecular features have also allowed the subdivision of lymphomas into discrete molecular subgroups. This presentation will explore the recent advances and highlight future directions of lymphoma genomic research as it relates to the goal of implementing precision medicine for this cancer.

Jacob Schreiber, October 8 2021

Avocado: a multi-scale deep tensor factorization model for imputation of epigenomic experiments

In the past decade, the use of high-throughput sequencing assays has allowed researchers to experimentally acquire thousands of functional measurements for each base pair in the human genome. Despite their value, these measurements are only a small fraction of the potential experiments that could be performed while also being too numerous to easily visualize or compute on. This problem is worse in other model organisms, where a smaller number of experiments and samples have been investigated. In a recent trio of publications, we address these challenges with a deep neural network tensor factorization method, Avocado, that compresses these measurements into dense, information-rich representations. We demonstrate that these learned representations can be used to impute with high accuracy the output of experimental assays that have not yet been performed, that these representations can be used across species to make zero-shot imputations, and that machine learning models that leverage these representations outperform those trained directly on the functional measurements on a variety of genomics tasks. The code is publicly available.

Alexander Schönhuth, October 5 2021

Capsule Networks -- a brief tutorial and applications in single cell typing

I will provide a brief tutorial about capsule networks (CAPNs), and explain what distinguishes them from convolutional neural networks (CNNs). Although the concept had been proposed earlier, CAPNs saw their first successful application only in 2017.

The motivation that underlies the design of CAPNs is to overcome technical challenges that affected CNNs, in particular when dealing with distorted or overlapping images. Key to success is to have neurons, the fundamental units of neural networks, modeled as vectors (in CAPNs) instead of just scalars (as in CNNs).

Finally, I will discuss how to predict the types of single cells using CAPNs. Just as in their original application, the particular strengths of CAPNs in single cell typing are superior prediction performance, enhanced interpretability and explainability, and the reduced amount of training data required for optimal performance.

William Noble, February 7 2020

Learning embeddings of bulk and single-cell genomic data for imputation and multi-omic integration

Many machine learning methods work by translating data points from the space in which they reside to a new, latent space of either higher or lower dimension. In this talk, I will describe two settings in which a latent representation can help us make sense of complex genomic data. In one case, we train a deep tensor factorization model to learn latent representations of genomic assay types, cell types and genomic positions. These learned embeddings then turn out to be useful not only for imputing new genomics experiments, but also for a variety of other downstream machine learning tasks. In the second setting, I describe how an unsupervised embedding approach can map diverse types of single-cell measurements into a latent space, effectively providing an in silico co-assay for experiments performed on similar sets of cells but using different experimental techniques.

Professor Noble's presentation was the keynote of the annual Omics Research Day, a full-day program of research communication events on February 7 organized by students from the many fields where omics data science matters.

Caroline Colijn, January 24, 2020

Genomic data and vaccine design in the pneumococcus

I will describe how we can use genomic data to inform the design of vaccines for the bacterium Streptococcus pneumoniae through a combination of surveillance, modelling and optimization. We have found that the compositions of bacterial populations in different settings are distinctive enough that quite different vaccine formulations are likely to be optimal in different places. Using vaccines designed for Western populations, in particular, is much less effective in SE Asian bacterial populations than in the West. We suggest that through good surveillance and strong principled design we could create precision healthcare interventions, not only at the individual patient level but at the population public health level as well.

Amy Lee, November 28 2019

Systems Biology to Decipher Host-Pathogen Interactions

The association between pathogens and their hosts is complex. Locked in molecular warfare, the host immune system attempts to actively detect and eliminate pathogens, while the pathogens utilize sophisticated strategies to evade host immune detection and attack. This interplay between host and pathogen leads to a phenotypic response in both players culminating in the activation of numerous signaling cascades, alteration of protein activities and changes to cellular metabolism. In this seminar, I will demonstrate how I have applied systems biology approaches to address the complexity of this host-pathogen interface. Specifically, I will highlight how emerging approaches such as topic modeling and multi-OMIC integration can provide strong biological insights into infection and immunity. To illustrate the power of these methods, I will demonstrate their applications to two challenging questions: (1) How do macrophages respond to diverse stimuli and mount appropriate responses; and (2) Why are newborns highly susceptible to infections during their first week of life? My talk will highlight how advanced bioinformatics methods enable the generation of novel hypotheses and guide experimental strategies that will ultimately further our understanding of the host-pathogen interface.

Jamie Scott and Felix Breden, October 2019

Developing standards for OMICS data generation, annotation, storage and sharing, and why it's important

A major challenge in the OMICS era is translating the immense amount of genomic, health and behavioral data available into improved biomedical research, translation, and ultimately, patient care. Systems immunology is an emerging area that exemplifies this challenge, as it aims to integrate and analyze, as a whole, immunological data from multiple OMICS approaches, including transcriptomics, proteomics, metabolomics, microbiomics, and genomics. Systems approaches promise to revolutionize biomedical research, but to be successful, each type of OMICS data must be available across multiple institutions, studies and laboratories. This level of data availability requires the universal acceptance of shared protocols and standards for generation, curation, storage and analysis of data, so that each data type can be easily queried for a given research question, and data pertaining to that question (and their associated metadata) can be easily collected and bundled for analysis. The development of universal standards for each type of OMICS data would ensure the maximal use of every data set, thus optimizing their value.

The Adaptive Immune Receptor Repertoire (AIRR) Community was formed in 2015 to facilitate this goal for high-throughput sequencing data that characterize the diverse repertoires of antibody/B-cell receptor and T-cell receptor sequences (AIRR-seq data) in a blood or tissue sample. Currently, the AIRR Community comprises 6 Working Groups and 3 Sub-committees; from its inception, it has adopted a "grass-roots" approach to developing standards for AIRR-seq data, with all meetings and Working Groups open to interested participants, and with protocols and standards developed and vetted by the entire Community.

The AIRR Community has achieved multiple successes, as evidenced by its strong and growing network of active participants; it has published several papers on various standards in top immunology journals and has garnered buy-in from a spectrum of stakeholders. The iReceptor data integration platform is a successful implementation of this grass-roots approach; it provides a data commons for groups seeking a framework for storing and curating AIRR-seq data, allowing searches and data federation across multiple, independent distributed data sets.

In this presentation, we will discuss the challenges and advantages in this community-based approach to "big data" standardization, and will encourage discussion of whether and how it could be applied to other types of OMICS data.

Joseph Lehar, May 2019

Exploring systems biology and precision medicine at scale

I will present research in the areas of systems biology and translational precision medicine, from my teams at Verily, Novartis, and CombinatoRx, in collaboration with the Broad Institute, Massachusetts General Hospital, and Boston University. I will focus on areas of innovation including: using drug combinations as probes of biological systems; applying in-vitro and in-vivo models in systematic perturbation experiments; and using deep learning to extend integrative models or accelerate clinical decision making. I will conclude with a discussion of promising data science initiatives at J&J Oncology, and thoughts on the impact of computational approaches to biomedicine.

Dr. Joseph Lehar is VP of Data Science, Oncology, at J&J/Janssen. He has more than 15 years of experience building computational biology teams for translational drug discovery, and more than 30 years of research in systems biology, oncology, digital health, and astrophysics. Prior to J&J, he was at Merck, Google/Verily, Novartis, CombinatoRx, and the Broad Institute (Whitehead Inst CGR). Since 2002, he has been involved in other research and has engaged with students through Boston University. His first career in astrophysics was at MIT, Cambridge, and Harvard, centered on gravitational lenses.

Nadine Provencal, April 2019

Hippocampal Progenitor Cell Models in Deciphering the Epigenomics of Stress

Prenatal stress exposure is associated with the risk for psychiatric disorders later in life. This may be mediated via enhanced exposure to glucocorticoids (GCs), known to impact neurogenesis. Dr. Provencal's work aimed to identify molecular mediators of these effects, focusing on long-lasting epigenetic changes. In a human hippocampal progenitor cell (HPC) line, her group assessed the short- and long-term effects of GC exposure during neurogenesis on mRNA expression and DNA methylation (DNAm) profiles. In these cells, early GC exposure induced changes in DNAm which cluster into four trajectories over HPC-differentiation, with transient as well as long-lasting DNAm changes. Lasting DMSs map to distinct functional pathways and are selectively enriched for poised and bivalent enhancer marks. Lasting DMSs have little correlation with lasting gene expression changes, but are associated with a significantly enhanced transcriptional response to a second acute GC challenge. Furthermore, a subset of these lasting GC-induced marks in HPCs are significantly enriched for DNAm changes observed in the cord blood of newborns with heightened GC exposure during pregnancy, including pre-delivery betamethasone treatment as well as maternal depression and anxiety during pregnancy. Moreover, combining these sites into a GC-responsive poly-epigenetic score in cord blood DNA shows significant association with maternal depression and anxiety. Overall, this work suggests that early exposure to GCs may have a lasting impact on the nervous system development not only by altering proliferation and neuronal differentiation rates as previously reported, but also by priming relevant genes to an altered transcriptional response to subsequent GC activation. The findings of the in vitro model may translate to human pregnancy where DNAm marks could serve as biomarkers for prenatal GC exposure and potentially for the increased risk of developing behavioral problems and psychiatric disorders.

The seminar was preceded by a short talk by Justin Jia (Brinkman Lab): Revealing potential antimicrobial resistance gene mobility trends using >15000 replicons.

Antimicrobial resistance (AMR) is an emerging issue that has not been effectively addressed worldwide. Limitations still exist in understanding the mobility of resistance genes between bacterial species, which is needed for better and more focused risk assessment of AMR spread in pathogens of both human health and agri-foods interest. Here, we present a comprehensive examination of trends in AMR association with mobile elements across all NCBI RefSeq bacterial species sequenced to date, totalling ~16600 bacterial replicons. This large-scale mobility analysis reveals that AMR genes, collectively, are disproportionately found in mobile regions of the genome. However, classification of AMR genes into higher-level categories (e.g. resistance mechanisms) identifies certain drug classes and resistance mechanisms that are significantly more associated with the non-mobile chromosome. Notably, resistance mechanisms that are specialized in function tend to be more mobile. Using these data, we propose a model that would predict which AMR genes may be more likely to be horizontally transmitted. The analyses presented here are an important step in gaining perspectives on global trends of AMR transmission across diverse environments for future public health risk assessment of AMR transmission.

Maxwell Libbrecht, March 2019

Understanding human genome regulation through unsupervised machine learning

Despite having sequenced the human genome over fifteen years ago, much is still unknown about how it functions. With the advent of high-throughput genomics technologies, it is now possible to measure properties of the genome across the entire genome in a single experiment, such as measuring where a given protein binds to the DNA or what genes are expressed. However, the complexity and massive scale of these data sets--billions of base pairs with thousands of measurements each--pose challenges to their analysis. My research focuses on the development of new machine learning methods that address the challenges posed by genomics data sets.

I will focus on two projects. First, I will present a method for understanding chromatin domains. The genomic domain where a gene resides (on the scale of 100k-1M base pairs) influences its regulation: the same gene with the same local regulatory elements (that is, the same promoter) may be expressed in one neighborhood but be silent in another. This type of regulation is crucial for gene regulation, but is currently much less well understood than local regulation. I will present a new method for discovering and annotating genomic domains that integrates many types of genomics data sets. Unlike previous methods, this approach can incorporate information about the 3D conformation of the genome in the nucleus. This is possible through a novel type of regularization applied to a probabilistic graphical model.

Second, I will present a method for annotating the genome with continuous chromatin state features. Previous unsupervised genome annotation methods output an annotation of the genome that assigns a discrete chromatin state label to each genomic position. These existing semi-automated genome annotation (SAGA) methods have several limitations caused by the discrete annotation framework: such annotations cannot easily represent varying strengths of genomic elements, and they cannot easily represent combinatorial elements that simultaneously exhibit multiple types of activity. To remedy these limitations, we propose an annotation strategy that instead outputs a vector of chromatin state features at each position rather than a single discrete label, using a non-negative Kalman filter state space model.

Followed by a panel discussion also featuring:

Moderator: Fiona Brinkman, Professor, Molecular Biology and Biochemistry, Associate Professor, Faculty of Health Sciences and School of Computing Science
Lloyd Elliott, Professor, Statistics and Actuarial Science
Michelle TT Crown, PhD Candidate, Davidson Lab, Molecular Biology and Biochemistry
Oliver Snow, PhD Candidate, Computing Science

Parisa Shooshtari, January 2019

Integrating multiple data types to uncover biological mechanisms of complex diseases

We are living in an era where new biotechnology advancements have enabled generation of diverse types of biological data at very large scales. Examples of such data include genomic, epigenomic, transcriptomic and microbiome data. We are now equipped with a wealth of data that can be explored to answer many biological questions related to diseases in ways that were not possible before. For many complex diseases (such as autoimmune disorders, cardiovascular diseases, chronic kidney diseases, cancer and psychiatric disorders) there are multiple factors that are relevant. Due to the complexity of such diseases, a single data type is not informative enough to capture all relevant factors, and therefore, integrating multiple types of biological data is considered a crucial step in many applications. However, the key challenge in data integration is how to develop effective models that provide a comprehensive insight into the disease mechanisms. In this talk, I will explain a successful example, where developing an effective data integration model helps uncover cellular and molecular mechanisms underlying complex diseases. This is based on our study, where we developed a three-step computational model that integrates genome-wide association studies (GWAS) data, epigenetic data and gene expression data in order to predict gene regulatory mechanisms in nine autoimmune and inflammatory diseases. Our model is general and can be applied to other common complex diseases with minor modifications.

Dr. Parisa Shooshtari is a research associate at the Center for Computational Medicine at SickKids Research Institute, where she leads data analysis initiatives in single-cell omics and multi-omics data integration. She received her PhD in Computer Science from the University of British Columbia (UBC), followed by postdoctoral research in Computational Genomics at Yale University and the Broad Institute of MIT and Harvard. Her fundamental research interest is to uncover cellular and molecular mechanisms of complex diseases. She approaches this big challenge through integrative analysis of multi-omics data and by developing machine learning methods for the analysis of single-cell genomics and cytometric data. Her research findings have been published in high-impact journals including American Journal of Human Genetics, Nature Methods and PLOS Genetics, and have been awarded platform talks at international meetings including ASHG and CYTO. Besides her independent research, she has been a member of multiple international consortia including International Multiple Sclerosis Genetics Consortium (IMSG) and FlowCAP, contributing to their data analysis efforts.

Casey Dunn (Yale), January 2019

Dr. Casey Dunn is a professor of Ecology and Evolutionary Biology at Yale University and author of "Practical Computing for Biologists". At the center of his research is the diversity of life and the key role evolution plays in it. He uses a combination of field, laboratory, and computational work to explore organisms such as deep-sea creatures and evolutionary relationships.

Omics Data Science Initiative