Theoretical and computational biophysics research at GW has a distinctive history that traces back to the heyday of Watson and Crick’s discovery of the double helix in 1953. George Gamow, then a professor of theoretical physics at GW and already well known for his Big Bang theory, joined the race to crack the genetic code hidden in the double helix. A true believer that scientists from different fields should share their ideas and results, Gamow founded the “RNA Tie Club” in 1954, ushering in an exciting era that culminated in the final deciphering of the code by Nirenberg in 1961. Times have changed with the rapid arrival of a new era of genomics, but Gamow’s imprint at GW endures. His spirit of interdisciplinary interaction and collaboration remains the cornerstone of our current research activities.
Living systems are open, far-from-equilibrium systems with heterogeneous constituents. What are the design principles and emergent behaviors of these systems? How do the structures encode function? And how do complexity and self-organization in living systems emerge from evolution? The theme of our research is to find clues to these mysteries at the genome and molecular scales from systems, information, and biophysical perspectives. To this end, we use complementary approaches including computational and statistical data mining, machine learning, and theoretical modeling. Because the spatial and temporal pattern of gene expression underlies the behavior of cells and beyond, our main focus is on gene regulation.
Epigenome and enhancer biology
Eukaryotic DNA is packaged into a highly organized chromatin structure with nucleosomes as the fundamental units. Chromatin not only implements the physical compaction of the genome but also enables regulation of the genes encoded in the genome. Components of the nucleosome, including both the DNA sequence and the histone proteins, can be chemically modified. These chromatin modifications can encode information that is potentially heritable and are a major component of the epigenome. While the genome is shared across different cell types, the epigenome is unique to each cell type and underlies cell-type-specific functions. The epigenome plays an essential role in gene regulation in developmental processes, stimulation, and diseases. The epigenome can be programmed, as nature has developed a sophisticated system of epigenetic regulators to read, write, and erase the epigenetic information. Our research focuses on characterizing the epigenomic landscape (e.g., Nature Genetics 2008) and its dynamics (e.g., Immunity 2009). We are interested in understanding the acquisition of specificity, the dynamic balance of opposing enzymes (Cell 2009), the delineation of epigenetic domains on the genome (Bioinformatics 2009), the interplay of these factors with other components of the gene regulatory network (Nature Immunology 2016), and their functional consequences in normal and cancer cells (Cancer Cell 2018, Nature Communications 2018, Cancer Research 2020). In particular, we are interested in how epigenetic regulators establish, maintain, and decommission enhancers, the cell-type-specific regulatory elements on genomes that are critical for cell-type specificity (eLife 2013, PNAS 2016, Nature Communications 2017, Nucleic Acids Research 2017, Nature Genetics, in press).
Genome spatial architecture and its role in gene regulation
Recent advances in genome technology have enabled high-resolution genome-wide profiling of the folding of the genome, which is found to be organized in an intricate hierarchical manner with different features at different length scales. Understanding the connection between structure and function has been a central theme of biophysics (e.g., protein folding) and biology, and the characterization of the three-dimensional (3D) genome heralds a new chapter in this pursuit. As the master weaver of the 3D genome, the CTCF protein is a main focus of our study. We found that CTCF’s genomic binding and CTCF-mediated interactions exhibit significant variations between cell types, which contribute to the gene regulatory network underlying cell identity. We developed a machine learning approach that can accurately infer chromatin interactions mediated by CTCF using genomic and epigenetic features (Nature Communications 2018). More recently, we found that CTCF serves as a transcriptional cofactor in addition to its conventional role as an insulator. A transcription factor can recruit CTCF to shape genomic architecture, enhance regulatory interactions, maintain cell identity, and promote T cell homeostasis (Nature Communications 2021, Nature Immunology 2022). We are interested in understanding the coordination of the dichotomous roles of CTCF, and the interplay of transcription factors and CTCF in general. In addition, we are trying to identify novel structures of the 3D genome at intermediate scales from a network perspective (bioRxiv, 2022).
Post-transcriptional gene regulation
Gene expression is regulated not only at the stage of transcription but also during RNA processing and degradation. Alternative splicing, including intron retention, is a major contributor to diversity in the transcriptome and proteome. We identified regulated intron retention as a novel post-transcriptional regulation mechanism that broadly regulates T cell activation (Nucleic Acids Research, 2016). Recently, we conducted Bru-Seq and BruChase-Seq to quantify transcriptome-wide transcript lifetimes. We confirmed rapid degradation of intron-containing transcripts. We also identified a novel post-transcriptional regulatory pathway for NFkB1 mRNA mediated by the RNA-binding protein LARP4 (Nucleic Acids Research, 2020). We are interested in exploring the role of regulated intron retention in physiological and pathological conditions.
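As a rough illustration of how a labeling-and-chase measurement can be turned into a transcript lifetime, the sketch below assumes simple first-order (exponential) decay and two abundance measurements of the labeled transcript. The function name, the input values, and the decay model are assumptions made for illustration; this is not the analysis pipeline used in the cited work.

```python
import math

def half_life(abundance_start, abundance_end, chase_hours):
    """Estimate a transcript half-life, assuming first-order decay.

    abundance_start / abundance_end: labeled-transcript abundance at the
    beginning and end of the chase (hypothetical, arbitrary units).
    chase_hours: length of the chase period in hours.
    """
    if abundance_start <= 0 or abundance_end <= 0 or chase_hours <= 0:
        raise ValueError("abundances and chase time must be positive")
    if abundance_end >= abundance_start:
        return math.inf  # no measurable decay over the chase
    # N(t) = N0 * exp(-k t)  =>  k = ln(N0 / N(t)) / t
    decay_rate = math.log(abundance_start / abundance_end) / chase_hours
    return math.log(2) / decay_rate

# Hypothetical example: a transcript drops to a quarter of its starting
# level during a 6-hour chase, i.e. two half-lives have elapsed.
print(half_life(100.0, 25.0, 6.0))
```

Under this model, a transcript with a retained intron that decays faster than its fully spliced counterpart would simply show a shorter estimated half-life.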
Algorithm development for analysis of omics data
Functional genomics data from high-throughput sequencing have enabled quantitative, holistic characterization of cellular states. Making sense of these data presents a real challenge. We develop bioinformatic algorithms for multi-scale integrative analysis of multi-omics data. We developed SICER (Bioinformatics, 2009), an algorithm for noise filtering in diffuse epigenomic profiles. It has become one of the standard analytical tools in the community (cited ~1000 times, Google Scholar) and has been incorporated into leading free (Galaxy) and proprietary (Genomatix) software platforms. More recently, we have developed computational methods for analyzing the chromatin interactome. LOLIPOP is a machine-learning algorithm that achieves superior predictive performance and informs the determinants of CTCF-mediated chromatin interactions (Nature Communications 2018). HicHub (bioRxiv, 2022) employs a network approach for detecting hubs of chromatin interaction changes between two conditions. These hubs turn out to be more functionally relevant and provide mechanistic insights into the regulatory pathways. We also developed IRTools, a computational toolset to quantify intron retention from RNA-Seq libraries at both the gene and individual intron levels.
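To give a flavor of domain calling in diffuse profiles, the following is a minimal sketch of gap-tolerant clustering: windows with enough reads are grouped into "islands", allowing a few low-signal windows in between. It is a simplified illustration only, not SICER's implementation; it omits the statistical scoring against a background model, and the function name, parameters, and read counts are all hypothetical.

```python
def find_islands(window_counts, min_count, max_gap):
    """Cluster eligible windows into islands, tolerating short gaps.

    window_counts: read counts per consecutive genomic window.
    min_count: minimum count for a window to be considered eligible.
    max_gap: maximum number of ineligible windows allowed between two
             eligible windows of the same island.
    Returns a list of (first_window, last_window, total_reads) tuples.
    """
    islands = []
    start = None          # index of the first window of the open island
    last_eligible = None  # index of the most recent eligible window
    for i, count in enumerate(window_counts):
        if count < min_count:
            continue
        if start is None:
            start = i
        elif i - last_eligible - 1 > max_gap:
            # Gap too wide: close the open island and start a new one.
            total = sum(window_counts[start:last_eligible + 1])
            islands.append((start, last_eligible, total))
            start = i
        last_eligible = i
    if start is not None:
        total = sum(window_counts[start:last_eligible + 1])
        islands.append((start, last_eligible, total))
    return islands

# Hypothetical per-window read counts along a short stretch of genome.
counts = [0, 5, 6, 0, 7, 0, 0, 0, 4, 1]
print(find_islands(counts, min_count=4, max_gap=1))
# → [(1, 4, 18), (8, 8, 4)]
```

Allowing gaps is what lets broad, patchy signals such as some histone modifications be reported as one domain rather than many fragmented peaks; a real tool would additionally assess each island's significance.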
Theoretical modeling of evolution
Evolution is the guiding principle for the entire biological world. My research interests focus on mathematical modeling of evolution, including models motivated by laboratory evolution experiments.
Soft condensed matter and statistical physics
Soft condensed matter systems include polymer systems, liquid crystal systems, colloidal systems, granular systems and biological systems. Due to the interplay of extended objects, disorder and interaction, these systems exhibit very rich and interesting behaviors. My interests in this general area center on ordering and transport properties of such systems.