Mining Massive Amounts of Genomic Data: A Semiparametric Topic Modeling Approach

Ethan X. Fang, Min Dian Li, Michael I. Jordan, Han Liu

Research output: Contribution to journalArticlepeer-review

3 Scopus citations


Characterizing the functional relevance of transcription factors (TFs) in different biological contexts is pivotal in systems biology. Given the massive amount of genomic data, computational identification of TFs is emerging as a useful approach to bridge functional genomics with disease risk loci. In this article, we use large-scale gene expression and chromatin immunoprecipitation (ChIP) data corpuses to conduct high-throughput TF-biological context association analysis. This work makes two contributions: (i) From a methodological perspective, we propose a unified topic modeling framework for exploring and analyzing large and complex genomic datasets. Under this framework, we develop new statistical optimization algorithms and semiparametric theoretical analysis, which are also applicable to a variety of large-scale data analyses. (ii) From an experimental perspective, our method generates an informative list of tumor-related TFs and their possible effected tumor types. Our data-driven analysis of 38 TFs in 68 tumor biological contexts identifies functional signatures of epigenetic regulators, such as SUZ12 and SET-DB1, and nuclear receptors, in many tumor types. In particular, the TF signature of SUZ12 is present in a broad range of tumor types, many of which have not been reported before. In summary, our work established a robust method to identify the association between TFs and biological contexts. Given the limited amount of genome-wide binding profiles of TFs and the massive number of expression profiles, our work provides a useful tool to deconvolute the gene regulatory network for tumors and other biological contexts. Supplementary materials for this article are available online.

Original languageEnglish (US)
Pages (from-to)921-932
Number of pages12
JournalJournal of the American Statistical Association
Issue number519
StatePublished - Jul 3 2017

All Science Journal Classification (ASJC) codes

  • Statistics and Probability
  • Statistics, Probability and Uncertainty


Dive into the research topics of 'Mining Massive Amounts of Genomic Data: A Semiparametric Topic Modeling Approach'. Together they form a unique fingerprint.

Cite this