University of California, Irvine (UCI)
School of Information and Computer Sciences (ICS)
Institute for Genomics and Bioinformatics (IGB)

Probabilistic Modeling
of Biological Data

ICS 284B
Pierre Baldi



Course - Prerequisites - Textbook - Grading - Schedule - Other

Course Goals and Description

This is a graduate-level course on probabilistic modeling of biological data. The course covers computational approaches to understanding and predicting the structure, function, interactions, and evolution of DNA, RNA, proteins, and related molecules and processes. The emphasis is on providing a unified Bayesian statistical framework to mine the large, noisy data sets that are becoming the hallmark of modern biology. The methods taught focus on developing the structure of the models, on model-fitting algorithms (machine learning), and on the application of the resulting models (data mining). Most applications will revolve around DNA, RNA, protein sequence, and gene-expression-array data, but other types of data will also be considered depending on participants' interests.

The official catalog description is:

ICS 284B: Probabilistic Modeling of Biological Data. A unified Bayesian probabilistic framework for modeling and mining biological data. Applications range from sequence (DNA, RNA, proteins) to gene expression data. Graphical models, Markov models, stochastic grammars, neural networks, structure prediction, gene finding, evolution, DNA arrays (single- and multiple-gene analysis).


Prerequisites


A basic course in algorithms (ICS 161 or equivalent) and in molecular biology (Bio Sci 99 or equivalent), or ICS 277A (or equivalent), or consent of instructor. The course assumes some background in biology and basic knowledge of probability, statistics, and programming.



Textbooks


Bioinformatics: the Machine Learning Approach
Pierre Baldi and Soren Brunak, Second Edition, 2001 (MIT Press).

DNA Microarrays and Gene Expression: From Experiments to Data Analysis and Modeling
Pierre Baldi and G. Wesley Hatfield, 2002 (Cambridge University Press).


Grading


Students will read articles from the literature. Grading will be based on participation in class discussions, presentations, and possibly a final project requiring a computational analysis of biological data, which will result in a brief (5-10 pages) conference-style written report. Additional assignments may include homework.




Tentative Schedule

N.B.: Schedule may change to follow class interest, schedule outside speakers, etc.

- Week 1: Introduction to Bioinformatics. Probabilistic Modeling: the Bayesian Statistical Framework.
- Week 2: Graphical Models. Simple Markov Models of Biological Sequences (HMMs).
- Week 3: Hidden Markov Models of Biological Sequences.
- Week 4: HMMs, Probabilistic Models of Genes, and Gene Finding Algorithms.
- Week 5: Probabilistic Models of Evolution and Phylogenetic Trees. Stochastic Grammars and Languages.
- Week 6: Stochastic Context Free Grammars and RNA Secondary Structure. Beyond Context Free Grammars.
- Week 7: Probabilistic Modeling and Neural Networks. Machine Learning Approaches for Protein Structure Prediction.
- Week 8: Machine Learning Approaches for Other Problems (Signal Peptides, etc.). DNA Microarray Data and Gene Regulation.
- Week 9: Probabilistic Modeling of DNA Microarrays: Single-Gene Level. Probabilistic Modeling of DNA Microarrays: Multiple-Gene Level. Gene and Protein Networks. Systems Biology.
- Week 10: Project Presentations.





Texts on reserve at the UCI Science Library

- Bioinformatics: the Machine Learning Approach by Pierre Baldi and Soren Brunak.
- DNA Microarrays and Gene Expression by Pierre Baldi and G. Wesley Hatfield.
- Biological Sequence Analysis by Richard Durbin et al.
- Introduction to Protein Structure by Carl Branden and John Tooze.
- Introduction to Computational Biology by Michael S. Waterman.
- Artificial Intelligence and Molecular Biology edited by Lawrence Hunter.
- Mathematical Methods for DNA Sequences edited by Michael S. Waterman.



Relation to Other Courses

This course is intended to complement the existing "hands-on" computer-based courses Biological Sciences 123/223 (Computer Applications in Molecular Biology/Computational Molecular Biology), which give a very practical introduction to using computer tools in molecular biology. In contrast, this course emphasizes the development of probabilistic models and machine learning approaches for the analysis of biological data. This course is also intended to closely complement the existing ICS course "Representations and Algorithms for Molecular Biology" (currently ICS-277 and scheduled to become ICS-277A). Relative to that course, this course emphasizes modeling and analysis of biological data using a probabilistic framework. The probabilistic approach is essential to account for biological variability brought about by evolutionary tinkering. The course can be viewed as data mining, machine learning, and probabilistic algorithms, concentrated on biological data sets, especially sequence data, but also including other data sets, such as gene expression data, depending on student interest.

There is essentially no overlap between this course and ICS 246 or ICS 248. There is a small overlap with ICS 275B and with ICS 283. The overlap with 275B is in the use of graphical models. Not all the graphical models used in this course, however, are Bayesian networks. Furthermore, the Bayesian networks used in this course are very specialized and come with their own algorithms (forward-backward, inside-outside, etc.). There is also a small overlap with ICS 273 (machine learning), but the approach in this course is more probabilistic and, once more, focused exclusively on biological problems. This course could benefit students who have taken ICS 275B and/or ICS 273 by deepening their understanding of graphical-model and machine-learning concepts and letting them apply those concepts systematically to problems in biology.

Finally, this course complements a course such as 223 (Molecular Biology and Biochemistry) by focusing on the application of computational methods to the solution of biological problems.

This course is part of the new ICS concentration: Informatics in Biology and Medicine.




© 2017 Pierre Baldi | pfbaldi [at] uci [dot] edu