COM SCI M229S-2 / BIOL CH M229S-2 / HUM GEN M229S-2: Machine Learning for Bioinformatics (Spring 2016)

Lecture: Tuesday / Thursday 4:00pm - 6:00pm, MS5137

Course description

What genes cause cancer? Have we inherited genes from Neanderthals? How does a single genome code for the different cells?

We can now begin to answer these fascinating questions in biology because the cost of genome sequencing has fallen faster than Moore's law. The bottleneck in answering these questions has shifted from data generation to powerful statistical models and inference algorithms that can make sense of this data. Statistical machine learning provides an important toolkit in this endeavor. Further, biological datasets offer new challenges to the field of machine learning.

We will learn about probabilistic models, inference and learning in these models, model assessment, and interpreting the inferences to address the biological questions of interest. The course aims to introduce CS/Statistics students to an important set of problems and Bioinformatics/Human Genetics students to a rich set of tools.

Prerequisites

Familiarity with probability, statistics, linear algebra and algorithms is expected. No familiarity with biology is needed.

Contact Info

Instructor: Sriram Sankararaman
Office Hours: Boelter 4531D, Tuesday 10:00am - 11:00am (or by appointment)
Email: sriram at cs dot ucla dot edu

Textbooks

There is no formal textbook. Readings will be posted as needed. The following texts will serve as useful references:

Machine Learning: A Probabilistic Perspective by Kevin Murphy.
The Elements of Statistical Learning by Trevor Hastie, Robert Tibshirani and Jerome Friedman.
Biological Sequence Analysis by Richard Durbin, Sean Eddy, Anders Krogh and Graeme Mitchison.
Principles of Population Genetics by Daniel Hartl and Andy Clark.

Course format

Readings: Each class will be assigned one or two readings. At the end of the class, please post a short summary, comments, or critiques of the readings to CCLE.
Scribed lecture notes: Each student will be assigned one lecture to scribe. The scribed lectures will be due one week after the assigned lecture. A LaTeX template will be made available for scribing.
Homework: There will be three homeworks. Questions on the homework will include programming exercises and data analyses as well as questions drawn from the assigned readings. You are free to use a programming language of your choice, though R is preferred. The homeworks must be submitted in hard copy in class on the day they are due. Late submissions will not be accepted.
You are free to discuss the homework problems. However, you must write up your own solutions. You must also acknowledge all collaborators.
Project: A major component of this course will be an open-ended project. The project can focus on the development of a statistical model or algorithm for a biological problem, or on the application of an existing technique. I will post a list of potential projects on CCLE. You are welcome to propose any project that is relevant to the course, including rotation projects. Each group should decide on their project by the third week. The group will be expected to present their project in class near the end of the quarter and submit a project report.

Grading

Project: 50% (30% paper, 20% presentation)
Homeworks: 30%
Scribing: 10%
Readings: 10%

Acknowledgments

The course website is based on material developed by Ameet Talwalkar and Fei Sha. Some of the administrative content on the course website is adapted from material by Jenn Wortman Vaughan, Rich Korf, and Alexander Sherstov.

Tentative Schedule

Date	Topics	Reading	HW
3/29	Introduction to genomics	Big Data: Astronomical or Genomical?
3/31	Introductory statistics.	Storey, False Discovery Rates
4/5	Multiple testing. Association studies: linear and logistic regression	Storey, False Discovery Rates
4/7	Guest lecture
4/12	GWAS. Bayesian statistics. Ridge regression	Eskin, CACM 2015; Okser et al., Regularized Machine Learning in the Genetic Prediction of Complex Traits	Homework 1; Data for Homework 1
4/14	Guest lecture
4/19	Bayesian and sparse regression. Linear Mixed Models. Heritability	Zuk et al., PNAS 2011. Additional: Yang et al., Nature Genetics 2010
4/21	Latent Variable Models: Clustering, mixture models and the EM algorithm
4/26	Latent Variable Models: PCA and admixture models
4/28	Application: Population structure and stratification
5/3	Directed Graphical Models
5/5	Hidden Markov Models	Li and Stephens, Genetics 2003	Homework 2; Data for Homework 2
5/10	Undirected graphical models and trees. Sum-product algorithm and MCMC (Gibbs sampling). Application to admixture models
5/12	Approximate inference: MCMC (Metropolis-Hastings) and variational inference
5/17	Kernel machines and Gaussian processes. Application: Rare-variant association test
5/19	Bayesian nonparametrics: Dirichlet process		Homework 3; Data for Homework 3
5/24	Class cancelled
5/26	Genomic privacy: Learning with privacy constraints
5/30	Project presentations
6/03	Project presentations