CS M226 / BIOINF M226 / HUMGEN M226: Machine Learning for Bioinformatics (Fall 2017)

Lecture: Tuesday / Thursday 4:00pm - 6:00pm, Botany 325

Course description

What genes cause cancer? Have we inherited genes from Neanderthals? How does a single genome code for the diverse functions that we see?

We can now begin to answer these fascinating questions in biology because the cost of genome sequencing has fallen faster than Moore's law. The bottleneck in answering these questions has shifted from data generation to powerful statistical models and inference algorithms that can make sense of this data. Statistical machine learning provides an important toolkit in this endeavor. Further, biological datasets offer new challenges to the field of machine learning.

We will learn about probabilistic models, inference and learning in these models, model assessment, and interpreting the inferences to address the biological questions of interest. The course aims to introduce CS/Statistics students to an important set of problems and Bioinformatics/Human Genetics students to a rich set of tools.

Prerequisites

Familiarity with probability, statistics, linear algebra and algorithms is expected. No familiarity with biology is needed.

Contact Info

Instructor: Sriram Sankararaman
Office hours: Boelter 4531D, Wednesday 11:00am - noon
Email: sriram at cs dot ucla dot edu

Teaching assistant: Gleb Kichaev
Office hours: CHS 33-355, Monday 2:00pm - 4:00pm
Email: glebkichaev at gmail dot com

Textbooks

There is no formal textbook. Readings will be posted as needed. The following texts will serve as useful references:

Machine Learning: A Probabilistic Perspective by Kevin Murphy.
The Elements of Statistical Learning by Trevor Hastie, Robert Tibshirani and Jerome Friedman.
Biological Sequence Analysis by Richard Durbin, Sean Eddy, Anders Krogh and Graeme Mitchison.
Principles of Population Genetics by Daniel Hartl and Andy Clark.

Course format

Homework (50%): There will be periodic homeworks. Questions on the homework will include programming exercises and data analyses.
- We will use Gradescope to manage submission of homeworks.
- Homeworks are due at 11:59pm on the due date.
- Late submissions will not be accepted.
- All solutions must be clearly written (or typed); unreadable answers will not be graded. We encourage using LaTeX to type out answers.
- Solutions will be graded on both correctness and clarity. If you cannot solve a problem completely, you will get more partial credit by identifying the gaps in your argument than by attempting to cover them up.
- You are strongly encouraged to use R (R is free software; see here for details).
You are free to discuss the homework problems. However, you must write up your own solutions. You must also acknowledge all collaborators.
Exams (Mid-term: 20%, Final: 30%): There are two exams scheduled for Nov 7 and Dec 7. Exams are in-class, closed-book and closed-notes and will cover materials from the lectures and problem sets. No alternate or make-up exams will be administered, except for disability/medical reasons documented and communicated to the instructor prior to the exam date. In particular, exam dates and times cannot be changed to accommodate scheduling conflicts with other classes.

A tentative syllabus

Acknowledgments

The course website is based on material developed by Ameet Talwalkar and Fei Sha. Some of the administrative content on the course website is adapted from material from Jenn Wortman Vaughan, Rich Korf, and Alexander Sherstov.

Tentative Schedule

Date	Topics	Problem Sets
09/28	Introduction to genomics	Problem Set 0
10/03	Introduction to statistics. Multiple hypothesis testing.
10/05
10/10
10/12
10/17
10/19
10/24
10/26
10/31
11/02
11/07
11/09
11/14
11/16
11/21	No class
11/23	Thanksgiving
11/28
11/30
12/05
12/07