Final Study Guide
General comments
- The final test is open-book: you may refer to your text, your lecture notes, and your study notes.
- The final test covers material since the midterm -- it does not focus on the biosequence analysis we did before the midterm. However, many of the earlier ideas from the course are needed to understand the later material.
- You should know the basic rules of probability and ideas from machine learning (e.g., training a model, log-odds scores, linear classifiers), but you will not be asked to reproduce the longer derivations from class or to prove new results.
- You should be able to give basic biological motivation for the biosequence and gene expression analysis problems we have discussed.
- You should be able to describe all the algorithms and methods we discussed in class and give examples of their use, but you will not be asked to formally derive or justify them.
Topics
- Neural nets for secondary structure prediction
- Protein primary, secondary, tertiary structure
- Alpha helix, beta strand, coil
- Neural net topology for secondary structure prediction
- Basic idea of what neural nets are
- Gene expression data and clustering
- What is gene expression, what do microarrays measure, cDNA vs oligonucleotide arrays
- Difference between clustering and classification, supervised and unsupervised machine learning
- Clustering algorithms
- Hierarchical clustering
- k-means clustering
- Goals of clustering gene expression data
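As a review aid for the k-means item above, here is a minimal sketch of the usual assign-then-update loop in plain Python. The function name `kmeans`, the random choice of initial centers, and the toy expression profiles are illustrative assumptions, not the version presented in class.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two expression profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, n_iter=100, seed=0):
    """Cluster points into k groups; returns (labels, centers)."""
    rng = random.Random(seed)
    # Assumption: initialize centers from k randomly chosen data points.
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest center.
        labels = [min(range(k), key=lambda j: sq_dist(p, centers[j]))
                  for p in points]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append([sum(col) / len(members)
                                    for col in zip(*members)])
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        if new_centers == centers:  # no center moved: converged
            break
        centers = new_centers
    return labels, centers

# Toy data: six hypothetical genes measured under two conditions.
profiles = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
            [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
labels, centers = kmeans(profiles, k=2)
print(labels)
```

Unlike hierarchical clustering, this requires choosing k in advance, and the result can depend on the initial centers.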
- Classification
[References: Golub paper, Burges SVM tutorial]
- k-nearest neighbor
- Fisher's linear discriminant
- Support vector machines
- Linear classifiers, geometric margin
- Hard margin SVM, dual optimization problem, interpretation of the weights (Lagrange multipliers)
- Soft margin SVMs
- Kernel trick, examples of kernels
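Two of the SVM items above can be checked numerically with a small sketch: the geometric margin of a linear classifier, and the kernel trick for the degree-2 polynomial kernel. The helper names, the weight vector, and the toy points are assumptions chosen for illustration; no SVM is trained here.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def geometric_margin(w, b, X, y):
    """Smallest signed distance y_i * (w . x_i + b) / ||w|| over the data."""
    norm = math.sqrt(dot(w, w))
    return min(yi * (dot(w, xi) + b) / norm for xi, yi in zip(X, y))

def poly2_kernel(x, z):
    """Homogeneous degree-2 polynomial kernel: K(x, z) = (x . z)^2."""
    return dot(x, z) ** 2

def poly2_features(x):
    """Explicit feature map for poly2_kernel on 2-dimensional inputs."""
    x1, x2 = x
    return [x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2]

# Toy linearly separable data with labels +1 / -1.
X = [(2.0, 0.0), (3.0, 1.0), (-2.0, 0.0), (-3.0, -1.0)]
y = [1, 1, -1, -1]
print(geometric_margin((1.0, 0.0), 0.0, X, y))  # 2.0

# Kernel trick: the kernel value equals a dot product in feature space,
# computed without ever building the feature vectors.
x, z = (1.0, 2.0), (3.0, 4.0)
print(poly2_kernel(x, z))                           # 121.0
print(dot(poly2_features(x), poly2_features(z)))    # approximately 121
```

The hard margin SVM picks the `w` and `b` that maximize this geometric margin; the dual problem only needs kernel values between pairs of training points, which is why the explicit feature map is never required.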
- Bayes nets for Regulatory Network Inference
[References: Hartemink paper, inferring subnetworks paper, minreg paper]
- Basic ideas of transcriptional regulation: promoter regions for genes and transcription factors
- Bayes net models: graphical model, conditional independencies, joint probability distribution, parameters, arrows do not imply causality
- Hartemink paper: Bayesian score, statistically validating candidate models
- Pe'er "Inferring subnetworks" paper: use of knock-out data modeled by "interventions" in graph, greedy search for high scoring structure, bootstrapping to get high confidence features
- Pe'er "Minreg" paper: simplified network model, mutual information score, goal of minreg algorithm
- Motif discovery
[References: REDUCE paper, MEME paper]
- REDUCE algorithm -- understand the basic model and motivation
- MEME algorithm -- mixture model and use of expectation maximization
- Gene-finding
[References: Burge paper on GENSCAN, Haussler review article, TWINSCAN paper, Comparative annotation paper]
- Different approaches to gene-finding: biological, database-driven, microarray techniques, computational gene-finding models
- Gene-finding in prokaryotes, ORFs
- GENSCAN model
- Hidden semi-Markov model
- State diagram for GENSCAN, how it models genomic sequence
- Prediction using "Viterbi parse"
- TWINSCAN model
- Use of orthologous sequences, conservation sequence
- How TWINSCAN adds conservation information to the GENSCAN model
- Whole-genome comparative annotation
- Graph-theoretic comparison of full yeast genomes
- Basic idea (from Lander paper) of how to use comparative genomics for motif discovery
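For the prokaryotic gene-finding and ORF item in this section, the following is a minimal sketch of an open reading frame scan. It assumes the standard stop codons, ATG as the only start codon, and the forward strand only; `find_orfs` and `min_codons` are hypothetical names, and real gene-finders also score the reverse strand and sequence composition.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=2):
    """Return sorted (start, end) pairs, end exclusive, for ATG...stop ORFs.

    min_codons counts coding codons (including ATG, excluding the stop).
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Step through this reading frame one full codon at a time.
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # open an ORF at the first ATG
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None                   # close the ORF at the stop
    return sorted(orfs)

dna = "CCATGAAATTTTAGGG"
for start, end in find_orfs(dna):
    print(start, end, dna[start:end])  # 2 14 ATGAAATTTTAG
```

Long ORFs are a useful signal in prokaryotes because genes there lack introns; in eukaryotes, exon and intron structure is the reason models such as GENSCAN are needed.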
- Protein Classification
[References: Fisher kernel paper, mismatch kernel paper]
- SCOP hierarchy, remote homology detection problem
- Fisher kernel for profile HMMs
- String kernels like the mismatch kernel
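To make the string kernel item concrete, here is a naive sketch of a (k, m)-mismatch kernel: each k-mer in a sequence contributes a count to every k-mer within m mismatches of it, and the kernel is the dot product of the resulting count vectors. The function names and the two-letter toy alphabet are assumptions for illustration, and the neighborhoods are enumerated by brute force, so this is only practical for small k and m.

```python
from collections import Counter
from itertools import combinations, product

def neighborhood(kmer, m, alphabet):
    """All strings within Hamming distance m of kmer."""
    result = {kmer}
    for n_sub in range(1, m + 1):
        for positions in combinations(range(len(kmer)), n_sub):
            for letters in product(alphabet, repeat=n_sub):
                chars = list(kmer)
                for pos, letter in zip(positions, letters):
                    chars[pos] = letter
                result.add("".join(chars))
    return result

def mismatch_features(seq, k, m, alphabet):
    """Feature map: counts indexed by k-mers, allowing up to m mismatches."""
    feats = Counter()
    for i in range(len(seq) - k + 1):
        for neighbor in neighborhood(seq[i:i + k], m, alphabet):
            feats[neighbor] += 1
    return feats

def mismatch_kernel(s, t, k, m, alphabet):
    """Dot product of the two mismatch feature vectors."""
    fs = mismatch_features(s, k, m, alphabet)
    ft = mismatch_features(t, k, m, alphabet)
    return sum(count * ft[kmer] for kmer, count in fs.items())

# With m = 0 this reduces to an exact k-mer (spectrum) kernel.
print(mismatch_kernel("ABAB", "ABB", k=2, m=0, alphabet="AB"))  # 2
# "AB" and "BA" share no exact 2-mer, but do share mismatch neighbors.
print(mismatch_kernel("AB", "BA", k=2, m=0, alphabet="AB"))     # 0
print(mismatch_kernel("AB", "BA", k=2, m=1, alphabet="AB"))     # 2
```

The last two lines show why mismatches matter for remote homology detection: sequences with no exactly shared k-mers can still have a nonzero similarity.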