Syllabus
- Introduction.
- Queries and Documents. Models of Information retrieval. The
Boolean model. The Vector model.
- Document preprocessing. Tokenization. Stemming. The Porter
algorithm. Storing, indexing and searching text. Inverted
indexes.
- Word distributions. The Zipf distribution. The Benford
distribution. Heaps' law. TF*IDF. Vector space similarity and
ranking.
- Retrieval evaluation. Precision and Recall. F-measure. Reference
collections. The TREC conferences.
- Automated indexing/labeling. Compression and coding. Optimal
codes.
- String matching. Approximate matching.
- Query expansion. Relevance feedback.
- Text classification. Naive Bayes. Feature selection. Decision
trees.
- Linear classifiers. k-nearest neighbors. Perceptron. Kernel
methods. Maximum-margin classifiers. Support vector
machines. Semi-supervised learning.
- Lexical semantics and WordNet.
- Latent semantic indexing. Singular value decomposition.
- Vector space clustering. k-means clustering. EM clustering.
- Random graph models. Properties of random graphs: clustering
coefficient, betweenness, diameter, giant connected component,
degree distribution.
- Social network analysis. Small worlds and scale-free
networks. Power law distributions. Centrality.
- Graph-based methods. Harmonic functions. Random walks.
- PageRank. Hubs and authorities. Bipartite graphs. HITS.
- Models of the Web.
- Crawling the web. Webometrics. Measuring the size of the web. The bow-tie method.
- Hypertext retrieval. Web-based IR. Document closures. Focused crawling.
- Question answering.
- Burstiness. Self-triggerability.
- Information extraction.
- Adversarial IR.
- Human behavior on the web.
- Text summarization.
Other possible topics
- Discovering communities. Spectral clustering.
- Semi-supervised retrieval.
- Natural language processing. XML retrieval. Text tiling.