Some of the most prestigious companies these days spend large amounts
of money to build intelligent search engines that allow casual users
to find what they want anytime, from anywhere, and in any language.
In this course, we will cover the theory and practice behind the
implementation of search engines, focusing on a wide range of topics
including methods for text storage and retrieval, the structure of the
Web as a graph, evaluation of systems, and user interfaces.
The students will be responsible for 3 assignments and a final
project. The final project will be presented in class in a poster session. There will be no final exam.
Here are some sample assignments:
Sep 06
[ppt]
|
- Introduction.
- Queries and Documents. Models of Information retrieval. The Boolean model. The Vector model.
|
Sep 13
[ppt] |
- Document preprocessing. Tokenization. Stemming. The Porter algorithm. Storing, indexing and searching text. Inverted indexes.
- Word distributions. The Zipf distribution. The Benford distribution. Heap's law. TF*IDF. Vector space similarity and ranking.
|
Sep 20
[ppt]
|
- Retrieval evaluation. Precision and Recall. F-measure. Reference collections. The TREC conferences.
- Automated indexing/labeling. Compression and coding. Optimal codes.
|
Sep 27 [ppt] |
- String matching. Approximate matching.
- Query expansion. Relevance feedback.
|
Oct 04 [ppt] |
- Text classification. Naive Bayes. Feature selection. Decision trees.
|
Oct 11 [ppt from previous lecture] |
- Linear classifiers. k-nearest neighbors. Perceptron. Kernel methods. Maximum-margin classifiers. Support vector machines. Semi-supervised learning.
|
Oct 18 [ppt] |
- Lexical semantics and Wordnet.
- Latent semantic indexing. Singular value decomposition.
|
Oct 25 [ppt] |
- Vector space clustering. k-means clustering. EM clustering.
|
Nov 01 [ppt] |
- Random graph models. Properties of random graphs: clustering coefficient, betweenness, diameter, giant connected component, degree distribution.
- Social network analysis. Small worlds and scale-free networks. Power law distributions. Centrality.
|
Nov 08 [ppt] |
- Graph-based methods. Harmonic functions. Random walks.
|
Nov 15
|
- Guest lecture (speakers from Yahoo! and Google)
- Duplicate Detection on the web - Sergei Vassilvitskii, Yahoo! Research, New York [PDF]
- Natural Language Processing and Search - Sasha Blair-Goldensohn, Google, New York [PDF]
|
Nov 29 [ppt][ppt] |
- PageRank. Hubs and authorities. Bipartite graphs. HITS.
- Models of the Web.
- Crawling the web. Webometrics. Measuring the size of the web. The Bow-tie-method.
- Hypertext retrieval. Web-based IR. Document closures. Focused crawling.
- Question answering
|
Dec 06 [ppt] |
- Burstiness. Self-triggerability
- Information extraction
- Adversarial IR.
- Human behavior on the web.
- Text summarization
|
Possible topics
|
- Discovering communities, spectral clustering
- Semi-supervised retrieval
|
Dec20 (7-10 pm) |
|
|
- Additional topics (if time permits):
- Natural language processing. XML retrieval. Text tiling. Human behavior on the web.
|