COMS 6998 Search Engine Technology

Fall 2007

Thursdays 6:10-8 PM, first class: September 6, last class: December 6, final presentations: TBA (between Dec 14 and Dec 21)

833, Mudd

Dragomir R. Radev

Goal of the course

A significant portion of the information that surrounds us is in textual format. A number of techniques for accessing such information exist, ranging from databases to natural language processing.

Some of the most prestigious companies these days spend large amounts of money to build intelligent search engines that allow casual users to find what they want anytime, from anywhere, and in any language.

In this course, we will cover the theory and practice behind the implementation of search engines, focusing on a wide range of topics including methods for text storage and retrieval, the structure of the Web as a graph, evaluation of systems, and user interfaces.

Instructor

Dragomir R. Radev obtained a PhD in Computer Science from Columbia in 1999. He has taught numerous courses in Natural Language Processing, Databases, and Information Retrieval. He has received a number of awards including the Columbia CS department award for graduate student teaching, the University of Michigan UROP award for outstanding research mentorship, as well as the Gosnell Prize for excellence in political methodology. As undergraduate he has been on a team that finished in the top 10 at the ACM international computer programming contest and, later, has been the coach of Columbia's team for 4 years, taking it to three international finals. Dragomir has worked in different capacities for places like Microsoft Research, IBM Research, MITRE, AT&T Bell Labs, and Yahoo! Dragomir Radev is also the secretary of the Association for Computational Linguistics (ACL).

Format

The class will meet once a week for 2 hours.

The students will be responsible for 3 assignments and a final project. The final project will be presented in class in a poster session. There will be no final exam. Here are some sample assignments:

Final project

An open-ended research project in one of two categories:

Lecture Notes

Contact Information/Office Hours

Please email the QA list to get in touch with the professor and TAs. You will get a faster response if you email this list than emailing us individually! We will not be checking courseworks.

Office Hours:

Readings

WK1MRS1, MRS2, MRS5
WK2MRS7, MRS8
WK3MRS9
WK4MRS13,MRS14
WK5MRS15, MRS16
WK6MRS17
WK7MRS18, MRS19
WK8MRS20
WK9[Newman 2003], sections I, II, III, IV, VI, VII, and VIIIa
WK10MRS21
WK11MRS11, MRS 12, [Church and Gale '95]
WK12[Austin 2006], [Newman 2006], [Menczer 2004a] [Menczer 2004b]

Assignments

If you're new to Perl, here are a few Perl links.

Syllabus (dates are approximate)

Sep 06
[ppt]
  • Introduction.
  • Queries and Documents. Models of Information retrieval. The Boolean model. The Vector model.
Sep 13
[ppt]
  • Document preprocessing. Tokenization. Stemming. The Porter algorithm. Storing, indexing and searching text. Inverted indexes.
  • Word distributions. The Zipf distribution. The Benford distribution. Heap's law. TF*IDF. Vector space similarity and ranking.
Sep 20
[ppt]
  • Retrieval evaluation. Precision and Recall. F-measure. Reference collections. The TREC conferences.
  • Automated indexing/labeling. Compression and coding. Optimal codes.
Sep 27
[ppt]
  • String matching. Approximate matching.
  • Query expansion. Relevance feedback.
Oct 04
[ppt]
  • Text classification. Naive Bayes. Feature selection. Decision trees.
Oct 11
[ppt from previous lecture]
  • Linear classifiers. k-nearest neighbors. Perceptron. Kernel methods. Maximum-margin classifiers. Support vector machines. Semi-supervised learning.
Oct 18
[ppt]
  • Lexical semantics and Wordnet.
  • Latent semantic indexing. Singular value decomposition.
Oct 25
[ppt]
  • Vector space clustering. k-means clustering. EM clustering.
Nov 01
[ppt]
  • Random graph models. Properties of random graphs: clustering coefficient, betweenness, diameter, giant connected component, degree distribution.
  • Social network analysis. Small worlds and scale-free networks. Power law distributions. Centrality.
Nov 08
[ppt]
  • Graph-based methods. Harmonic functions. Random walks.
Nov 15
  • Guest lecture (speakers from Yahoo! and Google)
  • Duplicate Detection on the web - Sergei Vassilvitskii, Yahoo! Research, New York [PDF]
  • Natural Language Processing and Search - Sasha Blair-Goldensohn, Google, New York [PDF]
Nov 29
[ppt][ppt]
  • PageRank. Hubs and authorities. Bipartite graphs. HITS.
  • Models of the Web.
  • Crawling the web. Webometrics. Measuring the size of the web. The Bow-tie-method.
  • Hypertext retrieval. Web-based IR. Document closures. Focused crawling.
  • Question answering
Dec 06
[ppt]
  • Burstiness. Self-triggerability
  • Information extraction
  • Adversarial IR.
  • Human behavior on the web.
  • Text summarization
Possible topics
  • Discovering communities, spectral clustering
  • Semi-supervised retrieval
Dec20
(7-10 pm)
  • Student presentations.
  • Additional topics (if time permits):
  • Natural language processing. XML retrieval. Text tiling. Human behavior on the web.