COMS 6998-04 Search Engine Technology ===================================== HW 1 Building a basic information retrieval system Instructor: Dragomir Radev TA: Samreen Dhillon Akshay Pundle Contact: cs6998-SET-qa@lists.cs.columbia.edu Due: Thursday, October 4th 2007, 6 pm In this project, you will build a basic information retrieval system. Given a local HTML corpus, your code will segment the corpus into documents, process the documents, and build an inverted index from the documents. After building the index, your code will take a query, process it, find and rank documents matching that query, and return a ranked list of results with document snippets containing the query. Your code will also handle several other requests to get statistics from the index. The code should be as efficient as possible, so you will report the amount of time spent indexing and handling queries, as well as the amount of disk space used for the index. Extensions are possible for extra credit. Before you start ================ 1. Getting a CS account set up You will need CS accounts for this project. If you dont have a CS acocunt already, please use the following resource for information on how to set one up https://www.cs.columbia.edu/~crf/accounts/ 2. What is Clairlib? Read about Clairlib at the Clairlib main page : http://belobog.si.umich.edu/mediawiki/index.php/Main_Page Clairlib documentation is available at : http://belobog.si.umich.edu/mediawiki/index.php/Documentation Clairlib Manual : The complete clairlib manual is available as a pdf from your CS account. The path to this file is : /home/cs6998/clairlib/clairlib-fall07/clairlib-core/clTut.pdf The pdf is also available from : http://belobog.si.umich.edu/clair/clairlib/clTut.pdf You can access the clairlib pdoc at : http://belobog.si.umich.edu/clair/clairlib/pdoc/ 3. Using Clairlib with your CS account: Using Clairlib with your CS account is very easy. Log into your CS account on the clic machines and execute the following commands : export PERL5LIB=/home/cs6998/clairlib/clairlib-fall07/clairlib-core/lib:/home/cs6998/clairlib/perl-fall07:/home/cs6998/clairlib/perl-fall07/lib/perl5/site_perl/5.8.5 export PATH=$PATH:~cs6998/clairlib/clairlib-fall07/clairlib-core/util/ The first command tells perl where it will be able to find the Clairlib modules. The second command tells your shell where it shold look for the Clairlib util programs. You should be able to run the Clairlib utilities programs now. You could also include these commands as a part of your .bash_profile script etc. Preassignment ============= To get familiar with Clairlib, you are required to complete the Chemical example from the following tutorial : http://belobog.si.umich.edu/mediawiki/index.php/Utilities_Tutorial The tutorial can also be found in the Clairlib manual (pdf) in chapter 6. You can safely skip the Installation step since Clairlib has already been installed and can be used easily from your CS account. Programming languages ===================== This assignment should be implemented in perl. The advantages are twofold: (a) CPAN and (b) clairlib. Both resources will help you with the assignment significantly. Your code should run on a standard linux machine. If any external libraries are needed, they should be included in your code submission along with an installation script. The TAs will only be allowed to do the following: % cd tas-test-dir-for-df2141 (where df2141 stands for your login ID) % gunzip df2141-install.tar.gz % tar xf df2141-install.tar % ./install --> your system's output here % ./index /path/to/raw/shakespeare/data --> your system's output here % ./query query1 --> your system's output here query2 --> your system's output here request1 --> your system's output here request2 --> your system's output here Grading ======= Working code: 40% API and documentation: 40% Efficiency and accuracy: 20% (up to 10% bonus will be given for particularly innovative extensions) Documentation ============= You should use perldoc/pdoc or equivalent to generate API documentation. Here is an example that uses pdoc: http://tangra.si.umich.edu/clair/clairlib/pdoc/ Requirements ============ 1. Standardized input document representation This assignment uses the collected plays of William Shakespeare (in HTML) as a testbed. Your code, however, should in principle work with arbitrary textual documents. You will therefore need to come up with a generic representation of your input documents. It can be something very simple, e.g., XML of this sort: ... ... ... ... ... ... ... ... In the future, if you need to index another type of input documents, all you should have to do is write a converter to the generic representation and the rest of the code should remain the same. 2. Queries You should handle at least the following query types: cat : returns any document that has the word "cat" in it cat dog : any document that has one or more of these words ("fuzzy or" is assumed by default) cat dog rat : up to 10 words in a query "tabby cat" : phrases of up to 5 words in length "small tabby cat" "shaggy dog" : multiple phrases in a query !cat !"tabby cat" : negations of single words or phrases !cat !dog : multiple negations per query You don't need to worry about nested queries in parentheses. Note that the default Boolean operator ("fuzzy or") above is an extension to "AND" and "OR" and supersedes both. 3. Ranking If multiple documents match, you should order them by decreasing score. The score is defined as the number of query terms (or phrases) that match in the "fuzzy or", including repetitions and counting phrases as the number of words that are included in them. Negated terms are not included in the score. For example, if the query is cat dog "pack rat" and you have three documents D1, D2, and D3 that contain at least one of the query terms as follows: D1: cat cat dog cat mouse D2: pack rat cat rat rat D3: rabbit elephant dog dog cat pack rat cat their scores should be as follows: D1: 4 D2: 3 ("pack rat" counts as two terms, but "rat" alone doesn't count) D3: 6 4. Extensibility Your code should be extensible. In a future assignment, if you need to add hyperlink functionality (e.g., indexing anchor text or running pagerank), it should be easy to do so. 5. Documents It is your responsibility to segment the input text into document units (e.g., pages, chapters, acts, units of N words, ...) . Each document should however have a title and an ID. A document should not be too long (an entire play) or too short (e.g., a paragraph). Components ========== Your system should consist of an API and a runtime part (a set of command-line based programs that use the API). API: You need to have the following classes (or modules). I have listed some typical methods that each class should implement. Not all items below correspond to specific methods. At the same time, you will probably need more methods in addition to the ones on the list. Items in parentheses are not required for this assignment but would in general be part of a practical system. Furthermore, the list of modules is only shown as a guide. You can deviate from it if needed. You can subclass the clairlib modules or create new versions of them. Clairlib already implements many of these methods. Note that this assignment is not web-specific. We will have a separate web-specific assignment later. Document - html parsing - stemming - lowercasing - tokenization - document segmentation - convert to common representation - summarization (pick snippets that contain the query terms) Statistics - number of documents - word distributions (stemmed or not) - size of index Index - build an inverted index - storing metadata Not required: - (add to an index) - (delete from index) - (optimizing the index) Query - preprocess Retrieval - document matching - document ranking Evaluation - timing the indexing part - timing a retrieval Error - print appropriate error messages if incorrect commands are typed RUNTIME: You should provide the following runtime programs: - install - index - query Here is what each of them should do: install - install your code in the "current" directory - your install script should not include the Shakespeare data set but should instead allow the user to give a path to it (the standard path under /home/cs6998/hw1/Shakespeare/) index - convert the raw Shakespeare data into the common representation - build an index - give some basic statistics about the indexing, including time elapsed for index, amount of disk space used, number of words, number of documents, etc. query - read queries from standard input and prints short summaries of matching documents as well as document ids - read requests from standard input and prints back statistics - see above for the required support for query types - the following request types should be supported: df "pack rat" - shows the number of documents in which "pack rat" appears in the index df rat - shows the number of documents in which rat appears in the index freq "pack rat" - shows how many times the phrase "pack rat" appears in the index doc 4562 - shows the full text of document 4562 (ideally through a pager) tf 4562 rat - shows the term frequency of rat in document 4562 title 4562 - show the title of document 4562 Hints ===== Try to stay within the clairlib paradigm as much as possible. Try to write your code in such a way that it can potentially be added to clairlib. Sample additional features (for bonus points: maximum 10 bonus points) ====================================================================== - handle sophisticated queries - perform vector retrieval - web interface - improve clairlib - consistent use of XML throughout the system - proximity-based reranking - score normalization by document length - score determined by idf Deliverables ============ A single file called yourUNI-install.tar.gz (e.g., abc1234-install.tar.gz) with the following components: - your API code - your modifications to clairlib (if appropriate) - any external libraries needed (if appropriate) - the install, index, and query programs - documentation for each module and/or script included (in perldoc, javadoc, or equivalent format) Unarchiving your *.tar.gz file and running ./install should be all that we need to do to access your runtime code (index and query). Submission ========== We will be sending specific submission instructions seperately. Important ========= If we cannot run the code as specified above, we will not grade it. We will return it to you and penalize you by 10 points for each time we have to go back to you for a new version. Late Policy: This assignment is due at 6PM on 10/04. Each day (or fraction of a day) that your homework is late, your grade will decrease by 10 points. Final notes =========== We will spend time on Thursday in class to discuss the assignment and clarify some ambiguities. It would be great if you could do the pre-assignment and read the actual assignment before class.