Homework 3: Machine Learning
Due: Dec 4 2003
Points: 133

For this homework assignment, you will perform some machine learning
experiments on the TDT data you worked with in homework 2. You are expected
to use 'ripper', a program available on the cluster machines for inducing
classification rules from a set of preclassified examples. The ripper
executable is located here:

    /proj/nlp/tools/ML/ripper2/ripper

Here are some resources to help you get acquainted with ripper:

1. '/proj/nlp/users/cs4705/ripper_howto.txt' is a step-by-step guide to
   running ripper, which may be a bit easier to follow at first than the
   man page.
2. To view the man page, type: 'man /proj/nlp/tools/ML/ripper.man'.
3. '/proj/nlp/users/cs4705/ml95-paper.ps' shows ripper being used by its
   creator in some classification experiments of his own.
4. For some sample data, you may want to check out these directories:
       /proj/nlp/users/cs4705/pdata/
       /proj/nlp/users/cs4705/tdata/

========================================================================
OBJECTIVE

Run a classification experiment to automatically learn rules that predict
the topic of a document in the TDT corpus. (In ripper terminology, a 'topic'
is synonymous with a 'class'; the two terms will be used interchangeably.)
You may extract whatever features you choose. You will be working with these
two files:

    /proj/nlp/users/cs4705/train.sgml
    /proj/nlp/users/cs4705/test.sgml

You will be training on train.sgml and testing on test.sgml. Once you have
obtained the best performance you can, supply a write-up describing the
feature set you used, the error rate on test.sgml, and the ripper .hyp file.
Describe any surprises you encountered in the process.

You must also submit one program that takes a training file and a testing
file in .sgml format and produces the three necessary ripper files: .names,
.data, and .test.
This is necessary because we will be testing how well your feature set
performs on a validation test set, which you will not have access to.

========================================================================
FURTHER GUIDELINES

Each document in the TDT corpus has one of the following classes (topics):

    Conspiracy
    Crime and criminals
    Finance, Public
    Outer space--Exploration
    Strikes and lockouts

For ease of processing and compatibility between homeworks, when reporting
your results and rule sets, apply the following mapping:

    Conspiracy               => CONSPIRACY
    Crime and criminals      => CRIME
    Finance, Public          => FINANCE
    Outer space--Exploration => SPACE
    Strikes and lockouts     => STRIKES
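
The mapping above can be applied with a small lookup table. The following is
only a sketch in Python, which the assignment does not require. The names
TOPIC_TO_CLASS and normalize_topic are illustrative, and the code assumes the
topic strings appear in the corpus exactly as listed above, apart from stray
whitespace or line breaks.

```python
# Lookup table copied from the mapping in this handout.
# TOPIC_TO_CLASS and normalize_topic are hypothetical names, not part of ripper.
TOPIC_TO_CLASS = {
    "Conspiracy": "CONSPIRACY",
    "Crime and criminals": "CRIME",
    "Finance, Public": "FINANCE",
    "Outer space--Exploration": "SPACE",
    "Strikes and lockouts": "STRIKES",
}


def normalize_topic(raw_topic: str) -> str:
    """Return the short class label for a topic string read from the corpus."""
    # Collapse runs of whitespace (including newlines) to single spaces.
    # Assumption: the corpus spells each topic exactly as in the table above.
    key = " ".join(raw_topic.split())
    try:
        return TOPIC_TO_CLASS[key]
    except KeyError:
        # Fail loudly so an unexpected topic is noticed before training.
        raise ValueError(f"unknown topic: {raw_topic!r}") from None


if __name__ == "__main__":
    print(normalize_topic("Outer space--Exploration"))
```

Raising an error on an unknown topic, rather than passing the string through,
keeps a misspelled or unexpected class from silently ending up in the files
you hand to ripper.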