CS 4705 Homework 2

NEW: Please note that you are expected to divide the message data yourself
into training and test sets (about 80% training/20% test is a good proportion).
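
One possible way to make the split is sketched below. It is only an illustration: the message ids, the function name split_messages, and the fixed random seed are assumptions, not part of the assignment. Fixing the seed means you get the same split every time you rerun your experiments.

```python
import random

def split_messages(message_ids, train_fraction=0.8, seed=0):
    """Shuffle message ids reproducibly and split them into (train, test)."""
    ids = sorted(message_ids)            # sort first so input order does not matter
    random.Random(seed).shuffle(ids)     # private generator, fixed seed
    cut = int(round(len(ids) * train_fraction))
    return ids[:cut], ids[cut:]

# Hypothetical message ids; substitute the names of your own message files.
message_ids = ["msg%03d" % i for i in range(10)]
train, test = split_messages(message_ids)
print(len(train), len(test))
```

Split by message rather than by line or by ngram, so that no text from a test message is seen during training.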

This homework is designed to give you experience doing corpus-based
research. You will collect a corpus of email messages, anonymize
their email addresses, put them into a canonical form, classify them
in several ways, extract some features from them, and apply rule-based
and machine learning techniques to those features to perform several
kinds of automatic classification. Your result will be some email
filters that work more
or less well. An example of a similar study for voicemail can be
found in Hirschberg & Ringel, CHI 2001.

The homework will be due in two stages. Stage I involves collection
and preparation of the data for analysis. This will be due on 14
November. All collected messages will be combined for use by the
whole class in Stage II, which will involve corpus analysis; a larger
corpus will permit more interesting analyses and, hopefully, produce
better results. For this reason, it is essential that you follow the
specifications for corpus collection and preparation described below
and pay careful attention to the format of the sample files. Your
classmates will be depending upon you to produce high quality, correct

Stage II: Corpus Analysis. Due 13 December.

Once you have prepared and submitted the corpus described in Stage I,
start work on Stage II. The class corpus collected in Stage I will be
ready for you to use by 19 November, but you should start preparing
the necessary scripts to perform the ngram and other feature
extraction, using your own corpus, as soon as you complete Stage I of
the homework. The corpus may be found in "/proj/nlpusers/ani/emailCorpus"
in the subdirs "raw", "canonic", and "class", which contain the .msg, .txt,
and classification files, respectively.

1. Prepare unigram, bigram and trigram statistics on both your own
email and the class's corpus.

a) Compare the ngram information you have collected for your
email to the larger class corpus using any metric that will
bring out the similarities and differences between the two
(e.g. you might compare the number of each class of ngram in
each corpus as an indication of vocabulary size, the top 25
ngrams in each as an indication of language differences, or any
other measure you think useful).

b) Now, using only the larger corpus, do similar comparisons for
msgs whose 'personal' rating is 3 vs. the rest; for msgs whose
'spam' rating is 3 vs. the rest; whose 'urgent' rating is 3
vs. the rest.
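
A minimal sketch of the ngram counting and of two of the comparisons suggested in (a) follows. It assumes messages are plain strings, lowercases them, and tokenizes on whitespace; all of these are simplifying assumptions, and the function names and toy messages are made up for illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-token tuples in order of occurrence."""
    return list(zip(*(tokens[i:] for i in range(n))))

def count_ngrams(messages, n):
    """Count ngrams over a list of message strings (whitespace tokens)."""
    counts = Counter()
    for text in messages:
        counts.update(ngrams(text.lower().split(), n))
    return counts

def compare(counts_a, counts_b, k=25):
    """Number of distinct ngrams in each corpus and overlap of the top k."""
    top_a = {gram for gram, _ in counts_a.most_common(k)}
    top_b = {gram for gram, _ in counts_b.most_common(k)}
    return {
        "types_a": len(counts_a),
        "types_b": len(counts_b),
        "shared_top_k": len(top_a & top_b),
    }

# Toy data standing in for your own email and the class corpus.
own = ["meeting at noon today", "lunch at noon"]
cls = ["free offer today", "meeting at noon tomorrow", "free offer now"]
for n in (1, 2, 3):
    print(n, compare(count_ngrams(own, n), count_ngrams(cls, n)))
```

For (b), the same compare function can be applied to the two halves of the class corpus, e.g. the msgs whose 'spam' rating is 3 and all the others. How you read the ratings depends on the classification file format and is not shown here.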

2. Again using the larger class corpus, devise a set of features that
might help you classify msgs as 'urgent' or not, 'personal' or not,
'spam' or not. (Note that different features may be more or less
effective for different classification tasks. You might want to
start with the ngrams you've analyzed in (II-1) as features and
also include features based on the times and dates delimited in the
corpus (I-4). You may also want to identify features particular to
certain header fields. For some ideas of other features you might
try, look at Hirschberg & Ringel 2001.)
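
To make the idea of a feature set concrete, here is a small sketch. Everything specific in it is hypothetical: it assumes each msg has already been parsed into a dict with "subject", "body", and "hour" keys, and the particular features and cue words are only examples, not a recommended set.

```python
def extract_features(msg, cue_unigrams):
    """Map one parsed message to a dict of feature name -> value."""
    subject = msg["subject"]
    tokens = msg["body"].lower().split()
    features = {
        "subject_has_exclaim": "!" in subject,       # header-based feature
        "subject_all_caps": subject.isupper(),       # header-based feature
        "body_length": len(tokens),                  # size of the body in tokens
        "sent_off_hours": msg["hour"] < 8 or msg["hour"] >= 18,  # time-based
    }
    # One boolean feature per cue word, e.g. unigrams chosen from step 1.
    for word in cue_unigrams:
        features["has_" + word] = word in tokens
    return features

# Hypothetical parsed message.
msg = {"subject": "FREE OFFER!!!", "body": "Click now for a free offer", "hour": 3}
features = extract_features(msg, ["free", "meeting"])
print(features)
```

Keeping every feature in one dict per message makes it easy to try different subsets of features for the 'urgent', 'personal', and 'spam' tasks.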

3. Use the machine learning program Ripper to predict each of our
three classes of labels: spam, personal, and urgent. Ripper is
available here, along with papers by its creator, William Cohen, and
sample Ripper input and output files for you to look at.