NEW: Please note that you are expected to divide the message data yourself into training and test sets (about 80% training / 20% test is a good proportion).
This homework is designed to give you experience doing corpus-based research. You will collect a corpus of email messages, anonymize their email addresses, put them into a canonical form, classify them in several ways, extract some features from them, and apply rule-based and machine-learning techniques to the features you extract to perform several kinds of automatic classification. Your result will be a set of email filters that work more or less well. An example of a similar study for voicemail can be found in Hirschberg & Ringel, CHI 2001.
The homework will be due in two stages. Stage I involves collection and preparation of the data for analysis. This will be due on 14 November. All collected messages will be combined for use by the whole class in Stage II, which will involve corpus analysis; a larger corpus will permit more interesting analyses and, hopefully, produce better results. For this reason, it is essential that you follow the specifications for corpus collection and preparation described below and pay careful attention to the format of the sample files. Your classmates will be depending upon you to produce high quality, correct data.
Stage II: Corpus Analysis. Due 13 December.
Once you have prepared and submitted the corpus described in Stage I, start work on Stage II. The class corpus collected in Stage I will be ready for you to use by 19 November, but you should start preparing the necessary scripts to perform the ngram and other feature extraction, using your own corpus, as soon as you complete Stage I of the homework. The corpus may be found in "/proj/nlpusers/ani/emailCorpus" in the subdirs "raw", "canonic", and "class", which contain the .msg, .txt, and classification files, respectively.
1. Prepare unigram, bigram and trigram statistics on both your own email and the class's corpus.

a) Compare the ngram information you have collected for your email to the larger class corpus using any metric that will bring out the similarities and differences between the two (e.g. you might compare the number of each class of ngram in each corpus as an indication of vocabulary size, the top 25 ngrams in each as an indication of language differences, or any other measure you think useful).

b) Now, using only the larger corpus, do similar comparisons for msgs whose 'personal' rating is 3 vs. the rest; for msgs whose 'spam' rating is 3 vs. the rest; and for msgs whose 'urgent' rating is 3 vs. the rest.
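A minimal sketch of the ngram counting in (1), in Python. Tokenization and message reading are omitted, and the function names and toy corpora are our own, not part of the assignment:

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count all n-grams (as tuples) in one tokenized message."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_ngrams(messages, n):
    """Aggregate n-gram counts over a list of tokenized messages."""
    total = Counter()
    for tokens in messages:
        total.update(ngram_counts(tokens, n))
    return total

def top_k(counts, k=25):
    """The k most frequent n-grams, for side-by-side corpus comparison."""
    return counts.most_common(k)

# Toy example: compare two tiny "corpora" for n = 1, 2, 3.
my_msgs = [["please", "send", "the", "report"], ["send", "it", "today"]]
class_msgs = [["meeting", "at", "noon"], ["please", "reply", "today"]]

for n in (1, 2, 3):
    mine = corpus_ngrams(my_msgs, n)
    theirs = corpus_ngrams(class_msgs, n)
    # Number of distinct n-grams is a rough indication of vocabulary size.
    print(n, len(mine), len(theirs), top_k(mine, 3))
```

The same counting functions can be reused for the subset comparisons in (1b) by passing in only the msgs with a given rating.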
2. Again using the larger class corpus, devise a set of features that might help you classify msgs as 'urgent' or not, 'personal' or not, 'spam' or not. (Note that different features may be more or less effective for different classification tasks.) You might want to start with the ngrams you've analyzed in (II-1) as features and also include features based on the times and dates delimited in the corpus (I-4). You may also want to identify features particular to certain header fields. For some ideas of other features you might try, look at Hirschberg & Ringel 2001.
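As an illustration of the kind of feature extraction (2) asks for, a few hand-picked features could be computed like this. The field names ('subject', 'body'), the regular expressions, and the feature set itself are all assumptions for the sketch, not part of the assignment spec:

```python
import re

def extract_features(msg):
    """A few illustrative features for urgent/personal/spam classification.
    msg is assumed to be a dict with 'subject' and 'body' strings."""
    body = msg["body"].lower()
    subject = msg["subject"]
    return {
        "has_exclamation": "!" in subject or "!" in body,   # possible urgency/spam cue
        "subject_all_caps": subject.isupper(),              # possible spam cue
        "mentions_money": bool(re.search(r"\$\d", body)),   # possible spam cue
        "has_date_expr": bool(re.search(r"\b(today|tomorrow|deadline)\b", body)),
        "first_person": bool(re.search(r"\b(i|we|me)\b", body)),  # possible personal cue
        "body_length": len(body.split()),
    }

msg = {"subject": "URGENT: report due", "body": "Please send it today!"}
print(extract_features(msg))
```

In practice you would combine such hand-built features with the ngram features from (II-1) and the delimited time/date expressions from (I-4).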
3. Use the machine learning program Ripper to predict each of the three classes of labels: spam, personal, and urgent. Ripper is described here, along with papers by its creator, William Cohen, and sample Ripper input and output files for you to look at.
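If Ripper expects C4.5-style .names/.data input files, a small script could write them from your extracted feature vectors. The layout below is an assumption — check the sample Ripper input files mentioned above for the exact syntax before relying on it:

```python
import os
import tempfile

def write_ripper_files(stem, rows, attr_names, classes):
    """Write stem.names and stem.data for a rule learner.

    ASSUMED layout: class labels on the first line of .names, one
    'attribute: type.' line per feature, and comma-separated .data rows
    ending with the class label. Verify against the sample Ripper files.
    """
    with open(stem + ".names", "w") as f:
        f.write(", ".join(classes) + ".\n")
        for name in attr_names:
            f.write(name + ": continuous.\n")   # all features numeric here
    with open(stem + ".data", "w") as f:
        for feats, label in rows:
            f.write(", ".join(str(v) for v in feats) + ", " + label + ".\n")

# Tiny demo: two msgs, two numeric features, spam vs. nonspam.
stem = os.path.join(tempfile.mkdtemp(), "spamtask")
write_ripper_files(stem,
                   [([1, 12], "spam"), ([0, 40], "nonspam")],
                   ["mentions_money", "body_length"],
                   ["spam", "nonspam"])
print(open(stem + ".names").read())
```

You would run one such file pair per classification task (spam, personal, urgent), training on your ~80% split and evaluating Ripper's rules on the held-out 20%.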