This homework is designed to give you experience doing corpus-based
research. You will collect a corpus of email messages, anonymize
their email addresses, put them into a canonical form, classify them
in several ways, extract some features from them, and apply rule-based
and machine learning techniques to those features to perform several
kinds of automatic classification. Your result will be some email
filters that work more or less well. An example of a similar study for
voicemail can be found in Hirschberg & Ringel, CHI 2001.
The homework will be due in two stages. Stage I involves collection
and preparation of the data for analysis. This will be due on 14
November. All collected messages will be combined for use by the
whole class in Stage II, which will involve corpus analysis; a larger
corpus will permit more interesting analyses and, hopefully, produce
better results. For this reason, it is essential that you follow the
specifications for corpus collection and preparation described below
and pay careful attention to the format of the sample files. Your
classmates will be depending upon you to produce high-quality, correct data.
Stage I: Corpus collection, clean-up and annotation. 150 pts. Due 14 November.
1. (35 pts) Collect a corpus of 100 email messages in English either from your
own incoming and/or outgoing email or someone else's donated email.
This corpus should contain no more than 25% spam and no more than
25% broadcast messages (e.g. talk announcements). You will be
given a unique id to use in numbering the messages. Ids will be of
the form "fall02-N", where each of you will be assigned a unique N.
Each message should be placed in a separate file, named
"fall02-N-M.msg", where M is a number you will assign to each
individual message, and thus will range between 001 and 100. So,
if you are assigned the id 9999, the first message in your corpus
will be fall02-9999-001.msg and the last will be
fall02-9999-100.msg. This file will contain the original of the
message, without any annotation or labeling, but with all email
addresses anonymized, as in (2) below. Cf. fall02-9999.msg.
Do *not* include any messages in your corpus that might embarrass
you or anyone else if read by others or that refer to anything
illegal. Include a README file in your submission that states that
you are willing to allow your messages to be used for research
purposes and that they do not contain anything that might cause
others embarrassment or harm.
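The naming scheme and the composition limits in (1) can be checked mechanically. The following is only a sketch: the function names and the "spam" / "broadcast" / "other" labels are illustrative assumptions, not part of the assignment.

```python
def message_filename(student_id, message_number, ext="msg"):
    """Build a name such as fall02-9999-001.msg (M is zero-padded to 3 digits)."""
    if not 1 <= message_number <= 100:
        raise ValueError("message number must be between 1 and 100")
    return f"fall02-{student_id}-{message_number:03d}.{ext}"


def composition_ok(kinds):
    """Check the 25% limits, given one hypothetical label per message.

    `kinds` is a list of strings such as "spam", "broadcast" or "other";
    these labels are an assumption made for this sketch.
    """
    limit = 0.25 * len(kinds)
    return kinds.count("spam") <= limit and kinds.count("broadcast") <= limit
```

For id 9999, message_filename(9999, 1) gives the first file name and message_filename(9999, 100) the last.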
2. (15 pts) Write a program to anonymize all email addresses in these messages,
translating the username for each address into a corresponding
anonymous alias. You should preserve translation correspondence
across your corpus; i.e., firstname.lastname@example.org should always be
translated the same in all messages in which this address appears,
e.g. as email@example.com. You may use any correspondence you
like to translate these addresses, but please preserve at least the
final 3-letter suffix (e.g. .edu, .gov,...). fall02-9999-001.msg,
e.g., should contain only these anonymized email addresses. (If
you wish to also anonymize proper names, make sure the result is
also a proper name.) Include your anonymizer program in your submission.
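One possible shape for the anonymizer in (2) is sketched below. It assumes that only the username is replaced and the domain is kept as-is (which also preserves the final suffix), that addresses are compared case-insensitively, and that a simple regular expression is good enough to find them. The class name, the alias format "userNNNN" and the pattern are all illustrative choices, not requirements.

```python
import re

# Deliberately simple address pattern; real headers can hold forms it misses.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


class Anonymizer:
    """Maps each distinct address to one stable alias (hypothetical helper)."""

    def __init__(self):
        self.aliases = {}  # lower-cased real address -> alias

    def alias_for(self, address):
        key = address.lower()
        if key not in self.aliases:
            domain = key.split("@", 1)[1]
            # Assumption: only the username is replaced; keeping the domain
            # also keeps the final suffix (.edu, .gov, ...).
            self.aliases[key] = f"user{len(self.aliases) + 1:04d}@{domain}"
        return self.aliases[key]

    def anonymize(self, text):
        """Replace every address found in `text` with its alias."""
        return EMAIL_RE.sub(lambda m: self.alias_for(m.group(0)), text)
```

Using one Anonymizer object for the whole corpus is what keeps the correspondence consistent: the same address gets the same alias in every message it appears in.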
3. (15 pts) Write a script to transform all message files fall02-N-M.msg (e.g. fall02-9999-001.msg) into a canonical form by creating an ascii file fall02-N-M.txt (e.g. fall02-9999-001.txt) in the following format:
Date: <day, time and date information as it appears in dateline>
From: <all names and email addresses as they appear in fromline>
To: <all names and email addresses as they appear in toline>
cc: <all names and email addresses as they appear in cc line>
Subject: <subject line information>
Body: <body of message in plain ascii, preserving capitalization,
punctuation, line breaks and paragraphing>. (NB: Everything that
follows the keyword 'Body:' here should be plain ascii. You should
remove all non-ascii attachments.) Cf. fall02-9999.txt. Include your script in your submission.
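The conversion in (3) could be sketched with Python's standard email package, as below. The sketch assumes the .msg file holds a standard Internet mail message, keeps only text/plain parts, silently drops non-ascii characters, and joins folded header lines into one line. The function name and these choices are assumptions; compare your own output against the sample file rather than against this sketch.

```python
import re
from email import message_from_string

# Field labels in the order the canonical format lists them.
FIELDS = ["Date", "From", "To", "cc", "Subject"]


def canonicalize(raw):
    """Return the canonical text for one raw message (illustrative sketch)."""
    msg = message_from_string(raw)
    texts = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # a container, not content
        if part.get_content_type() == "text/plain":
            payload = part.get_payload(decode=True) or b""
            # Assumption: characters outside ascii are dropped, not replaced.
            texts.append(payload.decode("ascii", errors="ignore"))
    lines = []
    for name in FIELDS:
        # Header lookup is case-insensitive; a missing header becomes "".
        value = re.sub(r"\r?\n[ \t]+", " ", msg.get(name, ""))
        lines.append(f"{name}: {value}")
    lines.append("Body: " + "\n".join(texts))
    return "\n".join(lines)
```

Parts of any other type (HTML alternatives, images, binary attachments) are skipped, which is one way of leaving only plain ascii after 'Body:'.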
4. (35 pts) Annotate the .txt files produced in (3) as follows:
a) Use the (corrected if necessary) time and date delimiter program
you wrote for Homework I to identify and label all absolute and
deictic dates in the body of your messages. This time, label
times of day as <TIME> 3:47 a.m. </TIME> and dates as <DATE> Tuesday,
June 1st </DATE>, separately. A guiding principle for deciding whether
something is a time or a date is whether you can specify it on a clock
or a calendar; e.g. I can look at my watch and tell when 'now' is; I can look at a
calendar and tell when 'next year' is. If you can't tell whether
something is a time or a date (e.g. early Tuesday morning), label it all
as a date. If you run into a tricky example, ask. Include your (corrected or uncorrected) program in your submission.
b) Hand-correct your time and date delimiter's output so that your
final version of <prefix>N.txt correctly delimits all times and
dates in the corpus. NB: The better your delimiter program works,
the less hand labeling you will need to do...
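To illustrate the label format in (4), here is a very small regular-expression sketch. It is not the Homework I program: the patterns cover only a few clock times, month-and-day dates, weekday names and a handful of deictic expressions, and every pattern and name in it is an assumption. It does not handle ambiguous cases such as 'early Tuesday morning', which still need the all-date rule and hand correction.

```python
import re

WEEKDAY = r"(?:Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day"
MONTH = (r"(?:January|February|March|April|May|June|July|August|"
         r"September|October|November|December)")

# Clock times such as 3:47, 3:47 a.m. or 3:47 pm.
TIME_RE = re.compile(
    r"\b\d{1,2}:\d{2}(?:\s*(?:[ap]\.m\.|[ap]m(?![A-Za-z])))?", re.I)

# A few absolute and deictic dates; far from complete.
DATE_RE = re.compile(
    rf"\b(?:(?:{WEEKDAY},?\s+)?{MONTH}\s+\d{{1,2}}(?:st|nd|rd|th)?"
    rf"|{WEEKDAY}|today|tomorrow|yesterday"
    rf"|next\s+(?:week|month|year))\b", re.I)


def label(body):
    """Wrap matched times and dates in <TIME> and <DATE> labels."""
    body = TIME_RE.sub(lambda m: f"<TIME> {m.group(0)} </TIME>", body)
    body = DATE_RE.sub(lambda m: f"<DATE> {m.group(0)} </DATE>", body)
    return body
```

The function is meant to be applied to the message body only, so that the Date: line of the canonical file is left as it is.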
5. (35 pts) Classify each of the messages in your corpus by hand as follows: In
3 separate ascii files with format specified below, rate each message
in your corpus from 1 to 3, where 1 is 'not at all', 2 is 'sort of'
and 3 is 'definitely' along three dimensions:
a) To what extent would you say this message is spam? (fall02-N.spam)
b) To what extent is this message personal? (fall02-N.pers)
c) To what extent did you consider this message 'urgent' when
you received it? (i.e., something you would have wanted to read
immediately or which required immediate action) (fall02-N.urg)
For each rating, create an ascii file with a 100x2 matrix (one row per
message) containing the msg id (e.g. fall02-9999-001) in the first column and the
message rating (1, 2 or 3) in the second, separated by a space.
You will thus produce 3 files, <prefix>.spam, <prefix>.pers, and
<prefix>.urg. Cf. fall02-9999.spam, fall02-9999.pers, and fall02-9999.urg.
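The rating files in (5) are simple enough to generate from a list of hand-assigned ratings. The sketch below assumes the ratings are held in a Python list in message order; the function name is illustrative, and the sample files remain the reference for the exact format.

```python
def format_ratings(student_id, ratings):
    """Return the text of one rating file: 'msg-id rating' on each line.

    `ratings` is assumed to hold 100 integers (1, 2 or 3), where
    ratings[0] belongs to message 001 and ratings[99] to message 100.
    """
    if len(ratings) != 100:
        raise ValueError("expected exactly 100 ratings")
    if any(r not in (1, 2, 3) for r in ratings):
        raise ValueError("each rating must be 1, 2 or 3")
    rows = [f"fall02-{student_id}-{m:03d} {r}"
            for m, r in enumerate(ratings, start=1)]
    return "\n".join(rows) + "\n"
```

The returned text can then be written to fall02-N.spam, fall02-N.pers or fall02-N.urg, one call per dimension.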
6. (15 pts) In your README file, describe any difficulties you had in deciding how
to anonymize (2), label (4) or classify (5) the data.
7. Place your README file; the programs you used to anonymize messages, produce the canonical format, and automatically delimit times and dates; and all of your .msg, .txt, .spam, .pers, and .urg files in a single directory for submission. Follow the submission guidelines for Homework 1 to submit Homework 2. All programs must run on a CS cluster machine under Unix.