This homework is designed to give you
experience using simple finite-state transducers to
normalize date and time expressions.
Your goal is to be able to identify date and time expressions in online text and then to convert these into canonical form.
We will provide some sample files containing dates and times. Your program should be able to handle all these examples at a minimum. Extra credit will be given for identifying other ways of saying dates and times and for handling those in your program.
You should create 2 programs:
a) One should identify all date and time expressions and bracket them as follows:
b) The second program should take this bracketed output as input and should transduce the date and time expressions into the following canonical forms:
Date: 02-06-1997 is February 6, 1997, the 6th of February, 1997, etc.
Time: 02:06:00 is 2:06 a.m.
Your programs should pass along all other text in the input, preserving all formatting (e.g. line breaks, paragraph breaks, capitalization, punctuation, and white space) and transforming only the time and date expressions. For example....
Body: <body of message in plain ascii, preserving
dates: anything you can find on a calendar
times: anything you can point to on a clock
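The canonical forms above (02-06-1997 for a date, 02:06:00 for a time) can be illustrated with a minimal sketch. It uses Python regular expressions as a stand-in for a hand-built finite-state transducer, and it covers only the surface forms quoted in this handout. The names normalize_date and normalize_time are hypothetical, and the bracket handling is left out because the bracket format is not shown here.

```python
import re

# Month names mapped to their numbers (1-12).
MONTHS = {name: number for number, name in enumerate(
    ["january", "february", "march", "april", "may", "june", "july",
     "august", "september", "october", "november", "december"], start=1)}

# "February 6, 1997"
MONTH_FIRST = re.compile(
    r"^([A-Za-z]+)\s+(\d{1,2})(?:st|nd|rd|th)?,?\s+(\d{4})$")
# "the 6th of February, 1997"
DAY_FIRST = re.compile(
    r"^(?:the\s+)?(\d{1,2})(?:st|nd|rd|th)?\s+of\s+([A-Za-z]+),?\s+(\d{4})$",
    re.IGNORECASE)
# "2:06 a.m.", "2:06:30 pm"
CLOCK = re.compile(
    r"^(\d{1,2}):(\d{2})(?::(\d{2}))?\s*([ap])\.?m\.?$", re.IGNORECASE)


def normalize_date(text):
    """Return MM-DD-YYYY for a recognized date expression, else None."""
    text = text.strip()
    m = MONTH_FIRST.match(text)
    if m:
        month, day, year = m.group(1), m.group(2), m.group(3)
    else:
        m = DAY_FIRST.match(text)
        if not m:
            return None
        day, month, year = m.group(1), m.group(2), m.group(3)
    number = MONTHS.get(month.lower())
    if number is None:
        return None
    return f"{number:02d}-{int(day):02d}-{year}"


def normalize_time(text):
    """Return HH:MM:SS (24-hour) for a recognized 12-hour time, else None."""
    m = CLOCK.match(text.strip())
    if not m:
        return None
    hour, minute, second = int(m.group(1)), m.group(2), m.group(3) or "00"
    if not 1 <= hour <= 12:
        return None
    hour %= 12                      # 12 a.m. is hour 0; 12 p.m. is hour 12
    if m.group(4).lower() == "p":
        hour += 12
    return f"{hour:02d}:{minute}:{second}"
```

With these definitions, normalize_date("the 6th of February, 1997") gives "02-06-1997" and normalize_time("2:06 a.m.") gives "02:06:00". Anything the patterns do not recognize returns None, so a caller can pass the original text through unchanged.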
4. (35 pts) Annotate the .txt files produced in (3) as follows:
a) Use the (corrected if necessary) time and date delimiter program
you wrote for Homework I to identify and label all absolute and
deictic dates in the body of your messages. This time, label the
times of day as <TIME> 3:47 a.m. </TIME> and dates as <DATE> Tuesday,
June 1st </DATE> separately. A guiding principle for determining a
time or date is whether you can specify it on a clock or a calendar;
e.g. I can look at my watch and tell when 'now' is; I can look at a
calendar and tell when 'next year' is. If you can't tell whether
something is a time or a date (e.g. early Tuesday morning), label it all
as a date. If you run into a tricky example, ask. Include your (corrected or uncorrected) program in your submission.
b) Hand-correct your time and date delimiter's output so that your
final version of <prefix>N.txt correctly delimits all times and
dates in the corpus. NB: The better your delimiter program works,
the less hand labeling you will need to do...
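The labeling described in 4(a) can be sketched as a single regular-expression pass that wraps each match in <TIME> or <DATE> tags and leaves all other text untouched. This is an illustration only, not the Homework I delimiter: the pattern names and the tag function are hypothetical, and the patterns cover just a few absolute and deictic forms (no mixed cases such as "early Tuesday morning").

```python
import re

WEEKDAY = r"(?:Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day"
MONTH = (r"(?:January|February|March|April|May|June|July|August|"
         r"September|October|November|December)")

# Absolute dates such as "Tuesday, June 1st" or "June 1, 2002",
# plus a handful of deictic dates such as "tomorrow" or "next year".
DATE_RE = (rf"\b(?:{WEEKDAY},\s+)?{MONTH}\s+\d{{1,2}}(?:st|nd|rd|th)?"
           rf"(?:,\s+\d{{4}})?"
           rf"|\b(?:today|tomorrow|yesterday|next\s+(?:week|month|year))\b")

# Clock times such as "3:47", "3:47 a.m." or "15:30:00".
TIME_RE = r"\b\d{1,2}:\d{2}(?::\d{2})?(?:\s*[ap]\.m\.)?"

# DATE and TIME are the only capturing groups, so m.lastgroup
# names whichever alternative matched.
TAGGER = re.compile(rf"(?P<DATE>{DATE_RE})|(?P<TIME>{TIME_RE})")


def tag(text):
    """Wrap each date or time in its tag; pass all other text through."""
    def wrap(m):
        label = m.lastgroup
        return f"<{label}> {m.group(0)} </{label}>"
    return TAGGER.sub(wrap, text)


print(tag("Call at 3:47 a.m. on Tuesday, June 1st."))
```

Because re.sub copies unmatched text verbatim, line breaks, punctuation, and white space outside the tagged spans are preserved.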
5. (35 pts) Classify each of the messages in your corpus by hand as follows: In
3 separate ascii files with format specified below, rate each message
in your corpus from 1 to 3, where 1 is 'not at all', 2 is 'sort of'
and 3 is 'definitely' along three dimensions:
a) To what extent would you say this message is spam? (fall02-N.spam)
b) To what extent is this message personal? (fall02-N.pers)
c) To what extent did you consider this message 'urgent' when
you received it? (i.e., something you would have wanted to read
immediately or which required immediate action) (fall02-N.urg)
For each rating, create an ascii file with a 100x2 matrix (one row per message)
containing the msg id (e.g. fall02-9999-001) in the first column and the
message rating (1, 2 or 3) in the second, separated by a space.
You will thus produce 3 files, <prefix>.spam, <prefix>.pers, and
<prefix>.urg. Cf. fall02-9999.spam, fall02-9999.pers,
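Before submitting, it may help to check that each rating file follows the format above: one message per line, with the msg id and a rating of 1, 2 or 3 separated by a single space. The following is a minimal sketch of such a check; parse_ratings is a hypothetical helper, not part of any provided tooling.

```python
import re

# One row of a rating file: msg id, one space, rating 1-3.
ROW = re.compile(r"(\S+) ([123])")


def parse_ratings(text):
    """Parse rating-file contents into {msg_id: rating}.

    Raises ValueError on a malformed row or a repeated msg id.
    """
    ratings = {}
    for number, line in enumerate(text.splitlines(), start=1):
        m = ROW.fullmatch(line)
        if m is None:
            raise ValueError(
                f"line {number}: expected '<msg id> <1|2|3>', got {line!r}")
        msg_id, rating = m.group(1), int(m.group(2))
        if msg_id in ratings:
            raise ValueError(f"line {number}: duplicate msg id {msg_id}")
        ratings[msg_id] = rating
    return ratings


sample = "fall02-9999-001 3\nfall02-9999-002 1\n"
print(parse_ratings(sample))
```

To check a real file, pass its contents, for example parse_ratings(open("fall02-9999.spam").read()), and compare the number of entries against the number of messages in your corpus.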
6. (15 pts) In your README file, describe any difficulties you had in deciding how
to anonymize (2), label (4) or classify (5) the data.
7. Place your README file, the programs you used to anonymize messages, produce canonical format, and automatically delimit times and dates, and all of your .msg, .txt, .spam, .pers, and .urg files in a single directory for submission. Follow the submission guidelines for Homework 1 to submit Homework 2. All programs must run on a CS cluster machine under Unix.