More help is available here.
For this homework, you will run a machine learning experiment to find
dates and names of people in a subset of the TDT-2 (Topic Detection and
Tracking) corpus of transcribed news broadcasts. You will be given:
The training set is a set of text files formatted in
SGML. Dates and person names in the training set are bracketed by <DATE></DATE> and
<PERSON></PERSON>, respectively. You
will also notice that other proper names, including locations and
company names, are bracketed by <ORGANIZATION></ORGANIZATION>.
In the training corpus, it is
thus simple to distinguish between these capitalized "entities". In
the development test corpus, however, and in a held-out test corpus
which you will be tested on, all entity bracketing has been removed.
Therefore, you will have to distinguish the Dates and Person Names you are
looking for from other named entities, and cannot assume (unless you
separately recognize these yourself) that any other named entity tags
are present. Note also that what is considered to be a date or a person
name is in effect defined by the bracketing in your training and devtest
corpora; this may be different from previous definitions you have used (e.g. in
Homework I).
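As a rough illustration of how the bracketed training text could be turned into labeled examples, the sketch below reads the <DATE>, <PERSON> and <ORGANIZATION> tags and emits one (token, label) pair per token. It assumes the tags are never nested and appear exactly as written above, and it uses plain whitespace tokenization; the function name label_tokens and the "O" label for non-entity tokens are illustrative choices, not part of the assignment.

```python
import re

# Matches <DATE>...</DATE>, <PERSON>...</PERSON> or <ORGANIZATION>...</ORGANIZATION>.
# Assumption: tags are not nested and are spelled exactly like this.
TAG_RE = re.compile(r"<(DATE|PERSON|ORGANIZATION)>(.*?)</\1>", re.DOTALL)


def label_tokens(sgml_text, targets=("DATE", "PERSON")):
    """Return (token, label) pairs; label is a target tag name or "O"."""
    pairs = []
    pos = 0
    for match in TAG_RE.finditer(sgml_text):
        # Text before the bracketed span lies outside any entity.
        pairs.extend((tok, "O") for tok in sgml_text[pos:match.start()].split())
        tag = match.group(1)
        # Entities we are not asked to find (e.g. ORGANIZATION) count as "O".
        label = tag if tag in targets else "O"
        pairs.extend((tok, label) for tok in match.group(2).split())
        pos = match.end()
    # Remaining text after the last bracketed span.
    pairs.extend((tok, "O") for tok in sgml_text[pos:].split())
    return pairs
```

Treating ORGANIZATION tokens as "O" mirrors the test condition, where only dates and person names are to be recovered; keeping them as a separate class during training is another option you may want to try.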
Your goal is to build named entity taggers to tag (insert
brackets <foo></foo>) around all the dates and person names in
the held out test set. You must train your taggers on the TDT-2
training corpus and you may test your results yourself on the
development test set. You will use the devtest set to develop a set
of features which best permits the automatic tagging of the
corpus.
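Once a tagger has predicted a label for every token, the brackets still have to be written back into the text. The following is a minimal sketch of that step, assuming whitespace-separated tokens and one predicted label per token ("DATE", "PERSON" or "O"); the function name bracket and the label scheme are hypothetical.

```python
def bracket(tokens, labels):
    """Wrap each maximal run of identically labeled non-"O" tokens in tags."""
    out = []
    i = 0
    while i < len(tokens):
        label = labels[i]
        if label == "O":
            # Token outside any entity: copy it through unchanged.
            out.append(tokens[i])
            i += 1
            continue
        # Extend the run while the label stays the same.
        j = i
        while j < len(tokens) and labels[j] == label:
            j += 1
        out.append("<%s>%s</%s>" % (label, " ".join(tokens[i:j]), label))
        i = j
    return " ".join(out)
```

Note that with this simple scheme two adjacent entities of the same type are merged into one bracketed span; a begin/inside labeling would be needed to keep them apart.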
You will use the YALE machine learning environment, available at yale.cs.uni-dortmund.de, to develop your named entity bracketer. If the link doesn't work, you can search for "Yet Another Learning Environment" on the web, or you can download yale-2.3.2-bin.tar.gz.
After you build your entity taggers you must test them on the development set.
In a REPORT.txt you must report/answer the following questions
about each tagger:
Start early! Since this may seem like a large project at first, you should plan how to divide the homework into parts. One good way to divide it is into text processing, feature extraction, machine learning, testing, and result interpretation. First, play around with the machine learning tool YALE. Then decide which features you are going to use and process the text accordingly to extract them. Store the features in the right data format, load them into the ML tool, and run your experiments.
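To make the feature extraction step concrete, here is a small sketch that computes a few per-token features and writes them out as comma-separated rows, one token per row. The feature set (capitalization, digits, month names, previous token) is only an example, and CSV is an assumption: check the YALE documentation for the input formats it actually accepts. All function and column names here are hypothetical.

```python
import csv
import io

MONTHS = {
    "january", "february", "march", "april", "may", "june", "july",
    "august", "september", "october", "november", "december",
}

FIELDS = ["token", "is_capitalized", "is_digit", "is_month", "prev_token", "label"]


def token_features(tokens, i):
    """Build a small feature dictionary for the token at position i."""
    tok = tokens[i]
    # "<s>" marks the start of the sequence when there is no previous token.
    prev_tok = tokens[i - 1] if i > 0 else "<s>"
    return {
        "token": tok.lower(),
        "is_capitalized": int(tok[:1].isupper()),
        "is_digit": int(tok.isdigit()),
        "is_month": int(tok.lower() in MONTHS),
        "prev_token": prev_tok.lower(),
    }


def write_csv(labeled_tokens, out):
    """Write one feature row per (token, label) pair to the file object out."""
    tokens = [tok for tok, _ in labeled_tokens]
    writer = csv.DictWriter(out, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for i, (_, label) in enumerate(labeled_tokens):
        row = token_features(tokens, i)
        row["label"] = label
        writer.writerow(row)
```

For example, calling write_csv on a list of (token, label) pairs with an open file (or an io.StringIO buffer) produces a header line followed by one row per token, which can then be converted to whatever format the learning tool requires.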
You may want to look at some of the Perl libraries for text processing
at http://www1.cs.columbia.edu/~smaskey/cs4705/hw2/perl_lib/.
You may also use any text processing programs (e.g. for
p.o.s. tagging, parsing, noun-phrase chunking) that you wish, but you
must document your use properly by saying what you used, who authored it,
and where you obtained it.
your_program_file should contain the programs for your tagger
(e.g. your date tagger).
run.sh is the same kind of file you created for each of the
questions for Homework I. It is a shell script that should take
a test file in the same format as the devtest file (i.e. without
brackets of any kind), feed it to your programs, and do the required
processing. A sample run-date.sh is shown here
REPORT.txt (e.g. REPORT-DATE.txt) should contain answers to
the 7 questions listed above.
README should briefly describe what you did, what additional
code or other resources you used, and how to run your program.
The top 3 performing taggers for each part of the homework (date
taggers and person name taggers) turned in for the class will win prizes for their authors.
Hints/Tips
What to hand in:
First Report Due 11/12/04:
In this report, you must provide
preliminary results for your DATE tagger, including the following files:
your_program_file (e.g. prog-date.pl)
run.sh for your tagger (e.g. run-date.sh)
REPORT.txt for your tagger (e.g. REPORT-DATE.txt)
README for the tagger (e.g. README-DATE)
--------------------------
#!/bin/sh
How to submit this report and the final homework:
After a short time you will get an automatic acknowledgement of your submission. Please note:
Now you must turn in the same files, but for both your DATE and PERSON taggers, i.e.:
your_program_files (e.g. prog-date.pl, prog-person.pl)
run.sh for each tagger (e.g. run-date.sh, run-person.sh)
REPORT.txt for each tagger (e.g. REPORT-DATE.txt, REPORT-PERSON.txt)
README for each tagger (e.g. README-DATE, README-PERSON)
NOTE: Be sure to submit this final homework as 'hw2-final' in the
submission script described above.
or, for the final submission, e.g.
$ tar cvf - . | compress | uuencode temp_file | Mail -s "submit cs4705 hw2-final" smaskey@cs.columbia.edu