Homework 2 (NLP, Fall 2004)

Building Named Entity Tagger Using Machine Learning Techniques

First Report Due 11/12/04:

Final Homework Due 12/1/04:

Points: 266

More help available here


For this homework, you will run a machine learning experiment to find dates and names of people in a subset of the TDT-2 (Topic Detection and Tracking) corpus of transcribed news broadcasts. You will be given:

  • a training set from TDT-2 /proj/nlpusers/cs4705_fall04/train_data
  • a development test set from TDT-2 /proj/nlpusers/cs4705_fall04/test_data
  • the labeling manual specifying the labeling conventions and format (note, however, that the data may differ slightly from the manual in places, unintentionally introducing a bit of noise)

    The training set is a set of text files formatted in SGML. Dates and person names in the training set are bracketed by <DATE></DATE> and <PERSON></PERSON>, respectively. You will also notice that other proper names, including locations and company names, are bracketed by <ORGANIZATION></ORGANIZATION>. In the training corpus, it is thus simple to distinguish between these capitalized "entities". In the development test corpus, however, and in the held-out test corpus on which you will be evaluated, all entity bracketing has been removed. Therefore, you will have to distinguish the dates and person names you are looking for from other named entities, and cannot assume (unless you separately recognize these yourself) that any other named entity tags are present. Note also that what is considered to be a date or a person name is in effect defined by the bracketing in your training and devtest corpora; this may differ from definitions you have used previously (e.g. in Homework I).
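    To make the format concrete, here is a minimal sketch of reading training-style lines, given in Python for illustration (the course materials suggest Perl, but the idea is language-independent). The tag names are exactly those described above; the sample sentence is invented.

```python
import re

# Matches one bracketed entity; the backreference \1 requires the
# closing tag to match the opening tag (DATE, PERSON, or ORGANIZATION).
TAG_RE = re.compile(r"<(DATE|PERSON|ORGANIZATION)>(.*?)</\1>")

def extract_entities(line):
    """Return a list of (tag, text) pairs found in a training-style line."""
    return [(m.group(1), m.group(2)) for m in TAG_RE.finditer(line)]

def strip_brackets(line):
    """Remove all entity brackets, yielding devtest-style plain text."""
    return TAG_RE.sub(lambda m: m.group(2), line)

line = ("On <DATE>Monday</DATE>, <PERSON>John Smith</PERSON> of "
        "<ORGANIZATION>CNN</ORGANIZATION> reported.")
print(extract_entities(line))
# [('DATE', 'Monday'), ('PERSON', 'John Smith'), ('ORGANIZATION', 'CNN')]
print(strip_brackets(line))
# On Monday, John Smith of CNN reported.
```

    Stripping the brackets from the training data in this way also gives you unlabeled text on which to test your tagger, in the same format as the devtest files.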

    Your goal is to build named entity taggers to tag (insert brackets <foo></foo>) around all the dates and person names in the held out test set. You must train your taggers on the TDT-2 training corpus and you may test your results yourself on the development test set. You will use the devtest set to develop a set of features which best permits the automatic tagging of the corpus.

    You will use the YALE ("Yet Another Learning Environment") machine learning environment, available at yale.cs.uni-dortmund.de, to develop your named entity bracketer. If the link doesn't work, you can search for "Yet Another Learning Environment" on the web, or you can download yale-2.3.2-bin.tar.gz directly. After you build your entity taggers, you must test them on the development set.

    In REPORT.txt, you must answer the following questions about each tagger:

    1. What are the Precision, Recall, F-Measure, and Classification Accuracy on the training set and the development set?
    2. Report the above results with at least 2 machine learning algorithms (you may use the same or different ones for each tagger).
    3. Briefly describe the machine learning algorithms you chose. (5 to 10 sentences only)
    4. Did you obtain different results when you tested on the training set versus the development test set? Why?
    5. Describe the features you used in your training.
    6. Discuss why you think the best-performing features performed well and why other features did not, and which additional features you would try if you continued this work.
    7. Describe any surprises you encountered in the process (e.g. features you thought would be good predictors but weren't, issues that came up in data preprocessing, and so on).
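    For question 1, recall the standard definitions. The sketch below (Python, for illustration) computes the three scores from entity-level counts; the counts in the usage example are invented.

```python
def precision_recall_f(tp, fp, fn):
    """Precision, recall, and balanced F-measure from entity-level
    true positive, false positive, and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f

# e.g. 80 correctly bracketed dates, 20 spurious, 20 missed:
p, r, f = precision_recall_f(80, 20, 20)  # precision = recall = F = 0.8
```

    Note that these are computed over bracketed entities, whereas classification accuracy is typically computed per token or per decision, so the two kinds of numbers are not directly comparable.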
    The top 3 performing taggers for each part of the homework (date taggers and person name taggers) turned in for the class will win prizes for their authors.


    Start Early!! Since this might seem like a large project at first, you should plan how to divide the homework into parts. Thinking of this homework as text processing, feature extraction, machine learning, testing, and result interpretation may be a good way to divide it. First play around with the machine learning tool YALE. Then think of the features you are going to use, and process the text accordingly to extract them. Store the features in the right data format, load it into the ML tool, and run the experiments.
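    As an illustration of the feature-extraction step, the sketch below (Python, for illustration) builds a few per-token features. The feature names here are invented examples, not a prescribed set, and YALE expects its own attribute file formats, so you would still need to serialize these appropriately.

```python
def token_features(tokens, i):
    """Illustrative per-token features for entity tagging.
    The feature set here is an invented example, not a requirement."""
    tok = tokens[i]
    return {
        "word": tok.lower(),
        "is_capitalized": tok[:1].isupper(),
        "is_all_digits": tok.isdigit(),
        "prev_word": tokens[i - 1].lower() if i > 0 else "<s>",
        "next_word": tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>",
    }

tokens = "John Smith arrived on Monday".split()
features = token_features(tokens, 0)  # features for "John"
```

    Context features like prev_word and next_word matter here because capitalization alone cannot separate person names from organizations once the training brackets are removed.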

    You may want to look at some of the Perl libraries for text processing at http://www1.cs.columbia.edu/~smaskey/cs4705/hw2/perl_lib/. You may also use any text processing programs (e.g. for p.o.s. tagging, parsing, noun-phrase chunking) that you wish, but you must document your use properly by saying what you used, who authored it, and where you obtained it.

    What to hand in:

    First Report Due 11/12/04:

    In this report, you must provide preliminary results for your DATE tagger, including the following files:
    your_program_file (e.g. prog-date.pl)
    run.sh for your tagger (e.g. run-date.sh)
    REPORT.txt for your tagger (e.g. REPORT-DATE.txt)
    README for the tagger (e.g. README-DATE)

    your_program_file should contain the program for your date tagger.

    run.sh is the same kind of file you created for each of the questions for Homework I. It is a shell script that should take a test file in the same format as the devtest file (i.e. without brackets of any kind), feed it to your programs, and do the required processing. A sample run-date.sh is shown below:

    #!/bin/sh
    # Example: if your program file is prog-date.pl and you use
    # 'perl prog-date.pl filename' to run it, then run-date.sh
    # should contain the following line:
    perl prog-date.pl $1

    REPORT.txt (e.g. REPORT-DATE.txt) should contain answers to the 7 questions listed above.

    README should briefly describe what you did, what additional code or other resources you used, and how to run your program.

    How to submit this report and the final homework:

                            $ tar cvf - . | compress | uuencode temp_file | Mail -s "submit cs4705 hw2-first" smaskey@cs.columbia.edu


    After a short time you will get an automatic acknowledgement of your submission. Please note:

    If you submit once, and then decide to submit again, your second submission will overwrite the first. All the files from your first submission will automatically be wiped out.

    Final Homework Due 12/1/04:

    Now you must turn in the same files, but for both your DATE and PERSON taggers, i.e.:

    your_program_files (e.g. prog-date.pl, prog-person.pl)
    run.sh for each tagger (e.g. run-date.sh, run-person.sh)
    REPORT.txt for each tagger (e.g. REPORT-DATE.txt, REPORT-PERSON.txt)
    README for each tagger (e.g. README-DATE, README-PERSON)

    NOTE: Be sure to submit this final homework as 'hw2-final' using the submission command described above, i.e., for the final submission:

    $ tar cvf - . | compress | uuencode temp_file | Mail -s "submit cs4705 hw2-final" smaskey@cs.columbia.edu