6998 Section 3, NLP for the Web

Spring 2008

 


Assignments

There will be three types of assignments in this class: presentation of readings, discussant, and semester project. Students are also expected to read all assigned papers and participate in the in-class discussion of the papers. The weight of these assignments towards the final grade is as follows:

Class Presentation

You will be assigned one or more of the papers for one class. Your job is to prepare a 10 minute overview of the paper. You should provide a framework for the approach taken in the paper, highlighting the main points and contributions. You should point out any claims or results that you find controversial. If there is a technical point of the paper that you think will be difficult to understand, you might select this point as the main part of your presentation and spend time explaining how you think it works.

Since the class is discussion oriented, everyone will give their presentation seated as part of the circle. I would prefer that people not use PowerPoint. If you feel you need something written to help you make your points or to help in your explanation of a technical point, you may have a one-page handout for the class. However, if you feel strongly that you will do much better using PowerPoint, let me know and that will be an option. We will evaluate as a class whether the style of presentation is working. Your presentation will be strictly timed in order to allow enough time for discussion. Expect to be cut off if you go over time.

Discussant

As discussant, your job is to raise questions about the papers for discussion. In order to have an interesting discussion, think about higher-level issues that the class papers raise, rather than detail-oriented questions. For example, you may think about controversial claims, pros and cons of the approach, points of agreement or overlap between the different papers, directions for future research, or implications for the field and whether it is on the right track.

Questions must be prepared and distributed to the class by email 24 hours ahead of time (i.e., by 4:10pm on Wednesday, the day before). The two discussants should discuss among themselves how they want to start off the discussion. Leading the discussion should be shared in an integrated way; that is, the discussants should not simply take turns, with one going first and the other following. You will get plenty of help from me in directing the discussion.

Interim Results

In your proposal you have suggested the modules that you will finish by March 13th. In addition, you have received feedback on whether your suggestions were reasonable and/or whether you should turn in additional results. Now, create a web page for your contributions to this course. Submit the URL of your web page in courseworks (in a text file). The web page does not have to be anything fancy; it should contain, for example:

Your code (for the modules that you have completed) as a zip file (it should compile), plus a readme describing how to run the modules and giving examples of input and output.

Your primary results and a short write-up of what you have done so far.

A link to the corpora you have downloaded/preprocessed (if there is one)

Further clarification on your remaining plans.

Anything else you think will be helpful (figures, charts,...)

 

Final Project

For the final submission you will need to do the following

·        Class presentation: This should be a 10-minute presentation. It will be strictly timed, as conference presentations are timed. You will be given a 5-minute warning, a 2-minute warning, and a 1-minute warning. Going over time will cause you to lose points. The presentation should provide an overview of your goals, your results (which might be charts showing accuracy or examples of output), and a demo.

·        Face-to-face grading session: You need to arrange a time to meet with Fadi. Those who present on May 1st should meet with Fadi on May 2nd. Those who present on May 15th will meet with him on May 16th. The face-to-face session will focus more on the implementation. You should be prepared to:

§         Run the project on 5-10 examples

·        At least 3 new ones

§         Review the components you implemented

·        Including showing the code

§         Describe the components that you used from elsewhere and why you chose them

§         Explain the data you used and why

·        5 page write-up that includes

§         Overview of Project (What was it in the end?)

§         Components (Make this the 2nd section of the paper and format as itemized list)

§         Results

§         Conclusions

§          Summary of how feasible

§          Future directions

§         Anything else that you think is important

·        IMPORTANT DATES

§         Presentations: May 1st (4:10-end) and May 15th (4pm to 7pm)

§         Face-to-face sessions: May 2nd and May 16th

§         Write-up: Due by May 12th

 

Projects

You may design your own project or you may choose one of the suggested projects below. In either case, you should discuss your project with both Prof. McKeown and the TA, Fadi Biadsy, before submitting your proposal. Some possible projects include:

 

Fluency detection:

A question answering or summarization system over multilingual and multimodal data needs to be able to deal with noisy data. We will assume that the system includes both an MT component and an ASR (Automatic Speech Recognition) component. Output from these components will be errorful, sometimes so errorful that the resulting sentence is incomprehensible. One way to deal with these sentences is to skip them entirely when building a response or a summary. But how do we know which sentences to skip? This project will involve development of a component to detect disfluent sentences. You might include a variable threshold, set by the user, that says what level of disfluency will be allowed. More specifically, given a list of English, MT and ASR sentences, the system should retain only the sentences that are understandable. You might try the following approaches in some combination: use a language model to rank more highly those sentences which best match the model, test whether the sentences parse, use a search engine to search the web for sentences that have similar constituents, and/or train class-based language models instead of a raw language model, where the classes might be the named entities returned by a named entity tagger.
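As a starting point for the language model approach, here is a minimal sketch of scoring sentences and filtering them by a user-set threshold. It assumes a bigram model with add-one smoothing and whitespace tokenization, which are illustrative choices only; the function names (train_bigram_lm, fluency_score, keep_fluent) are hypothetical, and a real system would train on a much larger corpus and probably use a stronger model.

```python
import math
from collections import Counter


def train_bigram_lm(sentences):
    """Count unigrams and bigrams over whitespace-tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def fluency_score(sentence, unigrams, bigrams):
    """Average add-one-smoothed bigram log-probability (higher = more fluent)."""
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    vocab_size = len(unigrams) + 1  # one extra slot for unseen words
    total = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        prob = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size)
        total += math.log(prob)
    # Normalize by the number of transitions so long sentences are not penalized.
    return total / (len(tokens) - 1)


def keep_fluent(sentences, unigrams, bigrams, threshold):
    """Retain only sentences whose score reaches the user-set threshold."""
    return [s for s in sentences
            if fluency_score(s, unigrams, bigrams) >= threshold]
```

The threshold here plays the role of the user-adjustable disfluency level: raising it discards more sentences, lowering it keeps more.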

 

Sentence Simplification:

In the class on single document summarization, we will look at methods to generate abstracts instead of extracts. One way to generate abstracts would be to first simplify all complex sentences in the input, creating multiple shorter grammatical sentences for each sentence in the input. Then the summarizer could select only a portion of each sentence in the input document, filtering out pieces of the sentence that are irrelevant or not salient. In this project, you will develop a machine learning system that will use a corpus of parsed sentences to determine how to simplify a sentence: removing conjunctions ("and", "or"), generating separate sentences for prepositional phrases, etc. Alternatively, you might implement a GUI that has a check box for each possible component (relative clauses, appositives, ...) and allows the user to determine which components should be removed from the sentence.

For example, given the following sentence:

Abbas, the Palestinian president, visited China, which will host the Olympics, last week.

In the GUI, remove:

  • [x] relative clauses
  • [x] appositives

Output: Abbas visited China last week.

Alternatively, instead of a GUI, you can return all possible grammatical sentences that can be extracted from the input sentence (2^n sentences in total for n removable components, but n is typically small).

For example:

·         Abbas, the Palestinian president, visited China, which will host the Olympics, last week.

·         Abbas, the Palestinian president, visited China last week.

·         Abbas visited China, which will host the Olympics, last week.

·         Abbas visited China last week.
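The enumeration of all 2^n variants can be sketched as follows. This is only an illustration: it assumes the removable components (here the appositive and the relative clause) have already been identified, for instance from a parse, and it represents the sentence as a hand-built list of (text, optional) segments. The names simplifications and EXAMPLE are hypothetical.

```python
from itertools import product

# Hand-segmented example; in a real system the segments would come from a parser.
# Each pair is (text, optional); optional segments carry their own commas.
EXAMPLE = [
    ("Abbas", False),
    (", the Palestinian president,", True),   # appositive
    (" visited China", False),
    (", which will host the Olympics,", True),  # relative clause
    (" last week.", False),
]


def simplifications(segments):
    """Return every variant that keeps all required segments
    and any subset of the optional ones (2^n variants for n optional)."""
    optional_idx = [i for i, (_, optional) in enumerate(segments) if optional]
    variants = []
    for keep_flags in product([True, False], repeat=len(optional_idx)):
        keep = dict(zip(optional_idx, keep_flags))
        parts = [text for i, (text, _) in enumerate(segments)
                 if keep.get(i, True)]  # required segments are always kept
        variants.append("".join(parts))
    return variants


if __name__ == "__main__":
    for variant in simplifications(EXAMPLE):
        print(variant)
```

With two optional segments this produces the four sentences listed above. Whether every variant is actually grammatical still has to be checked; the sketch only enumerates candidates.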

If you are interested in this project, you will interact with Dr. Owen Rambow of CCLS.

 

Multi-label classification of topic and attitude of user-written Web posts

We want to analyze automatically how people write reviews on the Web (reviews of products or restaurants, for instance). While reviews are very helpful to Web users, it can be overwhelming for someone to have to go through dozens of reviews to decide what the consensus is. Our long-term goal is to design algorithms to better summarize and search user-written reviews. Our first step towards this goal is to add structure to this genre of texts. We focus on reviews of restaurants. We have already collected a large corpus of restaurant reviews from the website newyork.citysearch.com and have pre-processed it. We want to design a tool that predicts, for each sentence in a review, the topic of the sentence (is it about food, price, or atmosphere, for instance) and the attitude of the reviewer (negative, neutral, positive). We have already annotated a good-sized set of sentences in our corpus. The project will focus on selecting features and machine learning approaches to solve the problem of topic and attitude prediction.

If you choose this project, you will interact with Prof. Noemie Elhadad of the Department of Biomedical Informatics.
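To make the prediction task concrete, here is a minimal baseline sketch: one bag-of-words Naive Bayes classifier for topic and a second one for attitude. Naive Bayes is just one possible starting point, not the prescribed method, and the class name NaiveBayes is hypothetical. The training sentences in the usage note are invented toy data, not drawn from the annotated corpus.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes over whitespace tokens, add-one smoothing."""

    def __init__(self):
        self.label_counts = Counter()
        self.word_counts = defaultdict(Counter)
        self.vocab = set()

    def train(self, examples):
        """examples: iterable of (sentence, label) pairs."""
        for text, label in examples:
            self.label_counts[label] += 1
            for word in text.lower().split():
                self.word_counts[label][word] += 1
                self.vocab.add(word)

    def predict(self, text):
        """Return the label with the highest log-posterior for the sentence."""
        total = sum(self.label_counts.values())
        best_label, best_logprob = None, -math.inf
        for label, count in self.label_counts.items():
            logprob = math.log(count / total)  # class prior
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for word in text.lower().split():
                logprob += math.log((self.word_counts[label][word] + 1) / denom)
            if logprob > best_logprob:
                best_label, best_logprob = label, logprob
        return best_label
```

Usage: train one instance on (sentence, topic) pairs such as ("the pasta was delicious", "food") and another on (sentence, attitude) pairs such as ("the pasta was delicious", "positive"), then call predict on each new sentence with both. Richer features (n-grams, sentiment lexicons) and other learners would replace this baseline in the actual project.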

 

Analysis of the Content of Web Posts for Patients

This project deals with web posts written in forums or mailing lists on websites for patients with a specific disease. One reason these media are successful with patients is that they are a venue for exchanging tips and also for getting emotional support about their condition. The goal of this project is to create tools to help analyze the content of these posts. One interesting research question is how to characterize the information exchanged in these venues: are there specific characteristics to the language of these posts? To what extent is the language emotional or, on the contrary, purely focused on objective information? Keywords: text classification, lexicon of emotions, forum posts, health consumer informatics.

If you choose this project, you will interact with Prof. Noemie Elhadad of the Department of Biomedical Informatics.

 

Alignment of Simple and Complex Medical Texts

This project focuses on articles from Wikipedia and "Simple English Wikipedia" in the health domain. Both web sites are targeted at health consumers, but the language and the content of Simple English Wikipedia articles are supposed to be simpler (hence the name...). There are two parts to this project: (1) We want to build a corpus of pairs of articles specific to diseases. This is a challenge in itself, as we want to pair the articles automatically so we can get a large number of pairs. (2) Once the corpus is built, we want to align segments of text from the pairs based on how much content they share. This project is useful because it enables us to study how information can be conveyed with different levels of complexity. Keywords: paraphrasing, text similarity metrics, alignment methods, Wikipedia.

If you choose this project, you will interact with Prof. Noemie Elhadad of the Department of Biomedical Informatics.
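For the alignment step, a very simple baseline can be sketched as follows: score each pair of segments by word overlap and link each simple segment to its best-matching complex segment. Jaccard similarity, greedy best-match alignment, and the min_score cutoff of 0.2 are all illustrative assumptions; the function names are hypothetical, and the project may call for better similarity metrics and alignment methods.

```python
def jaccard(a, b):
    """Word-overlap similarity between two text segments (0.0 to 1.0)."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def align_segments(simple_segments, complex_segments, min_score=0.2):
    """For each simple segment, pick the most similar complex segment.

    Returns (simple_index, complex_index, score) triples for pairs whose
    score reaches min_score; segments with no good match are left unaligned.
    """
    alignments = []
    for i, simple in enumerate(simple_segments):
        scored = [(jaccard(simple, cplx), j)
                  for j, cplx in enumerate(complex_segments)]
        if not scored:
            continue
        best_score, best_j = max(scored)
        if best_score >= min_score:
            alignments.append((i, best_j, best_score))
    return alignments
```

Plain word overlap will miss paraphrases that share little vocabulary, which is exactly where the simple and complex versions are expected to differ, so this is a baseline to improve on rather than a solution.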

 

Machine Translation --- This project would be done under the supervision of Dr. Nizar Habash of CCLS (see http://www.ccls.columbia.edu/cadim).

Machine translation (MT) is an area of research focusing on automatic translation of text in one human language (such as Arabic, Spanish or Chinese) into another (e.g. English). Statistical MT is an approach to MT that learns translation models from large parallel text corpora. Examples of parallel corpora include UN documents, European parliament documents, and newswire produced in multiple languages. The translation models are used by MT systems (called "decoders") to convert the source-language text to the target language. Some of the ideas we are interested in include, among others, the following:

  • Morphological preprocessing to translate from languages with complex morphology
  • Syntactic preprocessing to model word reordering
  • Building specialized modules for translation of names of humans and locations between languages using different scripts (such as Arabic and English)

The CADIM group's primary languages of interest are Arabic, Arabic dialects and English. However, we are also interested in languages written in Arabic script (currently especially Urdu) and other Semitic languages (currently especially Hebrew). The language pairs we are considering for the MT project are Arabic-English, English-Arabic, Urdu-English, and Hebrew-Arabic. A specific project will be determined after meeting with the student and depending on his/her abilities.

 

CLiMB -- Computational Linguistics for Metadata Building --- This project would be done under the supervision of  Becky Passonneau, becky [at] cs.columbia.edu

Digital image collections are increasing in number and size at an enormous rate, including collections associated with museums, libraries (New York Public Library; Getty Library), or online collections like ARTstor. CLiMB is a collaborative project (with University of Maryland) to develop automatic methods for extracting metadata from scholarly texts, in order to index digital art collections with subject matter descriptions. The Columbia component involves classifying sentences from art history survey texts into semantic categories pertaining to their discourse function. These include describing the image, providing biographical background about the artist, interpreting the art historical significance of the work, and so on. We are working with an analog to an ARTstor image collection and two art history survey texts. We are investigating automated methods to assign semantic scores to words from extracted sentences based on their closeness to relevant semantic domains, such as color, anatomy, and so on. To compute semantic distance in these domains, we will compare electronically available ontologies and lexicons such as WordNet and the Getty Art and Architecture Thesaurus. The project tasks will include developing subroutines to query these resources, developing evaluation suites to test the resulting scores, and integrating the scores into feature sets for machine learning.

Desirable experience/skills include familiarity with one or more NLP tools or resources for language analysis (taggers, parsers, WordNet); familiarity with the Weka datamining toolset; familiarity with Python.