Natural Language Processing Group
Department of Computer Science - Columbia University


Ads for Student Research Projects in NLP


ARA -- Automated Readers Advisor (Summer 2008)

Description:

We are collaborating with the Andrew Heiskell Talking Book and Braille Library (part of the New York Public Library) and City University of New York on a project to design an automated dialogue system that handles simple library transactions for the library's 15,000 patrons. For the same reasons that qualify them as library users, most patrons cannot conveniently travel to the library and visually browse the collection, so most transactions are currently handled by phone. In the summer of 2006, we collected just under 200 recorded calls made to a subset of human Readers Advisors (librarians and other staff), and have transcribed a large portion of them. We are currently implementing an initial baseline dialogue system using the Olympus/Ravenclaw tools from Carnegie Mellon University. One or more student projects are available within this project; they would involve analysis of the transcribed human-human dialogues, testing and enhancing our initial dialogue system, or a combination of the two.

Requirements:

Some exposure to NLP; reliable, attentive to detail, and highly motivated; responds to challenges eagerly and creatively.

Useful skills:

Experience with annotating corpora, or using annotated corpora; familiarity with C++, Perl, psql/MySQL or other databases.

Suitable for:

junior, senior, graduate

Contact:

Becky Passonneau (becky [at_cs])

Replace [at_cs] with "@cs.columbia.edu"



CLiMB -- Computational Linguistics for Metadata Building. (Summer 2008)

Description:

Digital image collections are increasing in number and size at an enormous rate, including collections associated with museums, libraries (New York Public Library; Getty Library), or online collections like ARTstor. CLiMB is a collaborative project (with University of Maryland) to develop automatic methods for extracting metadata from scholarly texts, in order to index digital art collections with subject matter descriptions. The Columbia component involves classifying sentences from art history survey texts into semantic categories pertaining to their discourse function. Functional classes include describing the image, providing biographical background about the artist, interpreting the art historical significance of the work, and so on. We are working with an analog to an ARTstor image collection and two art history survey texts. We are investigating automated methods to assign semantic scores to words from extracted sentences based on their closeness to relevant semantic domains, such as color, anatomy, and so on. To compute semantic distance in these domains, we will compare electronically available ontologies and lexicons such as WordNet, and the Getty Art and Architecture Thesaurus. The project tasks will include developing subroutines to query these resources, developing evaluation suites to test the resulting scores, and integrating the scores into feature sets for machine learning.
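
As a rough illustration of scoring a word by its closeness to a semantic domain, the sketch below uses a path-based similarity over a small hypernym hierarchy. The hierarchy, the function names, and the scoring formula are assumptions made for this example; they are not taken from WordNet, the Getty Art and Architecture Thesaurus, or the project's own code.

```python
# Toy hypernym links (child -> parent), invented for illustration only.
HYPERNYMS = {
    "crimson": "red",
    "scarlet": "red",
    "red": "color",
    "blue": "color",
    "torso": "body part",
    "hand": "body part",
    "body part": "anatomy",
}


def ancestors(term):
    """Return the chain from a term up to its root, starting with the term."""
    chain = [term]
    while chain[-1] in HYPERNYMS:
        chain.append(HYPERNYMS[chain[-1]])
    return chain


def path_similarity(a, b):
    """1 / (1 + edges on the path through the lowest common ancestor); 0 if none."""
    chain_a, chain_b = ancestors(a), ancestors(b)
    for steps_a, node in enumerate(chain_a):
        if node in chain_b:
            return 1.0 / (1 + steps_a + chain_b.index(node))
    return 0.0


# Score a few words against two hypothetical domains.
for word in ["crimson", "torso"]:
    for domain in ["color", "anatomy"]:
        print(word, domain, round(path_similarity(word, domain), 3))
```

Scores like these could serve as one feature per domain in a feature set for machine learning; a real resource would allow multiple parents per term, which this single-parent sketch does not handle.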

Requirements:

Desirable experience/skills include familiarity with one or more NLP tools or resources for language analysis (taggers, parsers, WordNet); familiarity with the Weka data mining toolset; familiarity with Python.

Suitable for:

junior, senior, graduate

Contact:

Becky Passonneau (becky [at_cs])

Replace [at_cs] with "@cs.columbia.edu"



Processing Trouble Tickets; Con Edison Secondary Events (two projects) (Summer 2008)

Description:

The Secondary Events project at the Center for Computational Learning Systems (CCLS) works with data from Con Edison's secondary distribution network. The two learning tasks we address are to predict problematic events in the network before they happen, and to rank the vulnerability of structures in this network to such events. We have devoted considerable effort to assembling a consolidated database from disparate sources after cleaning, extending and joining data collected over the past ten years. The effort has paid off in initial success in our two learning problems for data from Manhattan, using small models.

We now turn to investigating whether we can derive a larger set of features from a free-text field of trouble ticket data. Two summer positions are available on this phase.

Project 1: Text Engineering Applied to Remarks Fields of Trouble Tickets

Relational databases often have free-text fields, but extracting meaningful semantic content from free text presents serious challenges. The trouble ticket remarks fields are especially challenging because the text is highly domain-specific, with many types of domain-specific expressions that are essentially Named Entities (proper nouns). As a result, existing Named Entity (NE) recognizers cannot be applied to this text. We are importing the remarks into GATE (General Architecture for Text Engineering) in order to develop a standoff annotation where we encode the domain-specific classes of NEs. GATE stores the annotations in a relational database to facilitate complex queries over the text. The student on this project will assist in porting our existing patterns for Information Extraction of structure types and numbers (ids for manholes, service boxes, vaults) and other domain-specific NEs. More importantly, the student will help develop our meta-language for representing the content of ticket remarks.
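
To give a flavor of pattern-based extraction of structure types and numbers, here is a minimal Python sketch using regular expressions. The abbreviations (MH, SB, V), the ID format, and the sample remark are invented for illustration; the actual ticket conventions and the project's GATE patterns may look quite different.

```python
import re

# Hypothetical abbreviations for the structure types named above; real
# trouble tickets may use different spellings and ID formats.
STRUCTURE_TYPES = {
    "MH": "manhole",
    "SB": "service box",
    "V": "vault",
}

# A structure mention: an abbreviation, optional separators, then digits.
STRUCTURE_RE = re.compile(r"\b(MH|SB|V)[\s\-#]*(\d+)\b", re.IGNORECASE)


def extract_structures(remark):
    """Return (structure type, id) pairs found in a free-text remark."""
    return [
        (STRUCTURE_TYPES[abbrev.upper()], number)
        for abbrev, number in STRUCTURE_RE.findall(remark)
    ]


# Invented remark text, not taken from real ticket data.
sample = "SMOKE FROM MH-1234 ALSO CHECK SB 5678 AND V#90"
print(extract_structures(sample))
```

A standoff annotation would additionally record the character offsets of each match (available from `re.finditer`) instead of returning only the extracted values.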

Project 2: Text Normalization and Spelling Correction

The free-text remarks field presents serious challenges for feature derivation due to the high noise content, which is compounded by the highly domain-specific vocabulary. For example, the size of the unigram vocabulary (individual strings composed of alphabetic, numeric, punctuation or mixed characters) is approximately 75K, of which only 8K (~ 11%) are alphabetic sequences that match American English dictionary entries. The remaining "word" types consist of numeric or mixed-character strings, domain-specific words or abbreviations, or misspellings. Longer words attract more misspellings: the word "barricade" and its other forms ("barricades", "barricaded") have approximately 50 variant spellings. The student on this project will assist in normalizing the vocabulary. This will involve a range of methods, including pattern matching within the GATE framework (see Project 1), using the GATE pattern language; development of special-purpose edit-distance routines; and testing and/or adaptation of existing algorithms such as Double Metaphone.
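
As one possible starting point for an edit-distance routine, the sketch below computes plain Levenshtein distance and maps a token to its nearest entry in a small lexicon. The lexicon, the misspellings, and the distance threshold are assumptions chosen for the example; a special-purpose routine for this data would likely weight edits differently, and phonetic methods such as Double Metaphone are not shown.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def normalize(token, lexicon, max_distance=2):
    """Map a token to the closest lexicon entry within max_distance, else keep it."""
    best = min(lexicon, key=lambda word: edit_distance(token, word))
    return best if edit_distance(token, best) <= max_distance else token


# Hypothetical lexicon and invented misspellings, for illustration only.
lexicon = ["barricade", "barricades", "barricaded"]
for variant in ["baricade", "barracade", "barricadde", "cable"]:
    print(variant, "->", normalize(variant, lexicon))
```

The threshold keeps unrelated tokens such as "cable" from being forced onto a lexicon entry; with a 75K-type vocabulary, comparing every token against every entry would be slow, so candidates would normally be pre-filtered first.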

Requirements:

Reliable, attentive to detail, and highly motivated; responds to challenges eagerly and creatively.

Desirable skills (any mix):

Experience with regular expressions and Unix/Linux scripting, Java, relational databases (especially PostgreSQL), Python, some NLP.

Suitable for:

junior, senior, graduate

Contact:

Becky Passonneau (becky [at_cs])

Replace [at_cs] with "@cs.columbia.edu"



Topics in Machine Translation (Spring 2008)

Description:

Machine translation (MT) is an area of research focusing on automatic translation of text in one human language (such as Arabic, Spanish or Chinese) into another (e.g. English).  Statistical MT is an approach to MT that learns translation models from large parallel text corpora. Examples of parallel corpora include UN documents, European parliament documents, and newswire produced in multiple languages.  The translation models are used by MT systems (called "decoders") to convert the source-language text to the target language.   Some of the ideas we are interested in include, among others, the following:

  • Morphological preprocessing to translate from languages with complex morphology
  • Syntactic preprocessing to model word reordering
  • Building specialized modules for translation of names of humans and locations between languages using different scripts (such as Arabic and English)

The CADIM group's primary languages of interest are Arabic, Arabic dialects and English. However, we are also interested in languages written in Arabic script (currently Urdu in particular) and other Semitic languages (currently Hebrew in particular). The language pairs we are considering for the MT project are Arabic-English, English-Arabic, Urdu-English, and Hebrew-Arabic. A specific project will be determined after meeting with the student, depending on his/her abilities.
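
To make the idea of learning a translation model from parallel text concrete, here is a minimal sketch of word-translation probability estimation in the style of IBM Model 1, trained with EM. The three-sentence German-English corpus is invented, the NULL source word is omitted for brevity, and nothing here reflects the group's actual systems.

```python
from collections import defaultdict

# Tiny invented parallel corpus (source, target); real systems train on
# very large collections of sentence pairs.
corpus = [
    ("das haus", "the house"),
    ("das buch", "the book"),
    ("ein buch", "a book"),
]
pairs = [(src.split(), tgt.split()) for src, tgt in corpus]


def train_model1(pairs, iterations=10):
    """Estimate t(target word | source word) with EM (no NULL word)."""
    target_vocab = {t for _, tgt in pairs for t in tgt}
    t_prob = defaultdict(lambda: 1.0 / len(target_vocab))  # uniform start
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        # E-step: distribute each target word over the source words.
        for src, tgt in pairs:
            for t in tgt:
                norm = sum(t_prob[(t, s)] for s in src)
                for s in src:
                    fraction = t_prob[(t, s)] / norm
                    count[(t, s)] += fraction
                    total[s] += fraction
        # M-step: renormalize expected counts per source word.
        t_prob = defaultdict(float, {
            (t, s): c / total[s] for (t, s), c in count.items()
        })
    return t_prob


t_prob = train_model1(pairs)
for (t, s), p in sorted(t_prob.items(), key=lambda item: -item[1])[:4]:
    print(f"t({t} | {s}) = {p:.2f}")
```

Even on this toy data the model learns that "das" pairs with "the" because the two co-occur across sentences, which is the basic signal a decoder later exploits.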

Research Group:

Columbia Arabic Dialect Modeling Group (CADIM)
Center for Computational Learning Systems (CCLS)
http://www.ccls.columbia.edu/cadim.html

Requirements:

Knowledge of Arabic, Hebrew or Urdu is not required, but it may be very useful and will make the experience more exciting. Exposure to machine learning and/or linguistics is preferred but not necessary.

Suitable for:

junior, senior, graduate

Contact:

Dr. Nizar Habash (habash [at_cs])

Replace [at_cs] with "@cs.columbia.edu"



Grammar extraction from treebanks (Summer 2008)

Description:

The job is to write code that extracts various types of grammars from treebanks.
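
As a hedged illustration of what such code might look like in its simplest form, the sketch below reads context-free productions off bracketed trees and counts them. The two-tree "treebank" is invented, and a plain CFG is only one of the grammar types the project might target; the function names are made up for this example.

```python
from collections import Counter


def parse_tree(text):
    """Parse a bracketed tree like "(S (NP (DT the) ...))" into nested tuples.

    A node is (label, children); a leaf word is a plain string.
    """
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    position = 0

    def read_node():
        nonlocal position
        position += 1                      # skip "("
        label = tokens[position]
        position += 1
        children = []
        while tokens[position] != ")":
            if tokens[position] == "(":
                children.append(read_node())
            else:
                children.append(tokens[position])
                position += 1
        position += 1                      # skip ")"
        return (label, children)

    return read_node()


def productions(node):
    """Yield (lhs, rhs) context-free rules read off a parsed tree."""
    label, children = node
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    yield (label, rhs)
    for child in children:
        if not isinstance(child, str):
            yield from productions(child)


# Invented two-sentence treebank, for illustration only.
treebank = [
    "(S (NP (DT the) (NN dog)) (VP (VBD barked)))",
    "(S (NP (DT the) (NN cat)) (VP (VBD slept)))",
]
grammar = Counter()
for line in treebank:
    grammar.update(productions(parse_tree(line)))

for (lhs, rhs), count in grammar.most_common():
    print(f"{count}  {lhs} -> {' '.join(rhs)}")
```

Dividing each rule count by the total count for its left-hand side would turn this into a probabilistic CFG.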

Requirements:

Good programming skills, and at least a minimal knowledge of syntax (or at least interest in syntax).

Suitable for:

Junior, Senior, Graduate.

This could be for an advanced undergrad or a beginning graduate student, but other types of candidates will also be considered. The job will not be a GRA-ship (i.e., it will not cover tuition), but will be paid by the hour, or on a part-time basis. Rate commensurate with relevant criteria.

Contact:

Owen Rambow (rambow [at_cs])

Replace [at_cs] with "@cs.columbia.edu"



Grammar extraction from treebanks (Fall 2008)

Description:

The job is to write code that extracts various types of grammars from treebanks.

Requirements:

Good programming skills, and at least a minimal knowledge of syntax (or at least interest in syntax).

Suitable for:

Junior, Senior, Graduate.

This could be for an advanced undergrad or a beginning graduate student, but other types of candidates will also be considered. The job will not be a GRA-ship (i.e., it will not cover tuition), but will be paid by the hour, or on a part-time basis. Rate commensurate with relevant criteria.

Contact:

Owen Rambow (rambow [at_cs])

Replace [at_cs] with "@cs.columbia.edu"


webmaster - agusx[at]xcs.columbia.edu last updated - 05.15.2008