Ads for Student Research Projects in NLP
ARA -- Automated Readers Advisor (Summer 2008)
Description:
We are collaborating with the Andrew Heiskell Talking Book and Braille
Library (part of the New York Public Library) and City University of New
York on a project to design an automated dialogue system for the
library's 15,000 patrons to handle simple library transactions. For the
same reasons that qualify patrons to become library users, most cannot
conveniently travel to the library and visually browse the collection,
so most transactions are currently handled by phone. In the summer of
2006, we collected just under 200 recorded calls made to a subset of
human Readers Advisors (librarians and other staff), and have
transcribed a large portion of them. We are currently implementing an
initial baseline dialog system using the Olympus/Ravenclaw tools from
Carnegie Mellon University. One or more student projects are available
within this project that would involve analysis of the transcribed
human-human dialogs, testing and enhancing our initial dialog system, or
a combination of the two.
Requirements: Some exposure to NLP; reliable, attentive to detail, highly motivated, responds to challenges eagerly and creatively.
Useful skills: Experience with annotating corpora or using annotated corpora; familiarity with C++, Perl, psql/MySQL or other databases.
Suitable for: junior, senior, graduate
Contact: Becky Passonneau (becky [at_cs])
Replace [at_cs] with "@cs.columbia.edu"
CLiMB -- Computational Linguistics for Metadata Building (Summer 2008)
Description:
Digital image collections are increasing in number and size at an
enormous rate, including collections associated with museums,
libraries (New York Public Library; Getty Library), or online
collections like ARTstor. CLiMB is a collaborative project (with
University of Maryland) to develop automatic methods for extracting
metadata from scholarly texts, in order to index digital art
collections with subject matter descriptions. The Columbia component
involves classifying sentences from art history survey texts into
semantic categories pertaining to their discourse function.
Functional classes include describing the image, providing
biographical background about the artist, interpreting the art
historical significance of the work, and so on. We are working with
an analog to an ARTstor image collection and two art history survey
texts. We are investigating automated methods to assign semantic
scores to words from extracted sentences based on their closeness to
relevant semantic domains, such as color, anatomy, and so on. To
compute semantic distance in these domains, we will compare
electronically available ontologies and lexicons such as WordNet and
the Getty Art and Architecture Thesaurus. The project tasks will
include developing subroutines to query these resources, developing
evaluation suites to test the resulting scores, and integrating the
scores into feature sets for machine learning.
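As a rough illustration of what a "closeness to a semantic domain" score could look like, the sketch below uses a simple path-length measure over a tiny hand-built hypernym table. The table, the function names, and the scoring formula are all assumptions made for this example; the project's actual scoring method and its queries to WordNet or the Getty thesaurus are not described in this ad.

```python
# Hypothetical hypernym links (child -> parent). A real system would query
# WordNet or the Getty thesaurus instead of this hand-built table.
PARENT = {
    "crimson": "red", "scarlet": "red", "red": "color",
    "azure": "blue", "blue": "color",
    "torso": "body", "hand": "limb", "limb": "body", "body": "anatomy",
}

def ancestors(word):
    """Return the chain from word up to its root, starting with word itself."""
    chain = [word]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def domain_score(word, domain):
    """Return 1 / (1 + edges from word up to domain), or 0.0 if the
    domain is not the word itself or one of its ancestors."""
    chain = ancestors(word)
    return 1.0 / (1 + chain.index(domain)) if domain in chain else 0.0

# Scores for one word against two candidate domains; such values could
# serve as numeric features for a sentence classifier.
features = {d: domain_score("crimson", d) for d in ("color", "anatomy")}
```

Words closer to the domain node get higher scores, and words outside the domain get zero.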
Requirements:
Desirable experience/skills include familiarity with one or more NLP
tools or resources for language analysis (taggers, parsers, WordNet);
familiarity with the Weka data mining toolset; familiarity with Python.
Suitable for: junior, senior, graduate
Contact: Becky Passonneau (becky [at_cs])
Replace [at_cs] with "@cs.columbia.edu"
Processing Trouble Tickets; Con Edison Secondary Events (two projects) (Summer 2008)
Description:
The Secondary Events project at the Center for Computational Learning
Systems (CCLS) works with data from Con Edison's secondary
distribution network. The two learning tasks we address are to predict
problematic events in the network before they happen, and to rank the
vulnerability of structures in this network to such events. We have
devoted considerable effort to assembling a consolidated database from
disparate sources after cleaning, extending and joining data collected
over the past ten years. The effort has paid off in initial success
in our two learning problems for data from Manhattan, using small
models.
We now turn to investigating whether we can derive a larger set of
features from a free-text field of trouble ticket data. Two summer
positions are available on this phase.
Project 1: Text Engineering Applied to Remarks Fields of Trouble Tickets
Relational databases often have free-text fields, but extracting
meaningful semantic content from free text presents serious
challenges. The trouble ticket remarks fields are especially
challenging because the text is highly domain-specific, with many
types of domain-specific expressions that are essentially Named
Entities (proper nouns). As a result, existing Named Entity (NE)
recognizers cannot be applied to this text. We are importing the
remarks into GATE (General Architecture for Text Engineering) in order
to develop a standoff annotation where we encode the domain-specific
classes of NEs. GATE stores the annotations in a relational database
to facilitate complex queries over the text. The student on this
project will assist in porting our existing patterns for Information
Extraction of structure types and numbers (ids for manholes, service
boxes, vaults) and other domain-specific NEs. More importantly, the
student will help develop our meta-language for representing the
content of remarks tickets.
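To give a sense of what pattern-based extraction of structure types and ids might involve, here is a minimal sketch using a plain Python regular expression in place of GATE patterns. The abbreviations (MH, SB, V), the id format, and the sample remark are hypothetical; the real remarks vocabulary and the existing patterns are not shown in this ad.

```python
import re

# Hypothetical abbreviations; the actual ticket vocabulary may differ.
STRUCTURE_TYPES = {
    "MH": "manhole",
    "SB": "service_box",
    "V": "vault",
}

# A structure mention: an abbreviation, an optional "-" or "#" separator,
# then a numeric id.
STRUCTURE_RE = re.compile(r"\b(MH|SB|V)\s*[-#]?\s*(\d+)\b", re.IGNORECASE)

def extract_structures(remark):
    """Return (structure_type, id) pairs found in one remarks string."""
    return [
        (STRUCTURE_TYPES[m.group(1).upper()], m.group(2))
        for m in STRUCTURE_RE.finditer(remark)
    ]

# Made-up remark text, for illustration only.
print(extract_structures("SMOKING MH 12345 F/O 10 MAIN ST, ALSO CHK SB-678"))
```

Each extracted pair could then be stored as an annotation over the matching span of the remark.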
Project 2: Text Normalization and Spelling Correction
The free-text remarks field presents serious challenges for feature
derivation due to the high noise content, which is compounded by the
highly domain-specific vocabulary. For example, the size of the
unigram vocabulary (individual strings composed of alphabetic,
numeric, punctuation or mixed characters) is approximately 75K, of
which only 8K (~ 11%) are alphabetic sequences that match American
English dictionary entries. The remaining "word" types consist of
numeric or mixed-character type strings, domain-specific words or
abbreviations, or misspellings. Longer words tend to have more
misspellings: the word "barricade" and its other forms
("barricades", "barricaded") have approximately 50 variant spellings.
The student on this project will assist in normalizing the
vocabulary. This will involve a range of methods including: pattern
matching within the GATE framework (see project #1), using the GATE
pattern language; development of special-purpose edit-distance
routines; testing and/or adaptation of existing algorithms such as
Double Metaphone.
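As one illustration of an edit-distance routine for this kind of normalization, the sketch below implements standard Levenshtein distance and maps a token to its nearest entry in a small lexicon. The lexicon, the misspelled token, and the distance threshold are hypothetical choices for the example, not taken from the project data; a special-purpose routine might weight edits differently.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def normalize(token, lexicon, max_dist=2):
    """Map token to the closest lexicon entry within max_dist, else keep it."""
    best = min(lexicon, key=lambda w: (edit_distance(token, w), w))
    return best if edit_distance(token, best) <= max_dist else token

# Hypothetical lexicon and misspelling, for illustration only.
LEXICON = ["barricade", "barricaded", "barricades"]
print(normalize("baricaded", LEXICON))
```

Tokens farther than `max_dist` from every lexicon entry are left unchanged, so domain-specific abbreviations are not forced onto dictionary words.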
Requirements: Reliable, attentive to detail, highly motivated, responds to challenges eagerly and creatively.
Desirable skills (any mix): Experience with regular expressions and Unix/Linux scripting, Java, relational databases (especially PostgreSQL), Python, some NLP.
Suitable for: junior, senior, graduate
Contact: Becky Passonneau (becky [at_cs])
Replace [at_cs] with "@cs.columbia.edu"
Topics in Machine Translation (Spring 2008)
Description:
Machine translation (MT) is an area of research focusing on automatic translation of text in one human language (such as Arabic, Spanish or Chinese) into another (e.g. English). Statistical MT is an approach to MT that learns translation models from large parallel text corpora. Examples of parallel corpora include UN documents, European parliament documents, and newswire produced in multiple languages. The translation models are used by MT systems (called "decoders") to convert the source-language text to the target language. Some of the ideas we are interested in include, among others, the following:
- Morphological preprocessing to translate from languages with complex morphology
- Syntactic preprocessing to model word reordering
- Building specialized modules for translation of names of humans and locations between languages using different scripts (such as Arabic and English)
The CADIM group's primary languages of interest are Arabic, Arabic dialects and English. However, we are also interested in languages written in Arabic script (currently Urdu in particular) and other Semitic languages (currently Hebrew in particular). The language pairs we are considering for the MT project are Arabic-English, English-Arabic, Urdu-English, and Hebrew-Arabic. A specific project will be determined after meeting with the student, depending on the student's abilities.
Research Group: Columbia Arabic Dialect Modeling Group (CADIM), Center for Computational Learning Systems (CCLS)
http://www.ccls.columbia.edu/cadim.html
Requirements: No knowledge of Arabic, Hebrew or Urdu is required, but it may be very useful and will make the experience more exciting. Exposure to machine learning and/or linguistics is preferred but not necessary.
Suitable for: junior, senior, graduate
Contact: Dr. Nizar Habash (habash [at_cs])
Replace [at_cs] with "@cs.columbia.edu"
Grammar extraction from treebanks (Summer 2008)
Description:
The job is to write code that extracts various types of grammars from treebanks.
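As a rough idea of what such code might look like, the sketch below reads trees in a Penn Treebank-style bracketed format and counts the context-free rules they contain. Both the input format and the choice of a plain context-free grammar are assumptions for this example; the ad does not say which treebanks or grammar types are involved, and all function names are illustrative.

```python
import re
from collections import Counter

def parse_tree(text):
    """Parse a bracketed tree into (label, children); leaves are strings."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def parse():
        nonlocal pos
        pos += 1                      # consume "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(parse())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                      # consume ")"
        return (label, children)

    return parse()

def productions(tree):
    """Yield one (lhs, rhs) rule per internal node, top-down."""
    label, children = tree
    rhs = tuple(c[0] if isinstance(c, tuple) else c for c in children)
    yield (label, rhs)
    for child in children:
        if isinstance(child, tuple):
            yield from productions(child)

def extract_grammar(treebank):
    """Count context-free rules over an iterable of bracketed trees."""
    counts = Counter()
    for text in treebank:
        counts.update(productions(parse_tree(text)))
    return counts

# Two made-up trees, for illustration only.
grammar = extract_grammar([
    "(S (NP (DT the) (NN dog)) (VP (VBZ barks)))",
    "(S (NP (DT the) (NN cat)) (VP (VBZ sleeps)))",
])
```

Dividing each rule count by the total count for its left-hand side would turn these counts into relative-frequency estimates for a probabilistic grammar.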
Requirements: Good programming skills and at least a minimal knowledge of (or interest in) syntax.
Suitable for: junior, senior, graduate.
This could be for an advanced undergrad or a beginning graduate student,
but other types of candidates are also conceivable. The job will not be a
GRA-ship (i.e., it will not cover tuition), but will be paid by the hour or on a part-time
basis. Rate commensurate with relevant criteria.
Contact: Owen Rambow (rambow [at_cs])
Replace [at_cs] with "@cs.columbia.edu"
Grammar extraction from treebanks (Fall 2008)
Description:
The job is to write code that extracts various types of grammars from treebanks.
Requirements: Good programming skills and at least a minimal knowledge of (or interest in) syntax.
Suitable for: junior, senior, graduate.
This could be for an advanced undergrad or a beginning graduate student,
but other types of candidates are also conceivable. The job will not be a
GRA-ship (i.e., it will not cover tuition), but will be paid by the hour or on a part-time
basis. Rate commensurate with relevant criteria.
Contact: Owen Rambow (rambow [at_cs])
Replace [at_cs] with "@cs.columbia.edu"