Ads for Student Research Projects in NLP
American National Corpus (Fall 2008)
Description:
Help build the American National Corpus (ANC) (http://americannationalcorpus.org),
a research oriented collection of written and spoken language from
1990 or later. The ANC includes many genres and sources, including
email, blogs, fiction, newswire, travel guides, medical texts and so
on. The ANC is used by computational linguists and linguists as a
research resource, and is particularly important for training
general purpose tools such as parsers or word-sense disambiguators
intended to handle many variants of English. The National Science
Foundation (NSF) is funding a project called the Manually Annotated
Sub Corpus of the ANC (MASC) which adds annotations, or tags, to the
documents in the corpus that represent knowledge about language,
such as the difference between nouns and verbs. We have received
supplemental NSF funding under the Research Experience for
Undergraduates (REU) program to mentor two undergraduates for the
2008-2009 academic year. The undergraduates we enlist will work on
adding annotations to our corpus that disambiguate word senses.
Words have multiple meanings, and linguists have constructed lexical
resources that encode these meanings. Accurate results from
automated language processing tools of many sorts depend on the
ability to disambiguate word senses. Students on the project will
have an opportunity to participate in the creation of a new and
important layer of annotation in the ANC. They will learn how
senses are represented in WordNet, FrameNet and other lexical
resources. They will be trained in data collection and verification
methods. A modest stipend is associated with the project.
Requirements:
Keen interest in language, meaning and why the same words mean different things in
different contexts. Detail oriented. Excellent organization and time management skills.
Useful skills: Background in linguistics, foreign languages, or related areas.
Suitable for: Undergraduates only, any year.
Contact: Becky Passonneau (becky [at_cs])
Replace [at_cs] with "@cs.columbia.edu"
The Loqui Project (Fall 2008)
Description:
The Loqui project (http://www1.ccls.columbia.edu/~Loqui/)
involves building an automated dialog system to be used over the phone,
meaning that human callers will speak with a computer that can handle
limited types of human dialog. Loqui, a collaborative project with the
City University of New York, is funded by the National Science
Foundation (NSF). We have additional funding to mentor two
undergraduates as part of the NSF Research Experience for Undergraduates
(REU) program. We are looking for two undergraduate computer science
majors. They will be introduced to state-of-the-art software and
techniques in computational linguistics and computer science, and
acquire training in the ethical use of human subjects via Institutional
Review Board certification. They will learn how data about human
dialog is collected in order to inform the design of a dialog system.
They will learn how to enhance the language resources used by dialog
systems, how to use human-human corpora and human-system corpora to
implement and evaluate dialog systems, and how research demands care and
imagination. A modest stipend is associated with the project.
Requirements:
Keen interest in language and how it is used. Good programming skills in any of C, Java, or Perl.
Useful skills:
Background in linguistics, foreign languages, or mathematics. Experience with both Linux and Windows platforms. Interest in telephony and VoIP.
Suitable for: Undergraduates only, any year.
Contact: Becky Passonneau (becky [at_cs])
Replace [at_cs] with "@cs.columbia.edu"
Grammar extraction from treebanks (Fall 2008)
Description:
The job is to write code that extracts various types of grammars from treebanks.
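As a rough illustration of what such code does, the sketch below reads one bracketed parse in the Penn Treebank style and counts the context-free productions it contains, including lexical ones. The bracket format, the function names `parse_tree` and `extract_rules`, and the example sentence are assumptions made for illustration; the ad does not say which treebanks or grammar formalisms the project targets.

```python
from collections import Counter

def parse_tree(text):
    """Parse a bracketed tree such as '(S (NP (DT the) (NN dog)) (VP (VBD barked)))'
    into nested (label, children) tuples; words are plain strings."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1                      # consume '('
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                      # consume ')'
        return (label, children)

    return read()

def extract_rules(tree, counts=None):
    """Count productions (parent label -> child labels or words) in a parsed tree."""
    if counts is None:
        counts = Counter()
    label, children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    counts[(label, rhs)] += 1
    for child in children:
        if not isinstance(child, str):
            extract_rules(child, counts)
    return counts

if __name__ == "__main__":
    tree = parse_tree("(S (NP (DT the) (NN dog)) (VP (VBD barked)))")
    for (lhs, rhs), n in sorted(extract_rules(tree).items()):
        print(lhs, "->", " ".join(rhs), n)
```

Summing the counters over every tree in a treebank would give rule frequencies, which is the usual starting point for a probabilistic grammar.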
Requirements: Good programming skills, and at least a minimal knowledge of syntax (or at least an interest in syntax).
Suitable for: Junior, Senior, Graduate.
This could be for an advanced undergraduate or a beginning graduate student,
but other candidates will also be considered. The job will not be a
GRA-ship (i.e., it will not cover tuition), but will be paid by the hour, or on a part-time
basis. Rate commensurate with relevant criteria.
Contact: Owen Rambow (rambow [at_cs])
Replace [at_cs] with "@cs.columbia.edu"
A Machine Learning Approach for Automatic Labeling of ECS Tickets (Fall 2008)
Description:
Our goal is to develop an automatic labeling method for events that
we have been labeling using a rule-based procedure. The events
come from a trouble-ticket database for the secondary electrical
distribution system of the Consolidated Edison Company of New York.
The Emergency Control Systems (ECS) "tickets" database is a rich
resource for data mining containing approximately 1 million tickets
from all boroughs. Each ECS "trouble ticket" is a report of an event
affecting the New York City electrical distribution system as
recorded by a Con Edison dispatcher. The "front" of each ticket
contains a timestamp, type of event (such as manhole fire or smoking
manhole), address and cross street information where the event
occurred along with other pertinent information. The "back" of the
ticket (called the ECS-Remarks) contains a free-text description of
Con Edison's response and repairs made.
The larger learning task we face is to predict serious events based
on data from several databases, including the ECS tickets. In prior
work, we extracted features from the ECS Remarks for use in two
aspects of learning: labeling the data, and extracting features for
the learning model. Remarks features include external features,
such as the length of the ticket, the trouble type of the ticket
(assigned by Con Edison), and its date, and internal features based
on the content of the ticket, such as what structures (manholes,
service boxes) are mentioned, and how frequently.
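The split between external and internal features can be sketched as follows. This is a minimal illustration, not the project's feature extractor: the function name `remarks_features`, the feature names, the regular expressions, and the sample ticket text are all hypothetical.

```python
import re
from datetime import date

# Structure types named in the text; the patterns themselves are illustrative guesses.
STRUCTURE_PATTERNS = {
    "manhole": re.compile(r"\bmanholes?\b", re.IGNORECASE),
    "service_box": re.compile(r"\bservice\s+box(?:es)?\b", re.IGNORECASE),
}

def remarks_features(remarks, trouble_type, ticket_date):
    """Build a flat feature dict for one ticket (all field names are hypothetical)."""
    features = {
        # External features: properties of the ticket as a whole.
        "length_chars": len(remarks),
        "length_tokens": len(remarks.split()),
        "trouble_type": trouble_type,
        "month": ticket_date.month,
    }
    # Internal features: how often each structure type is mentioned in the text.
    for name, pattern in STRUCTURE_PATTERNS.items():
        features["mentions_" + name] = len(pattern.findall(remarks))
    return features

if __name__ == "__main__":
    # Invented sample text, not a real ticket.
    sample = "Smoke reported from manhole. Crew checked service box and second manhole."
    print(remarks_features(sample, "smoking manhole", date(2008, 7, 4)))
```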
Based on knowledge from subject matter experts (SMEs), we labeled
tickets into two categories: serious or non-serious. In this
project, we are interested in developing an automatic ticket
labeler using machine learning techniques. In particular, we are
interested in the following tasks:
- Refinement of features extracted from ECS Remarks, and possible addition of new features depending on status of concurrent work on spelling normalization.
- Development of classifiers (such as decision trees) for performing the labeling task.
- Extraction of classification rules from trees that may give a better insight on the criteria required for labeling a ticket as serious or non-serious.
- Testing of the rules extracted in step 3 by incorporating them into the models used for ranking structures (manholes and service boxes).
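To make the classifier and rule-extraction tasks concrete, here is a minimal sketch of a decision stump, the simplest possible decision tree, together with a function that prints the learned split as an IF/THEN rule. It is an illustration under stated assumptions, not the project's method: the feature names, the Gini criterion, and the four toy tickets are invented, and a real system would grow deeper trees with an established library.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def majority(labels):
    """Most frequent label in the list."""
    return max(set(labels), key=labels.count)

def fit_stump(rows, labels):
    """Find the (feature, threshold) split with the lowest weighted Gini impurity.

    rows is a list of dicts holding numeric features."""
    best = None
    for feature in rows[0]:
        for threshold in sorted({row[feature] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
            if not left or not right:
                continue  # a split must put tickets on both sides
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best["score"]:
                best = {"feature": feature, "threshold": threshold, "score": score,
                        "left": majority(left), "right": majority(right)}
    return best

def predict(stump, row):
    """Label one ticket with the fitted stump."""
    if row[stump["feature"]] <= stump["threshold"]:
        return stump["left"]
    return stump["right"]

def stump_to_rule(stump):
    """Render the split as a human-readable classification rule."""
    return "IF {feature} <= {threshold} THEN {left} ELSE {right}".format(**stump)

if __name__ == "__main__":
    # Toy data invented for illustration only.
    rows = [
        {"mentions_manhole": 0, "length_tokens": 12},
        {"mentions_manhole": 1, "length_tokens": 40},
        {"mentions_manhole": 3, "length_tokens": 35},
        {"mentions_manhole": 4, "length_tokens": 15},
    ]
    labels = ["non-serious", "non-serious", "serious", "serious"]
    print(stump_to_rule(fit_stump(rows, labels)))
```

Rules in this readable form are what subject matter experts could inspect when judging whether a learned criterion for "serious" makes sense.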
Requirements: Knowledge of PostgreSQL, and proficiency in Java or Matlab
recommended. A strong background in algorithms and machine learning is a
plus. Both undergraduate and graduate students with relevant expertise
are encouraged to apply.
Contact: Rebecca Passonneau (becky [at_cs])
and Haimonti Dutta (haimonti [at_ccls])
Replace [at_cs] with "@cs.columbia.edu", and [at_ccls] with "@ccls.columbia.edu".
ARA -- Automated Readers Advisor (Summer 2008)
Description:
We are collaborating with the Andrew Heiskell Talking Book and Braille
Library (part of the New York Public Library) and City University of New
York on a project to design an automated dialogue system for the
library's 15,000 patrons to handle simple library transactions. For the
same reasons that qualify patrons to become library users, most cannot
conveniently travel to the library and visually browse the collection,
so most transactions are currently handled by phone. In the summer of
2006, we collected just under 200 recorded calls made to a subset of
human Readers Advisors (librarians and other staff), and have
transcribed a large portion of them. We are currently implementing an
initial baseline dialog system using the Olympus/Ravenclaw tools from
Carnegie Mellon University. One or more student projects are available
within this project that would involve analysis of the transcribed
human-human dialogs, testing and enhancing our initial dialog system, or
a combination of the two.
Requirements: Some exposure to NLP; reliable, attentive to detail, highly motivated, and able to respond to challenges eagerly and creatively.
Useful skills: Experience with annotating corpora, or using annotated corpora; familiarity with C++, Perl, PostgreSQL, MySQL, or other databases.
Suitable for: junior, senior, graduate
Contact: Becky Passonneau (becky [at_cs])
Replace [at_cs] with "@cs.columbia.edu"
CLiMB -- Computational Linguistics for Metadata Building (Summer 2008)
Description:
Digital image collections are increasing in number and size at an
enormous rate, including collections associated with museums,
libraries (New York Public Library; Getty Library), or online
collections like ARTstor. CLiMB is a collaborative project (with
University of Maryland) to develop automatic methods for extracting
metadata from scholarly texts, in order to index digital art
collections with subject matter descriptions. The Columbia component
involves classifying sentences from art history survey texts into
semantic categories pertaining to their discourse function.
Functional classes include describing the image, providing
biographical background about the artist, interpreting the art
historical significance of the work, and so on. We are working with
an analog to an ARTstor image collection and two art history survey
texts. We are investigating automated methods to assign semantic
scores to words from extracted sentences based on their closeness to
relevant semantic domains, such as color, anatomy, and so on. To
compute semantic distance in these domains, we will compare
electronically available ontologies and lexicons such as WordNet and
the Getty Art and Architecture Thesaurus. The project tasks will
include developing subroutines to query these resources, developing
evaluation suites to test the resulting scores, and integrating the
scores into feature sets for machine learning.
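One simple way to score a word's closeness to a semantic domain is path-based similarity over a hypernym hierarchy. The sketch below uses a tiny hand-made taxonomy as a stand-in; it is not WordNet or Art and Architecture Thesaurus data, and the names `HYPERNYMS`, `path_similarity`, and `domain_score` are hypothetical. The project may well use a different measure.

```python
# Toy hypernym links (child -> parent), invented as a stand-in for a real resource.
HYPERNYMS = {
    "crimson": "red",
    "red": "color",
    "azure": "blue",
    "blue": "color",
    "color": "attribute",
    "torso": "body_part",
    "hand": "body_part",
    "body_part": "anatomy",
}

def ancestors(word):
    """Return [word, parent, grandparent, ...] by following hypernym links."""
    chain = [word]
    while chain[-1] in HYPERNYMS:
        chain.append(HYPERNYMS[chain[-1]])
    return chain

def path_similarity(a, b):
    """1 / (1 + length of the path joining a and b through their lowest
    shared ancestor); 0.0 when the two words share no ancestor."""
    chain_a, chain_b = ancestors(a), ancestors(b)
    depth_b = {node: i for i, node in enumerate(chain_b)}
    for i, node in enumerate(chain_a):
        if node in depth_b:
            return 1.0 / (1 + i + depth_b[node])
    return 0.0

def domain_score(word, domain):
    """Score a word by its closeness to a domain node such as 'color'."""
    return path_similarity(word, domain)

if __name__ == "__main__":
    for word in ("red", "crimson", "torso"):
        print(word, domain_score(word, "color"), domain_score(word, "anatomy"))
```

Scores like these could be attached to each word of an extracted sentence and aggregated into per-domain features for a sentence classifier.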
Requirements:
Desirable experience/skills include familiarity with one or more NLP
tools or resources for language analysis (taggers, parsers, WordNet);
familiarity with the Weka data mining toolset; familiarity with Python.
Suitable for: junior, senior, graduate
Contact: Becky Passonneau (becky [at_cs])
Replace [at_cs] with "@cs.columbia.edu"
Processing Trouble Tickets; Con Edison Secondary Events (two projects) (Summer 2008)
Description:
The Secondary Events project at the Center for Computational Learning
Systems (CCLS) works with data from Con Edison's secondary
distribution network. The two learning tasks we address are to predict
problematic events in the network before they happen, and to rank the
vulnerability of structures in this network to such events. We have
devoted considerable effort to assembling a consolidated database from
disparate sources after cleaning, extending and joining data collected
over the past ten years. The effort has paid off in initial success
in our two learning problems for data from Manhattan, using small
models.
We now turn to investigating whether we can derive a larger set of
features from a free-text field of trouble ticket data. Two summer
positions are available on this phase.
Project 1: Text Engineering Applied to Remarks Fields of Trouble Tickets
Relational databases often have free-text fields, but extracting
meaningful semantic content from free-text presents serious
challenges. The trouble ticket remarks fields are especially
challenging because the text is highly domain-specific, with many
types of domain-specific expressions that are essentially Named
Entities (proper nouns). As a result, existing Named Entity (NE)
recognizers cannot be applied to this text. We are importing the
remarks into GATE (General Architecture for Text Engineering) in order
to develop a standoff annotation where we encode the domain-specific
classes of NEs. GATE stores the annotations in a relational database
to facilitate complex queries over the text. The student on this
project will assist in porting our existing patterns for Information
Extraction of structure types and numbers (ids for manholes, service
boxes, vaults) and other domain-specific NEs. More importantly, the
student will help develop our meta-language for representing the
content of remarks tickets.
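The kind of pattern involved in extracting structure types and numbers can be sketched with plain Python regular expressions, used here as a stand-in for GATE's own pattern language. The surface forms assumed below (a structure word or an abbreviation such as "MH" or "SB", followed by an optional "#" or "-" and a number) are hypothetical; the actual notation used in the remarks is not described here.

```python
import re

# Hypothetical surface forms: a structure word or abbreviation followed by an id.
STRUCTURE_ID = re.compile(
    r"\b(?P<type>manhole|mh|service\s+box|sb|vault)\b\s*[#-]?\s*(?P<id>\d+)",
    re.IGNORECASE,
)

# Map each assumed surface form to one canonical structure type.
CANONICAL = {
    "manhole": "manhole", "mh": "manhole",
    "service box": "service_box", "sb": "service_box",
    "vault": "vault",
}

def extract_structures(remarks):
    """Return standoff-style annotations: (start, end, type, id) for each mention.

    Offsets point into the original text, which is left unmodified."""
    annotations = []
    for match in STRUCTURE_ID.finditer(remarks):
        surface = " ".join(match.group("type").lower().split())
        annotations.append(
            (match.start(), match.end(), CANONICAL[surface], match.group("id"))
        )
    return annotations

if __name__ == "__main__":
    # Invented sample text, not a real ticket.
    text = "Fire in MH 1234, crew also opened service box #56 and Vault-7."
    for start, end, kind, ident in extract_structures(text):
        print(start, end, kind, ident, repr(text[start:end]))
```

Keeping character offsets rather than rewriting the text mirrors the standoff idea: annotations live beside the remarks and can be queried or revised without touching the source.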
Project 2: Text Normalization and Spelling Correction
The free-text remarks field presents serious challenges for feature
derivation due to the high noise content, which is compounded by the
highly domain-specific vocabulary. For example, the size of the
unigram vocabulary (individual strings composed of alphabetic,
numeric, punctuation or mixed characters) is approximately 75K, of
which only 8K (~ 11%) are alphabetic sequences that match American
English dictionary entries. The remaining "word" types consist of
numeric or mixed-character type strings, domain-specific words or
abbreviations, or misspellings. Longer words tend to have more
misspellings; for example, the word "barricade" and its other forms
("barricades", "barricaded") have approximately 50 variant spellings.
The student on this project will assist in normalizing the
vocabulary. This will involve a range of methods including: pattern
matching within the GATE framework (see project #1), using the GATE
pattern language; development of special-purpose edit-distance
routines; testing and/or adaptation of existing algorithms such as
Double Metaphone.
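As a baseline for the edit-distance part of this work, the sketch below implements standard Levenshtein distance and uses it to map a noisy token to the closest entry in a small lexicon. This is the generic algorithm, not the special-purpose routine the project intends to develop; the function names, the distance cutoff of 2, and the misspellings in the usage example are assumptions made for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character insertions,
    deletions, and substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]

def normalize(token, lexicon, max_distance=2):
    """Map a token to its closest lexicon entry, or return it lowercased
    when nothing lies within max_distance edits."""
    token = token.lower()
    best = min(lexicon, key=lambda word: (edit_distance(token, word), word))
    return best if edit_distance(token, best) <= max_distance else token

if __name__ == "__main__":
    lexicon = ["barricade", "barricades", "barricaded", "manhole"]
    # Invented misspellings, not taken from the ticket data.
    for noisy in ("baricade", "barracaded", "manhol", "1234"):
        print(noisy, "->", normalize(noisy, lexicon))
```

A fixed cutoff is crude for a vocabulary where longer words attract more errors; scaling the allowed distance with word length would be one natural refinement.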
Requirements: Reliable, attentive to detail, highly motivated, and able to respond to challenges eagerly and creatively.
Desirable skills (any mix): Experience with regular expressions and Unix/Linux scripting, Java, relational databases (especially PostgreSQL), Python, some NLP.
Suitable for: junior, senior, graduate
Contact: Becky Passonneau (becky [at_cs])
Replace [at_cs] with "@cs.columbia.edu"
Grammar extraction from treebanks (Summer 2008)
Description:
The job is to write code that extracts various types of grammars from treebanks.
Requirements: Good programming skills, and at least a minimal knowledge of syntax (or at least an interest in syntax).
Suitable for: Junior, Senior, Graduate.
This could be for an advanced undergraduate or a beginning graduate student,
but other candidates will also be considered. The job will not be a
GRA-ship (i.e., it will not cover tuition), but will be paid by the hour, or on a part-time
basis. Rate commensurate with relevant criteria.
Contact: Owen Rambow (rambow [at_cs])
Replace [at_cs] with "@cs.columbia.edu"