Student Projects
Contact me at rambow@ccls.columbia.edu if interested.
Hindi/Urdu Web Search Tool for Linguists (Fall 2008)
The goal is to create a tool that can expand a query into a complex query
for Google or another search engine, and that helps linguists find
sentences with particular syntactic properties. For example, we could
ask to find sentences that have a pattern of "-nii" followed by a form of
"hai", as in "Atif-ko kitaab parh-nii hai". The input would be in
Devanagari or (preferred) in Latin. The project would also involve
mapping the input to Urdu orthography and searching on Urdu text. While
the immediate application is a tool for linguists to find example
sentences, there are also clear commercial applications.
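To make the query expansion idea concrete, here is a minimal sketch in Python. It assumes the search engine accepts quoted phrases joined by OR. The verb stems and the list of forms of "hai" are hypothetical placeholders chosen for illustration; a real tool would need a proper lexicon and would also handle the Devanagari and Urdu scripts.

```python
from itertools import product

# Hypothetical word lists for illustration only; a real tool would use a lexicon.
VERB_STEMS = ["parh", "likh", "khariid"]
HAI_FORMS = ["hai", "hain", "thii", "thiin"]


def expand_query(stems, suffix, aux_forms):
    """Build one quoted phrase per (stem + suffix, auxiliary) pair, joined with OR."""
    phrases = [f'"{stem}{suffix} {aux}"' for stem, aux in product(stems, aux_forms)]
    return " OR ".join(phrases)


# Expand the pattern: "-nii" followed by a form of "hai".
query = expand_query(VERB_STEMS, "nii", HAI_FORMS)
print(query)
```

The expanded query grows with the product of the two lists, so a practical version would probably have to split it into several smaller queries.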
The project requires knowledge of Hindi and programming skills. This is a
project that does not involve machine learning.
Email and Social Networks (Fall 2008)
We have a database of emails and a database that shows the position of
the writers in the company for which they worked. The goal is to use
machine learning to predict the organizational position from the emails,
and from the social network the email communication induces.
The project requires familiarity with using machine learning, programming
skills, and interest in language use and in social network theory.
Training an English Parser on Unannotated Text (Fall 2008)
We have a parser that consists of two parts: a "supertagger" that
associates words in a sentence with tags which contain rich lexical and
syntactic information, and the actual parser which uses this information
to create a parse tree. The supertagger suggests 10-best tags, and the
parser selects from among them. The idea is to use data which has not
been annotated by humans. We do this by supertagging it, and then having
the parser "choose" from the 10-best tags suggested by the supertagger.
The resulting supertagged corpus is used to retrain the supertagger.
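One round of this procedure could be sketched as follows. This is only an illustration under strong assumptions: the "supertagger" here is a toy table of tag counts per word, and `parser_choose` is a hypothetical callback standing in for the actual parser. Neither reflects the real system.

```python
from collections import Counter, defaultdict


def nbest_tags(model, word, n=10):
    """Return up to n candidate supertags for a word, most frequent first."""
    return [tag for tag, _ in model[word].most_common(n)]


def retrain(tagged_corpus):
    """Re-estimate the toy supertagger by counting (word, tag) pairs."""
    model = defaultdict(Counter)
    for sentence in tagged_corpus:
        for word, tag in sentence:
            model[word][tag] += 1
    return model


def self_train_round(model, sentences, parser_choose, n=10):
    """Supertag unannotated sentences, let the parser pick tags, then retrain."""
    tagged = []
    for words in sentences:
        candidates = [nbest_tags(model, w, n) for w in words]
        # parser_choose (hypothetical) returns one tag per word from the candidates.
        chosen = parser_choose(words, candidates)
        tagged.append(list(zip(words, chosen)))
    return retrain(tagged)
```

In this sketch the retrained model is estimated from the automatically tagged data alone; whether to mix in the original training data is a design choice left open here.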
The project requires an understanding of machine learning, interest in
syntax, and programming skills.
Reference: Nonlexical Chart Parsing for TAG (Alexis Nasr and Owen Rambow)
Arabic Base Phrase Chunking (Fall 2008)
Base phrase chunking (BPC) refers to identifying syntactically meaningful
chunks by using tagging (as opposed to parsing, which is computationally
more costly). For Arabic, the problem is to define the correct chunk size:
if the chunk is too small, the task is not useful; if it is too large, the
accuracy is not good enough.
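To illustrate how chunking can be treated as tagging, the sketch below assumes the common BIO encoding (B-X begins a chunk of type X, I-X continues it, O is outside any chunk) and converts a tag sequence into chunk spans. The encoding and the labels are assumptions for illustration, not a description of the tag set the project would use.

```python
def bio_to_chunks(tags):
    """Convert a list of BIO tags into (label, start, end) spans, end exclusive."""
    chunks = []
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # the sentinel closes a trailing chunk
        continues = tag.startswith("I-") and label == tag[2:]
        if not continues:
            if label is not None:
                chunks.append((label, start, i))
            if tag == "O":
                start, label = None, None
            else:
                start, label = i, tag[2:]
    return chunks


print(bio_to_chunks(["B-NP", "I-NP", "B-VP", "O", "B-NP"]))
# [('NP', 0, 2), ('VP', 2, 3), ('NP', 4, 5)]
```

With this encoding, the question of chunk size becomes a question of which label inventory and which span boundaries the tagger is trained to predict.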
The project requires an understanding of machine learning, and some minimal
interest in syntax.
Arabic Morphology (Fall 2004)
This project involves working on Arabic morphology. Arabic morphology is usually considered more complex than English morphology because it involves not only prefixes and suffixes (such as -ed or -ing in English), but also changes to the lexical root itself (perhaps somewhat like English sing, sang, sung, song, but more common). This project will result in an elegant implementation of an existing computational theory of Arabic morphology using a finite-state machine (FSM) toolkit.
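As a rough illustration of root-internal changes, the sketch below interleaves a consonantal root with a vowel pattern in plain Python. It is not an FSM implementation and is not meant to represent the computational theory the project implements; the function name and the "C" slot notation are assumptions made for this example.

```python
def interdigitate(root, pattern):
    """Fill each 'C' slot in the pattern with the next consonant of the root."""
    consonants = iter(root)
    return "".join(next(consonants) if ch == "C" else ch for ch in pattern)


# The root k-t-b combined with three different patterns.
for pattern in ["CaCaC", "CiCaaC", "CaaCiC"]:
    print(interdigitate("ktb", pattern))
# katab
# kitaab
# kaatib
```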
The project will expose the student to a cutting-edge theory of morphology, as well as to using FSMs for computational linguistic work. FSMs are extremely useful tools for many areas of speech and natural language processing. This semester-long project could lead to follow-on work in which the system is expanded to handle multiple Arabic dialects. The follow-on work could potentially be funded.
No knowledge of Arabic or any particular FSM toolkit is required, but some facility with programming is expected.
For information, please contact both Owen Rambow (rambow@cs.columbia.edu) and Nizar Habash (habash@cs.columbia.edu).
Project: Dialect Speech Recognition (Fall 2004)
This project is aimed at finding ways of exploiting linguistic knowledge in automatic speech recognition (ASR) for Arabic. Speech recognizers for English typically employ: an acoustic model, which provides weighted hypotheses of what sounds were uttered; a pronunciation dictionary, which mediates between sounds and words; and a language model (LM), which can rank hypothesized utterances by their plausibility. This approach does not carry over easily to Arabic, because the written language is generally not spoken and the spoken language not written (diglossia). The issue is how best to employ language models for written Arabic to inform a speech recognizer for Egyptian Colloquial Arabic. In this project, we use a machine translation module that translates the Egyptian hypotheses into written Arabic strings.
The project consists in finding out how best to combine the "goodness" scores provided by the acoustic component, the Egyptian LM, and the written Arabic LM. The task is made interesting by the fact that the translation software provides not a single written Arabic sentence, but a graph of many possible strings. Furthermore, we hypothesize that different words should trigger different weights for the scores. This project thus aims at a solution that is much more innovative than the standard practice in ASR, which is to find a fixed linear function of the three scores.
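For reference, the fixed linear baseline mentioned above can be sketched in a few lines of Python. The score names, the example values (treated as log-probabilities), and the weights are all hypothetical; the sketch ranks a flat list of hypotheses and does not address the graph of translations or word-dependent weights.

```python
def combined_score(scores, weights):
    """Fixed linear combination of the component scores."""
    return sum(weights[name] * scores[name] for name in weights)


def best_hypothesis(hypotheses, weights):
    """Return the hypothesis with the highest combined score."""
    return max(hypotheses, key=lambda h: combined_score(h["scores"], weights))


# Hypothetical hypotheses with made-up log-probability scores.
hypotheses = [
    {"text": "hypothesis A",
     "scores": {"acoustic": -10.0, "egyptian_lm": -5.0, "written_lm": -8.0}},
    {"text": "hypothesis B",
     "scores": {"acoustic": -11.0, "egyptian_lm": -4.0, "written_lm": -3.0}},
]
weights = {"acoustic": 1.0, "egyptian_lm": 1.0, "written_lm": 1.0}
print(best_hypothesis(hypotheses, weights)["text"])
```

Making the weights depend on the words in each hypothesis, as hypothesized in the project description, would replace the single `weights` dictionary with a function of the hypothesis.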
This project involves machine learning and corpora. No knowledge of Arabic is required. An interest in the linguistic phenomena of Arabic would be helpful, but is also not required. What is required is facility with programming and an interest in machine learning. The student will be given a good introduction to the use of cutting-edge machine learning tools for speech data.
If interested, please contact both Martin Jansche (jansche@cs.columbia.edu) and Owen Rambow (rambow@cs.columbia.edu).