Student Projects

Owen Rambow's Homepage

Under Development

Contact me at rambow@ccls.columbia.edu if interested.

Hindi/Urdu Web Search Tool for Linguists (Fall 2008)

The goal is to create a tool that expands a simple query into a complex query for Google or another search engine, and that helps linguists find sentences with particular syntactic properties. For example, we could ask to find sentences that have a pattern of "-nii" followed by a form of "hai", as in "Atif-ko kitaab parh-nii hai". The input would be in Devanagari or (preferred) in Latin script. The project would also involve mapping the input to Urdu orthography and searching on Urdu text. While the immediate application is a tool for linguists to find example sentences, there are also clear commercial applications.
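
As a rough illustration of the query-expansion step, the sketch below assumes that the search engine matches whole words only, so that a suffix pattern has to be enumerated as concrete word forms. The function name `expand_query` and the word lists are hypothetical, the forms are written without the morpheme-boundary hyphen, and the mapping to Devanagari or Urdu script is not addressed.

```python
from itertools import product

def expand_query(first_forms, second_forms):
    """Build one OR-joined query made of quoted two-word phrases."""
    phrases = [f'"{a} {b}"' for a, b in product(first_forms, second_forms)]
    return " OR ".join(phrases)

# Illustrative inputs only: two words ending in -nii and two forms of "hai".
nii_forms = ["parhnii", "likhnii"]
hai_forms = ["hai", "hain"]

query = expand_query(nii_forms, hai_forms)
print(query)
```

A real tool would need a lexicon or morphological generator to produce the word lists, since the number of phrases grows with the product of the two lists.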

The project requires knowledge of Hindi and programming skills. This is a project that does not involve machine learning.

Email and Social Networks (Fall 2008)

We have a database of emails and a database that shows the position of the writers in the company for which they worked. The goal is to use machine learning to predict the organizational position from the emails, and from the social network that the email communication induces.

The project requires familiarity with using machine learning, programming skills, and interest in language use and in social network theory.

Training an English Parser on Unannotated Text (Fall 2008)

We have a parser that consists of two parts: a "supertagger" that associates words in a sentence with tags which contain rich lexical and syntactic information, and the actual parser, which uses this information to create a parse tree. The supertagger suggests the 10 best tags for each word, and the parser selects from among them. The idea is to use data which has not been annotated by humans. We do this by supertagging it, and then having the parser "choose" from the 10-best tags suggested by the supertagger. The resulting supertagged corpus is used to retrain the supertagger.
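
The retraining loop described above can be sketched as follows. Everything here is a toy stand-in and not the actual supertagger or parser: the "supertagger" ranks tags by frequency, the "parser" accepts the first tag sequence that contains exactly one verb tag, and the tag names and sentences are invented for illustration.

```python
from collections import Counter, defaultdict
from itertools import product

def train_supertagger(tagged_corpus, n_best=10):
    """Toy stand-in: rank each word's tags by frequency in the corpus."""
    counts = defaultdict(Counter)
    for words, tags in tagged_corpus:
        for word, tag in zip(words, tags):
            counts[word][tag] += 1

    def n_best_tags(words):
        # Unknown words get a single placeholder tag.
        return [[t for t, _ in counts[w].most_common(n_best)] or ["UNK"]
                for w in words]
    return n_best_tags

def toy_parser_choose(words, candidates):
    """Toy stand-in for the parser: first tag sequence with exactly one verb."""
    for tags in product(*candidates):
        if tags.count("V") == 1:
            return list(tags)
    return None  # no acceptable analysis for this sentence

def self_train(n_best_tags, parser_choose, unannotated, rounds=1):
    """Supertag raw text, let the parser choose, retrain on the result."""
    for _ in range(rounds):
        auto_corpus = []
        for words in unannotated:
            chosen = parser_choose(words, n_best_tags(words))
            if chosen is not None:
                auto_corpus.append((words, chosen))
        n_best_tags = train_supertagger(auto_corpus)
    return n_best_tags

# Hypothetical seed data and unannotated sentences.
seed = [(["dogs", "bark"], ["N", "V"]),
        (["dogs", "bark"], ["N", "V"]),
        (["bark", "falls"], ["N", "V"])]
unannotated = [["dogs", "bark"], ["bark", "falls"]]
retrained = self_train(train_supertagger(seed), toy_parser_choose, unannotated)
print(retrained(["bark", "falls"]))
```

In the second sentence the toy parser overrides the supertagger's first choice for "bark", which is the kind of correction the retraining step is meant to feed back into the supertagger.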

The project requires an understanding of machine learning, interest in syntax, and programming skills.

Reference: Nonlexical Chart Parsing for TAG (Alexis Nasr and Owen Rambow)

Arabic Base Phrase Chunking (Fall 2008)

Base phrase chunking (BPC) refers to identifying syntactically meaningful chunks by using tagging (as opposed to parsing, which is computationally more costly). For Arabic, the problem is to define the correct chunk size: if the chunks are too small, the task is not useful; if they are too large, the accuracy is not good enough.
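
To make the idea of chunking by tagging concrete, the sketch below assumes the common B/I/O encoding, in which each token is tagged as beginning a chunk (B-), continuing one (I-), or lying outside any chunk (O). The encoding, the English example, and the `bio_to_chunks` helper are illustrative assumptions and are not taken from the project itself.

```python
def bio_to_chunks(tokens, tags):
    """Group tokens into (chunk_type, tokens) pairs from B-/I-/O tags."""
    chunks = []
    current_type, current = None, []
    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_type)
        if starts_new or tag == "O":
            if current:  # close the chunk that was open
                chunks.append((current_type, current))
            current_type, current = (tag[2:], [token]) if starts_new else (None, [])
        else:  # an I- tag continuing the open chunk
            current.append(token)
    if current:
        chunks.append((current_type, current))
    return chunks

tokens = ["The", "student", "wrote", "a", "parser", "."]
tags = ["B-NP", "I-NP", "B-VP", "B-NP", "I-NP", "O"]
chunks = bio_to_chunks(tokens, tags)
print(chunks)
```

Under this encoding, choosing the chunk size amounts to choosing which phrase types get their own B-/I- tags and how much material each one may span.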

The project requires an understanding of machine learning and some minimal interest in syntax.

Arabic Morphology (Fall 2004)

This project involves working on Arabic morphology. Arabic morphology is usually considered more complex than English morphology because it involves not only prefixes and suffixes (such as -ed or -ing in English), but also changes to the lexical root itself (perhaps somewhat like English sing, sang, sung, song, but more common). This project will result in an elegant implementation of an existing computational theory of Arabic morphology using a finite-state machine (FSM) toolkit.
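
The changes to the root can be pictured as interleaving the consonants of a root with the vowels of a pattern. The sketch below is plain Python rather than an FSM toolkit, and it is not the computational theory the project implements; the `interdigitate` helper, the "C" placeholder notation, and the simplified transliterations are assumptions made for illustration.

```python
def interdigitate(root, pattern):
    """Fill each 'C' slot in the pattern with the next root consonant."""
    if pattern.count("C") != len(root):
        raise ValueError("pattern must have one C slot per root consonant")
    consonants = iter(root)
    return "".join(next(consonants) if ch == "C" else ch for ch in pattern)

# The root k-t-b ("write") combined with two vowel patterns.
active = interdigitate("ktb", "CaCaC")   # katab
passive = interdigitate("ktb", "CuCiC")  # kutib

# Prefixes and suffixes then attach to the resulting stem.
with_suffix = active + "at"              # katabat
print(active, passive, with_suffix)
```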

The project will expose the student to a cutting-edge theory of morphology, as well as to using FSMs for computational linguistic work. FSMs are extremely useful tools for many areas of speech and natural language processing. This semester-long project could lead to follow-on work in which the system is expanded to handle multiple Arabic dialects. The follow-on work could potentially be funded.

No knowledge of Arabic or any particular FSM toolkit is required, but some facility with programming is expected.

For information, please contact both Owen Rambow (rambow@cs.columbia.edu) and Nizar Habash (habash@cs.columbia.edu).

Dialect Speech Recognition (Fall 2004)

This project aims to find ways of exploiting linguistic knowledge in automatic speech recognition (ASR) for Arabic. Speech recognizers for English typically employ: an acoustic model, which provides weighted hypotheses of what sounds were uttered; a pronunciation dictionary, which mediates between sounds and words; and a language model (LM), which can rank hypothesized utterances by their plausibility. This approach does not carry over easily to Arabic, because the written language is generally not spoken and the spoken language is generally not written (diglossia). The issue is how best to employ language models for written Arabic to inform a speech recognizer for Egyptian Colloquial Arabic. In this project, we use a machine translation module that translates the Egyptian hypotheses into written Arabic strings.

The project consists in finding out how best to combine the "goodness" scores provided by the acoustic component, the Egyptian LM, and the written Arabic LM. The task is made interesting by the fact that the translation software provides not a single written Arabic sentence, but a graph of many possible strings. Furthermore, we hypothesize that different words should trigger different weights for the scores. This project thus aims at a solution that is much more innovative than the standard practice in ASR, which is to find a fixed linear function of the three scores.
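
For reference, the standard practice mentioned above, a fixed linear function of the three scores, can be sketched as follows. The scores are treated as log-probabilities, and all numbers, weights, and hypothesis labels are invented for illustration; the project's proposal would replace the fixed weights with weights that depend on the words in the hypothesis.

```python
def combined_score(hypothesis, weights):
    """Fixed linear combination of the component scores."""
    return sum(weights[name] * hypothesis[name] for name in weights)

# Hypothetical log-probability scores for two competing hypotheses.
hypotheses = [
    {"text": "hypothesis A", "acoustic": -10.0, "egyptian_lm": -4.0, "written_lm": -6.0},
    {"text": "hypothesis B", "acoustic": -9.0, "egyptian_lm": -5.0, "written_lm": -9.0},
]

# Hypothetical fixed weights; in practice these would be tuned on held-out data.
weights = {"acoustic": 1.0, "egyptian_lm": 0.5, "written_lm": 0.5}

best = max(hypotheses, key=lambda h: combined_score(h, weights))
print(best["text"], combined_score(best, weights))
```

Here hypothesis B has the better acoustic score, but the two language models together tip the decision toward hypothesis A.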

This project involves machine learning and corpora. No knowledge of Arabic is required. An interest in the linguistic phenomena of Arabic would be helpful, but is also not required. What is required is facility with programming and an interest in machine learning. The student will be given a good introduction to the use of cutting-edge machine learning tools for speech data.

If interested, please contact both Martin Jansche (jansche@cs.columbia.edu) and Owen Rambow (rambow@cs.columbia.edu).