Projects

» Newsblaster
» GALE
» SCIL
» Understanding Written Narrative
» STAGES
» Text to Text Generation
» AQUAINT (Completed)
» PERSIVAL (Completed)
» Additional NLP projects can be found under the labs or people pages.
» Undergraduate and master's students interested in research projects: please visit our Ad page.

Newsblaster

Newsblaster is a system that helps users find the news that is of most interest to them. The system automatically collects, clusters, categorizes, and summarizes news from several sites on the web (CNN, Reuters, Fox News, etc.) on a daily basis, and it provides a user-friendly interface for browsing the results. Articles on the same story from various sources are presented together and summarized using state-of-the-art techniques. The Newsblaster system has already caught the attention of the press and public. A recent analysis indicates that Newsblaster receives tens of thousands of hits a day, and outlets that have written articles about Newsblaster include the New York Times, USA Today, and Slashdot.

» http://newsblaster.cs.columbia.edu

GALE

We are participating in the DARPA Global Autonomous Language Exploitation (GALE) program, a five-year federal initiative that seeks to go far beyond search engine technology to answer complex questions from multilingual, multimodal sources of varying types, including blogs and talk show transcriptions as well as published news. Overall, GALE will draw on automatic speech recognition, machine translation, and summarization technologies to produce answers in real time. The NLP group at Columbia is involved in the translation of Arabic source material and in aspects of multilingual QA. We are working on detecting and correcting translations included in the answer based on the extra information that is available in the QA context.

SCIL

We are participating in the IARPA Socio-Cultural Content in Language (SCIL) program. Over the course of the project we are attempting to predict power, influence, and rifts in social groups through linguistic analysis. In this research we analyze unstructured text such as blogs, forums, chats, and emails in English, Arabic, and Urdu, examining social networks and linguistic features such as opinion, syntax, and dialog across conversations. We attempt to identify how language usage correlates with social phenomena such as regional origin, age, gender, and religious affiliation.

Understanding Written Narrative

The Scheherazade project's goal is to advance the state of the art in automatic understanding of written narrative, especially literature. Its specific aims are to: (1) describe a new formalism for modeling stories, including actions, time, and causality; (2) implement the formalism in a toolkit in which a user can encode and reason over stories; (3) automatically find connections between stories, in both structure and content; and (4) make progress toward automatically encoding literature, such as automatically attributing quoted speech, extracting social networks, and identifying scenes.

STAGES

STAGES (Statistical Translation And GEneration using Semantics) is a joint machine translation project. Its new approach to MT, combining semantic analysis, new forms of statistical MT and language generation, is aimed at handling fundamental differences in how Chinese and English encode information. These include differences in event descriptions (predicate-argument structures), realization of tense, grammatical function words, constituent ordering (particularly when long distance dependencies are involved), and discourse relations. The NLP group at Columbia is focusing on fusing the input from the statistical MT systems and the semantic analysis of the source and target language sentences to produce a coherent, faithful and grammatical translation.

Text to Text Generation

The goal of this project is to generate short, effective, and readable summaries from informal spoken and written texts. Documents such as automatically transcribed speech and email are not guaranteed to be either grammatical or complete, and it is thus necessary to develop robust techniques for text-to-text generation that can transform ill-formed input into concise sentences that are maximally readable. We are in the process of developing a trainable syntax-directed sentence revision system that integrates many knowledge sources, such as syntactic transformation rules, language models, and information retrieval metrics. We expect this generation system to be applicable to a broad set of natural language applications, such as summarization and open-ended question-answering.

AQUAINT (Completed)

The goal of the AQUAINT system was to address a scenario in which multiple, inter-related questions are asked in a particular topic area by a skilled, professional information analyst who is attempting to respond to larger, more complex information needs or requirements. Within the AQUAINT project we focused on the issues involved in selecting information at different levels for questions of several types. These include questions whose answers are definitions, opinions, or biographies. We built a prototype system, QUAC, which integrated the work of interpreting the question, searching for relevant documents, selecting text snippets corresponding to the answer, and formulating the answer.

This project was a collaborative effort with a team from the University of Colorado Center for Spoken Language Research.

PERSIVAL (Completed)

In healthcare settings, consumers and providers both need quick and easy access to a wide range of online resources. PERSIVAL (PErsonalized Retrieval and Summarization of Image, Video And Language resources) aims to provide personalized access to a distributed patient care digital library. PERSIVAL is a joint research initiative across the fields of NLP, human-computer interaction, medical informatics, video processing, library science, and cognitive science. Key features of PERSIVAL include personalized access to distributed, multimedia resources available both locally and over the Internet, fusion of repetitive information and identification of conflicting information from multiple relevant sources, and presentation of information in concise multimedia summaries that cross-link images, video, and text. When the latest medical information is provided at the point of patient care, it can help practicing clinicians to avoid missed diagnoses and minimize impending complications. When expressed in understandable terms, it can empower patients to take charge of their healthcare.

webmaster - fl2301x[at]xcolumbia.edu
