Projects
Newsblaster
Newsblaster is a system that helps users find the news that is of most interest to them. The system automatically collects, clusters, categorizes, and summarizes news from several sites on the web (CNN, Reuters, Fox News, etc.) on a daily basis, and it provides users with a user-friendly interface for browsing the results. Articles on the same story from various sources are presented together and summarized using state-of-the-art techniques. The Newsblaster system has already caught the attention of the press and the public. A recent analysis indicates that Newsblaster receives tens of thousands of hits a day, and outlets that have written articles about Newsblaster include the New York Times, USA Today, and Slashdot.
Note: Newsblaster is currently not open to general public use. However, it may be used for educational or non-commercial research purposes. To request a password, send e-mail to ss1792 [at] cs.columbia.edu describing your intended use.
GALE
We are participating in the DARPA Global Autonomous Language Exploitation (GALE) program, a five-year federal initiative that seeks to go far beyond search engine technology to answer complex questions from multilingual, multimodal sources of varying types, including blogs and talk show transcriptions as well as published news. Overall, GALE will draw on automatic speech recognition, machine translation, and summarization technologies to produce answers in real time. The NLP group at Columbia is involved in the translation of Arabic source material, in the annotation of automatically transcribed speech, and in the collection and summarization of the end product. In the Q&A and summarization task, we draw upon a variety of technologies to locate candidate documents, analyze them, and then form a comprehensive answer.
Text to Text Generation
The goal of this project is to generate short, effective, and readable summaries from informal spoken and written texts. Documents such as automatically transcribed speech and email are not guaranteed to be either grammatical or complete, and it is thus necessary to develop robust techniques for text-to-text generation that can transform ill-formed input into concise sentences that are maximally readable. We are in the process of developing a trainable syntax-directed sentence revision system that integrates many knowledge sources, such as syntactic transformation rules, language models, and information retrieval metrics. We expect this generation system to be applicable to a broad set of natural language applications, such as summarization and open-ended question-answering.
Contextually Sensitive Semantic Relationships
Comprehensive semantic resources, such as dictionaries and ontologies, would help natural language applications such as question answering. For example, general questions about a type of event require considerable knowledge to recognize instances of the type and to find the salient details of the occurrence. A computer system answering a question about political unrest in a region needs to know what geographical areas are in the region and what types of activities could be categorized as political unrest. Automated question-answering systems have made much progress on factoid questions, which are very specific questions about a particular object or action, for instance, 'When was Bill Clinton elected president?' But more complex, open-ended questions are much more difficult to answer automatically. They require a huge store of knowledge that doesn't exist in a form that computers can use. Manually compiling this kind of information is a daunting task. Dictionaries and encyclopedias are intended to be read, that is, processed, by human beings. In addition, they are enormously difficult to construct. The available resources are often incomplete or out-of-date, and sometimes misleading. The project we are undertaking will try to avoid some of these problems by using a divide-and-conquer approach. By breaking down large collections of texts into topical clusters, we will reduce the problems of polysemy and jargon.
AQUAINT (Completed)
The goal of the AQUAINT system was to address a scenario in which multiple, inter-related questions are asked in a particular topic area by a skilled, professional information analyst who is attempting to respond to larger, more complex information needs or requirements. Within the AQUAINT project we focused on the issues involved in selecting information at different levels to target questions of several types. These question types include questions whose answers are definitions, opinions, or biographies. We built a prototype system, QUAC, which integrated the work of interpreting the question, searching for relevant documents, selecting text snippets corresponding to the answer, and formulating the answer.
This project was a collaborative effort with a team from the University of Colorado Center for Spoken Language Research.
PERSIVAL (Completed)
In healthcare settings, healthcare consumers and providers both need quick and easy access to a wide range of online resources. PERSIVAL (PErsonalized Retrieval and Summarization of Image, Video And Language resources) aims to provide personalized access to a distributed patient care digital library. PERSIVAL is a joint research initiative spanning the fields of NLP, human-computer interaction, medical informatics, video processing, library science, and cognitive science. Key features of PERSIVAL include personalized access to distributed, multimedia resources available both locally and over the Internet, fusion of repetitive information and identification of conflicting information from multiple relevant sources, and presentation of information in concise multimedia summaries that cross-link images, video, and text. When the latest medical information is provided at the point of patient care, it can help practicing clinicians avoid missed diagnoses and minimize impending complications. When expressed in understandable terms, it can empower patients to take charge of their healthcare.