John's Research Page
I am interested in natural language processing and machine
learning. Here is some information about my research.
Current Focus: Multi-document Text Summarization
Columbia Newsblaster
is a multi-document summarization system that crawls the web each day
for thousands of news articles. It then categorizes, clusters, and
summarizes them.
I am looking at extending Newsblaster. We would like to have
Newsblaster track important events across days. We would also like
to have Newsblaster process not only news articles, but also other
kinds of information.
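To give a feel for the clustering step in a pipeline like this, here is a minimal sketch (not Newsblaster's actual code, whose pipeline is far more sophisticated): articles are grouped greedily by cosine similarity of their bag-of-words vectors, with the threshold value chosen arbitrarily for illustration.

```python
# Toy sketch of the clustering step in a news pipeline: group articles
# whose word-count vectors are similar enough. Hypothetical threshold;
# a real system would use richer features and better clustering.
from collections import Counter
import math

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def cluster(articles, threshold=0.3):
    """Greedy single-pass clustering: join the first cluster whose seed
    article is similar enough, else start a new cluster."""
    clusters = []  # each cluster: list of (text, vector) pairs
    for text in articles:
        vec = Counter(text.lower().split())
        for c in clusters:
            if cosine(vec, c[0][1]) >= threshold:  # compare to cluster seed
                c.append((text, vec))
                break
        else:
            clusters.append([(text, vec)])
    return [[t for t, _ in c] for c in clusters]

articles = [
    "stocks fell sharply on wall street today",
    "wall street stocks fell in heavy trading today",
    "the team won the championship game last night",
]
print(cluster(articles))  # two news stories grouped, sports story alone
```

Once articles are clustered into stories like this, each cluster can be handed off to a multi-document summarizer.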
Surface Realization
FERGUS is a surface realizer. Its job is to determine the best way to say something, after the system has decided what must be said to the user.
I have investigated methods to ease porting of FERGUS to different domains. This includes using automatically generated linguistic resources to train FERGUS, as well as developing a graphical user interface to customize it.
In other experiments, I have looked at the difference between using very large, automatically parsed corpora and moderately sized, hand-annotated corpora to train FERGUS. I have also investigated using linguistically inspired features to improve FERGUS's accuracy.
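One core ingredient of stochastic surface realization is ranking alternative word orders with a language model. The following is an illustrative toy, not FERGUS itself: a bigram model is trained on a tiny invented corpus and used to pick the most fluent permutation of a bag of words.

```python
# Toy sketch of language-model-based realization ranking (invented
# mini-corpus, not FERGUS): score every word order with an
# add-one-smoothed bigram model and keep the most fluent one.
from collections import Counter
from itertools import permutations
import math

def train_bigram(corpus):
    """Count unigrams and bigrams over tokenized sentences."""
    uni, bi = Counter(), Counter()
    for sent in corpus:
        toks = ["<s>"] + sent.split() + ["</s>"]
        uni.update(toks)
        bi.update(zip(toks, toks[1:]))
    return uni, bi

def score(sentence, uni, bi, vocab_size):
    """Add-one-smoothed bigram log-probability of a word sequence."""
    toks = ["<s>"] + sentence.split() + ["</s>"]
    return sum(
        math.log((bi[(a, b)] + 1) / (uni[a] + vocab_size))
        for a, b in zip(toks, toks[1:])
    )

corpus = ["the dog chased the cat", "the cat saw the dog", "the dog ran"]
uni, bi = train_bigram(corpus)
words = ["dog", "the", "chased", "cat", "the"]
best = max(
    (" ".join(p) for p in set(permutations(words))),
    key=lambda s: score(s, uni, bi, len(uni)),
)
print(best)  # prints "the dog chased the cat"
```

Exhaustively scoring permutations only works for toy inputs; a realizer like FERGUS instead constrains the candidate orderings with syntactic structure before the language model ranks them.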
-
John Chen, Srinivas Bangalore, Owen Rambow, and Marilyn Walker.
Towards Automatic Generation of Natural Language Generation
Systems. In Proceedings of the 19th International Conference on
Computational Linguistics (COLING 2002), Taipei, Taiwan, 2002.
[pdf]
-
Srinivas Bangalore, John Chen, and Owen Rambow. Impact of
Quality and Quantity of Corpora on Stochastic Generation. In
Proceedings of the 2001 Conference on Empirical Methods in
Natural Language Processing, Pittsburgh, Pennsylvania, 2001.
[pdf]
Automated Extraction of Tree-Adjoining Grammars from Treebanks
A grammar is a set of rules which, among other things, distinguishes
grammatical from ungrammatical sentences in a language. A tree-adjoining grammar is a kind of grammar that has been found useful both computationally and linguistically.
Development of hand-written tree-adjoining grammars typically
requires many years of human effort. As an alternative, I have written
procedures that automatically extract linguistically plausible
tree-adjoining grammars from a given treebank.
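As a drastically simplified illustration of what "extracting grammar fragments from a treebank" means, the sketch below reads a Penn-style bracketed parse and pulls out each word's spine of labels from the root down to its part of speech. This is a hypothetical toy, not the actual extraction procedure, which additionally uses head rules and an argument/adjunct distinction to build proper elementary trees.

```python
# Toy treebank extraction: for each word in a bracketed parse, recover
# its "spine" of node labels from the root to its POS tag. Real TAG
# extraction builds full elementary trees, not just spines.

def parse(s):
    """Parse a Penn-style bracketed string into (label, children) tuples;
    a preterminal's single child is the word string itself."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    def helper(i):
        assert tokens[i] == "("
        label = tokens[i + 1]
        i += 2
        children = []
        while tokens[i] != ")":
            if tokens[i] == "(":
                child, i = helper(i)
                children.append(child)
            else:  # leaf word
                children.append(tokens[i])
                i += 1
        return (label, children), i + 1
    tree, _ = helper(0)
    return tree

def spines(tree, path=()):
    """Yield (word, spine) pairs; spine = labels from root to the word's POS."""
    label, children = tree
    if len(children) == 1 and isinstance(children[0], str):
        yield children[0], path + (label,)
        return
    for child in children:
        if isinstance(child, tuple):
            yield from spines(child, path + (label,))

tree = parse("(S (NP (NNP John)) (VP (VBZ eats) (NP (NN pizza))))")
for word, spine in spines(tree):
    print(word, "->", "-".join(spine))
```

Run over an entire treebank, even this crude recipe yields a lexicon of reusable tree templates; the published procedures refine each step so that the resulting grammars are linguistically plausible.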
-
John Chen and K. Vijay-Shanker. Automated Extraction of TAGs
from the Penn Treebank. In Proceedings of the Sixth International
Workshop on Parsing Technologies, Trento, Italy, 2000.
Supertagging
Parsing is the task of assigning the most appropriate parse tree
to a given input sentence. Once the computer knows what that parse
tree is, it can more easily figure out the meaning of the sentence.
A perennial problem is figuring out how to perform parsing efficiently
and accurately.
Supertagging has been proposed as one technique to accomplish
exactly that. It is a preprocessing step that chooses a tag to go with
each word of the input sentence. These tags can significantly
reduce the search space for a parser. I have looked at ways to make
supertagging more accurate without sacrificing its efficiency. I have
also investigated class-based tagging as a viable variant of
supertagging.
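The search-space point can be made concrete with a small sketch. The lexicon and counts below are invented for illustration (the tag names only loosely echo XTAG-style conventions): each word admits several supertags, and a baseline tagger that commits to one tag per word collapses the combinatorial space of tag sequences the parser would otherwise explore.

```python
# Toy illustration of why supertagging shrinks a parser's search space.
# Hypothetical lexicon and counts, not a real supertagger.
from math import prod

# word -> {supertag: corpus count}; tag names are invented
lexicon = {
    "time":  {"A_NXN": 12, "B_vxN": 3},
    "flies": {"A_nx0Vnx1": 7, "A_NXN": 5},
    "fast":  {"B_vxARB": 9, "B_An": 4, "A_NXN": 1},
}

def supertag(sentence):
    """Baseline unigram supertagger: pick each word's most frequent supertag."""
    return [max(lexicon[w], key=lexicon[w].get) for w in sentence]

sentence = ["time", "flies", "fast"]
before = prod(len(lexicon[w]) for w in sentence)  # all tag combinations
tags = supertag(sentence)
print(tags)              # one supertag per word
print(before, "-> 1")    # 2 * 2 * 3 = 12 sequences reduced to 1
```

A unigram baseline like this ignores context and so makes avoidable errors; contextual models, reranking, and class-based variants are exactly the kinds of refinements mentioned above.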
-
John Chen, Srinivas Bangalore, Michael Collins, and Owen
Rambow. Reranking an N-Gram Supertagger. In Proceedings
of the Sixth International Workshop on Tree Adjoining Grammars
and Related Frameworks, Venice, Italy, 2002.
[pdf]
-
John Chen and K. Vijay-Shanker. Automated Extraction of TAGs
from the Penn Treebank. In Proceedings of the Sixth International
Workshop on Parsing Technologies, Trento, Italy, 2000.
-
John Chen, Srinivas Bangalore, and K. Vijay-Shanker. New Models
for Improving Supertag Disambiguation. In Proceedings of the
Ninth Conference of the European Chapter of the Association for
Computational Linguistics, Bergen, Norway, 1999.
Semantic Parsing
Semantic parsing is the task of computing the "meaning" of a
given input sentence. The form of this meaning depends upon the
kind of semantic representation that is being assumed. Here we assume
local semantics, which means that the semantic role labels that are assigned
to each predicate's arguments are consistent across syntactic alternations
of different instances of the same predicate, but not necessarily across
different predicates. We find that the use of deep syntactic features
makes it easier to predict local semantics than the use of surface-oriented
syntactic features.
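The advantage of deep features can be illustrated with an invented mini-example. Under a passive alternation the surface subject changes, but the deep (logical) subject stays put, so a rule keyed on deep syntax assigns the same role label across alternations of the same predicate, while a rule keyed on surface syntax does not.

```python
# Toy illustration (invented mini-dataset, not the paper's data or model)
# of deep vs. surface features for semantic role labeling.

# Each argument instance: (predicate, surface_relation, deep_relation, gold_role)
instances = [
    ("eat", "subject", "deep-subject", "ARG0"),  # "John ate the apple"
    ("eat", "object",  "deep-object",  "ARG1"),  # "John ate the apple"
    ("eat", "subject", "deep-object",  "ARG1"),  # "The apple was eaten"
]

def label_surface(rel):
    """Rule keyed on surface grammatical function."""
    return {"subject": "ARG0", "object": "ARG1"}[rel]

def label_deep(rel):
    """Rule keyed on deep (logical) grammatical function."""
    return {"deep-subject": "ARG0", "deep-object": "ARG1"}[rel]

surface_ok = sum(label_surface(s) == g for _, s, _, g in instances)
deep_ok = sum(label_deep(d) == g for _, _, d, g in instances)
print(f"surface: {surface_ok}/3 correct; deep: {deep_ok}/3 correct")
```

The surface rule mislabels the passive subject, while the deep rule is consistent across both instances of "eat"; a statistical labeler benefits from the same normalization.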
-
John Chen and Owen Rambow. Use of Deep Linguistic Features for
the Recognition and Labeling of Semantic Arguments. In Proceedings
of the 2003 Conference on Empirical Methods in Natural Language
Processing, Sapporo, Japan, 2003.
[pdf]