John's Research Page
I am interested in natural language processing and machine
learning. Here is some information about my research.
Current Focus: Multi-document Text Summarization
Columbia Newsblaster
is a multi-document summarization system that crawls the web each day
for thousands of news articles. It then categorizes, clusters, and
summarizes them.
I am looking at extending Newsblaster. We would like to have
Newsblaster track important events across days. We would also like
to have Newsblaster process not only news articles, but also other
kinds of information.
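To give a feel for the clustering step in a pipeline like this, here is a minimal sketch (not Newsblaster's actual code, whose pipeline is far more sophisticated): articles are grouped greedily by cosine similarity of their bag-of-words vectors, with the threshold value chosen arbitrarily for illustration.

```python
# Toy sketch of the clustering step in a news pipeline: group articles
# whose word-count vectors are similar enough. Hypothetical threshold;
# a real system would use richer features and better clustering.
from collections import Counter
import math

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def cluster(articles, threshold=0.3):
    """Greedy single-pass clustering: join the first cluster whose seed
    article is similar enough, else start a new cluster."""
    clusters = []  # each cluster: list of (text, vector) pairs
    for text in articles:
        vec = Counter(text.lower().split())
        for c in clusters:
            if cosine(vec, c[0][1]) >= threshold:  # compare to cluster seed
                c.append((text, vec))
                break
        else:
            clusters.append([(text, vec)])
    return [[t for t, _ in c] for c in clusters]

articles = [
    "stocks fell sharply on wall street today",
    "wall street stocks fell in heavy trading today",
    "the team won the championship game last night",
]
print(cluster(articles))  # two news stories grouped, sports story alone
```

Once articles are clustered into stories like this, each cluster can be handed off to a multi-document summarizer.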
Surface Realization
FERGUS is a surface realizer. Its job is to determine the best way to say something, after the system has decided what must be said to the user.
I have investigated methods to ease porting of FERGUS to different domains. This includes using automatically generated linguistic resources to train FERGUS, as well as developing a graphical user interface to customize it.
In other experiments, I have looked at the difference between using very large, automatically parsed corpora and moderately sized, hand-annotated corpora to train FERGUS. I have also investigated using linguistically inspired features to improve FERGUS's accuracy.
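One core ingredient of stochastic surface realization is ranking alternative word orders with a language model. The following is an illustrative toy, not FERGUS itself: a bigram model is trained on a tiny invented corpus and used to pick the most fluent permutation of a bag of words.

```python
# Toy sketch of language-model-based realization ranking (invented
# mini-corpus, not FERGUS): score every word order with an
# add-one-smoothed bigram model and keep the most fluent one.
from collections import Counter
from itertools import permutations
import math

def train_bigram(corpus):
    """Count unigrams and bigrams over tokenized sentences."""
    uni, bi = Counter(), Counter()
    for sent in corpus:
        toks = ["<s>"] + sent.split() + ["</s>"]
        uni.update(toks)
        bi.update(zip(toks, toks[1:]))
    return uni, bi

def score(sentence, uni, bi, vocab_size):
    """Add-one-smoothed bigram log-probability of a word sequence."""
    toks = ["<s>"] + sentence.split() + ["</s>"]
    return sum(
        math.log((bi[(a, b)] + 1) / (uni[a] + vocab_size))
        for a, b in zip(toks, toks[1:])
    )

corpus = ["the dog chased the cat", "the cat saw the dog", "the dog ran"]
uni, bi = train_bigram(corpus)
words = ["dog", "the", "chased", "cat", "the"]
best = max(
    (" ".join(p) for p in set(permutations(words))),
    key=lambda s: score(s, uni, bi, len(uni)),
)
print(best)  # prints "the dog chased the cat"
```

Exhaustively scoring permutations only works for toy inputs; a realizer like FERGUS instead constrains the candidate orderings with syntactic structure before the language model ranks them.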
-
John Chen, Srinivas Bangalore, Owen Rambow, and Marilyn Walker.
Towards Automatic Generation of Natural Language Generation
Systems. In Proceedings of the 19th International Conference on
Computational Linguistics (COLING 2002), Taipei, Taiwan, 2002.
[pdf]
-
Srinivas Bangalore, John Chen, and Owen Rambow. Impact of
Quality and Quantity of Corpora on Stochastic Generation. In
Proceedings of the 2001 Conference on Empirical Methods in
Natural Language Processing, Pittsburgh, Pennsylvania, 2001.
[pdf]
Automated Extraction of Tree-Adjoining Grammars from Treebanks
A grammar is a set of rules which, among other things, distinguishes
grammatical from ungrammatical sentences in a language. A tree-adjoining grammar is a kind of grammar that has been found useful both computationally and linguistically.
Development of hand-written tree-adjoining grammars typically
requires many years of human effort. As an alternative, I have written
procedures that automatically extract linguistically plausible
tree-adjoining grammars from a given treebank.
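As a drastically simplified illustration of what "extracting grammar fragments from a treebank" means, the sketch below reads a Penn-style bracketed parse and pulls out each word's spine of labels from the root down to its part of speech. This is a hypothetical toy, not the actual extraction procedure, which additionally uses head rules and an argument/adjunct distinction to build proper elementary trees.

```python
# Toy treebank extraction: for each word in a bracketed parse, recover
# its "spine" of node labels from the root to its POS tag. Real TAG
# extraction builds full elementary trees, not just spines.

def parse(s):
    """Parse a Penn-style bracketed string into (label, children) tuples;
    a preterminal's single child is the word string itself."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    def helper(i):
        assert tokens[i] == "("
        label = tokens[i + 1]
        i += 2
        children = []
        while tokens[i] != ")":
            if tokens[i] == "(":
                child, i = helper(i)
                children.append(child)
            else:  # leaf word
                children.append(tokens[i])
                i += 1
        return (label, children), i + 1
    tree, _ = helper(0)
    return tree

def spines(tree, path=()):
    """Yield (word, spine) pairs; spine = labels from root to the word's POS."""
    label, children = tree
    if len(children) == 1 and isinstance(children[0], str):
        yield children[0], path + (label,)
        return
    for child in children:
        if isinstance(child, tuple):
            yield from spines(child, path + (label,))

tree = parse("(S (NP (NNP John)) (VP (VBZ eats) (NP (NN pizza))))")
for word, spine in spines(tree):
    print(word, "->", "-".join(spine))
```

Run over an entire treebank, even this crude recipe yields a lexicon of reusable tree templates; the published procedures refine each step so that the resulting grammars are linguistically plausible.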
-
John Chen and K. Vijay-Shanker. Automated Extraction of TAGs
from the Penn Treebank. In Proceedings of the Sixth International
Workshop on Parsing Technologies, Trento, Italy, 2000.
Supertagging
Parsing is the task of assigning the most appropriate parse tree
to a given input sentence. Once the computer knows what that parse
tree is, it can more easily figure out the meaning of the sentence.
A perennial problem is figuring out how to perform parsing efficiently
and accurately.
Supertagging has been proposed as one technique to accomplish
exactly that. It is a preprocessing step that chooses a tag to go with
each word of the input sentence. These tags can significantly
reduce the search space for a parser. I have looked at ways to make
supertagging more accurate without sacrificing its efficiency. I have
also investigated class-based tagging as a viable variant of
supertagging.
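The search-space point can be made concrete with a small sketch. The lexicon and counts below are invented for illustration (the tag names only loosely echo XTAG-style conventions): each word admits several supertags, and a baseline tagger that commits to one tag per word collapses the combinatorial space of tag sequences the parser would otherwise explore.

```python
# Toy illustration of why supertagging shrinks a parser's search space.
# Hypothetical lexicon and counts, not a real supertagger.
from math import prod

# word -> {supertag: corpus count}; tag names are invented
lexicon = {
    "time":  {"A_NXN": 12, "B_vxN": 3},
    "flies": {"A_nx0Vnx1": 7, "A_NXN": 5},
    "fast":  {"B_vxARB": 9, "B_An": 4, "A_NXN": 1},
}

def supertag(sentence):
    """Baseline unigram supertagger: pick each word's most frequent supertag."""
    return [max(lexicon[w], key=lexicon[w].get) for w in sentence]

sentence = ["time", "flies", "fast"]
before = prod(len(lexicon[w]) for w in sentence)  # all tag combinations
tags = supertag(sentence)
print(tags)              # one supertag per word
print(before, "-> 1")    # 2 * 2 * 3 = 12 sequences reduced to 1
```

A unigram baseline like this ignores context and so makes avoidable errors; contextual models, reranking, and class-based variants are exactly the kinds of refinements mentioned above.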
-
John Chen, Srinivas Bangalore, Michael Collins, and Owen
Rambow. Reranking an N-Gram Supertagger. In Proceedings
of the Sixth International Workshop on Tree Adjoining Grammars
and Related Frameworks, Venice, Italy, 2002.
[pdf]
-
John Chen and K. Vijay-Shanker. Automated Extraction of TAGs
from the Penn Treebank. In Proceedings of the Sixth International
Workshop on Parsing Technologies, Trento, Italy, 2000.
-
John Chen, Srinivas Bangalore, and K. Vijay-Shanker. New Models
for Improving Supertag Disambiguation. In Proceedings of the
Ninth Conference of the European Chapter of the Association for
Computational Linguistics, Bergen, Norway, 1999.
Semantic Parsing
Semantic parsing is the task of computing the "meaning" of a
given input sentence. The form of this meaning depends upon the
kind of semantic representation that is being assumed. Here we assume
local semantics, which means that the semantic role labels that are assigned
to each predicate's arguments are consistent across syntactic alternations
of different instances of the same predicate, but not necessarily across
different predicates. We find that the use of deep syntactic features
makes it easier to predict local semantics than the use of surface-oriented
syntactic features.
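The advantage of deep features can be illustrated with an invented mini-example. Under a passive alternation the surface subject changes, but the deep (logical) subject stays put, so a rule keyed on deep syntax assigns the same role label across alternations of the same predicate, while a rule keyed on surface syntax does not.

```python
# Toy illustration (invented mini-dataset, not the paper's data or model)
# of deep vs. surface features for semantic role labeling.

# Each argument instance: (predicate, surface_relation, deep_relation, gold_role)
instances = [
    ("eat", "subject", "deep-subject", "ARG0"),  # "John ate the apple"
    ("eat", "object",  "deep-object",  "ARG1"),  # "John ate the apple"
    ("eat", "subject", "deep-object",  "ARG1"),  # "The apple was eaten"
]

def label_surface(rel):
    """Rule keyed on surface grammatical function."""
    return {"subject": "ARG0", "object": "ARG1"}[rel]

def label_deep(rel):
    """Rule keyed on deep (logical) grammatical function."""
    return {"deep-subject": "ARG0", "deep-object": "ARG1"}[rel]

surface_ok = sum(label_surface(s) == g for _, s, _, g in instances)
deep_ok = sum(label_deep(d) == g for _, _, d, g in instances)
print(f"surface: {surface_ok}/3 correct; deep: {deep_ok}/3 correct")
```

The surface rule mislabels the passive subject, while the deep rule is consistent across both instances of "eat"; a statistical labeler benefits from the same normalization.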
-
John Chen and Owen Rambow. Use of Deep Linguistic Features for
the Recognition and Labeling of Semantic Arguments. In Proceedings
of the 2003 Conference on Empirical Methods in Natural Language
Processing, Sapporo, Japan, 2003.
[pdf]