AIS Final Project write-up

By Janak J Parekh and Scott Susser

Introduction

Relative temporal networks as proposed by Allen are extremely useful for a wide range of applications. However, they do not adequately handle prose or news literature, since they require the constraints to be explicitly specified and enumerated. Using natural-language processing, we aim to apply the temporal network algorithms previously presented to a corpus of articles, broadly expanding the potential applications of Allen's algorithms.

Problem

Given an implementation of Allen's algorithm and a series of articles, we preprocess the articles, feed the extracted relationships into the Allen network, propagate the constraints to obtain a temporal network, and flatten that network into a sequence of events.

Approach

CASS preprocessing of articles. We preprocess the articles using well-known NLP tools. First, the corpus of articles is run through CASS (http://www.sfs.nphil.uni-tuebingen.de/~abney/), a well-known deterministic partial parser designed for large amounts of text. It parses sentences into machine-readable fragments using a grammar we defined. The grammar is built bottom-up, and looks like a combination of BNF and regular expressions:

:np
 
    NOM = nx | mx | cdx | place | person | name | ci-st | doll | date | today;
    np -> only? (NOM | nmess) pos h=NOM?;
    
The :np delineates this section as a noun-phrase section; the noun phrase can then be developed from any of the rules such as nx, mx, cdx, and so on, which have been defined earlier in the grammar. This allows us to build up a grammar from the low-level constituents into higher-level phrases, and finally into sentences and entire documents. Having done this, we can run CASS on the text, and output similar to the following excerpt is generated:
  [nbest
    [nx
      [name
        [nnp Clinton]]]]
  [dsh :]
  [nbest
    [nx
      [prp It]]
    
The above is a brief excerpt of CASS output for a text article. As you can see, Clinton is assigned a noun term, which is classified within name, and so on. Classifications are made based on the grammar and a tagfile, which exists to eliminate ambiguities about certain terms (e.g., CASS has no knowledge of the word "Clinton", so we can add a tag for it). CASS is also extremely useful because it includes a part-of-speech tagger. All told, once the desired grammar has been specified, it takes care of most of the necessary sentence parsing.

The major modification made to the CASS POS tagger was to recognize user-defined temporal words and their surrounding events. Six words were designated (with the option of modifiers) as temporal key words: before, after, during, until, while, and meanwhile. In addition, a second new classification was devised: an umbrella noun, which recognizes the sets of noun phrases that make up an event occurring before or after a temporal event. A hypothetical example of such a rule is sketched below.
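
For illustration only, a temporal-word rule in the same style as the :np excerpt above might look like the following; the identifiers temp, TW, and mod are hypothetical, not the names used in our actual grammar:

:temp

    TW = before | after | during | until | while | meanwhile;
    temp -> mod? TW;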

The next stage in processing was to search the CASS-tagged article, extract the events and temporal phrases, and pair them appropriately. A parse program was written in C that scanned the CASS-tagged file for an event. Once an event was found, the parser searched for a temporal phrase. If one was found, a temporal relation was written to the output file; if not, the parser reset and searched for the next event.
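
The following is a minimal sketch of that pairing pass, written in Java for consistency with the later stages (the actual program was in C); the Phrase type and the "event"/"temporal" labels are assumptions about the parser's intermediate form, not its real data structures.

import java.util.ArrayList;
import java.util.List;

class RelationExtractor {
  // A CASS phrase reduced to its classification and surface text.
  record Phrase(String label, String text) {}

  // Scan the tagged phrases; whenever an event is followed by a temporal
  // keyword and then a second event, emit an (event, keyword, event)
  // triple. Otherwise, reset and search from the next event.
  static List<String[]> pair(List<Phrase> phrases) {
    List<String[]> relations = new ArrayList<>();
    String event = null;    // most recent event, awaiting a keyword
    String keyword = null;  // temporal keyword seen after that event
    for (Phrase p : phrases) {
      if (p.label().equals("temporal") && event != null) {
        keyword = p.text();
      } else if (p.label().equals("event")) {
        if (keyword != null) {
          relations.add(new String[] { event, keyword, p.text() });
          keyword = null;
        }
        event = p.text();  // this event anchors the next search
      }
    }
    return relations;
  }
}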

Conversion into the temporal network format. The output generated by the C parsing program was then handed to a Java program, which read in the data and translated it into the temporal format using the appropriate Allen constraints. The data was then fed into the network, and constraint propagation was performed.
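
Below is one plausible mapping from the six key words onto Allen's interval relations, as a sketch; the exact constraints our translator emitted, and the AllenRelation type itself, are assumptions here rather than our actual code.

import java.util.Map;

enum AllenRelation { BEFORE, AFTER, DURING, MEETS, OVERLAPS }

class ConstraintMapper {
  static final Map<String, AllenRelation> KEYWORDS = Map.of(
      "before",    AllenRelation.BEFORE,    // A entirely precedes B
      "after",     AllenRelation.AFTER,     // A entirely follows B
      "during",    AllenRelation.DURING,    // A lies within B
      "until",     AllenRelation.MEETS,     // A ends where B begins
      "while",     AllenRelation.OVERLAPS,  // A and B share some interval
      "meanwhile", AllenRelation.OVERLAPS
  );

  static AllenRelation relationFor(String keyword) {
    return KEYWORDS.get(keyword.toLowerCase());
  }
}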

Flattening of the temporal network. The next step, and perhaps one of the trickiest, was to take the temporal network and extract the phrases in chronological order; this requires a "flattening" of the temporal network. A complete flattening algorithm is non-trivial; after several attempts, a simpler, recursive greedy algorithm was adopted. The algorithm works as follows:

flattenNetwork(network net) {
  while(there are unprocessed nodes in net) {
    nextNode = first unprocessed node in net;
    place nextNode randomly;
    flag nextNode as having been processed;
    process(nextNode);
  }

  return array of processed nodes;
}

process(node n) {
  scan all relationships of n;
  while(any nodes related to n are unprocessed) {
    newNode = next unprocessed node related to n;
    determine constraint with newNode;
    place newNode appropriately;
    flag newNode as having been processed;
    process(newNode);
  }
}
This algorithm is by no means complete. Since it is greedy, it never goes back to check that earlier constraints are still satisfied as new nodes are placed. Its advantages are that it is straightforward and fast, and that it works quite decently for most practical domains. A sketch of the placement step appears below.
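
The sketch below shows one way the "place newNode appropriately" step might work, reusing the AllenRelation type from the earlier mapping sketch; the list representation and method names are illustrative, not our actual implementation.

import java.util.ArrayList;
import java.util.List;

class Flattener {
  List<String> order = new ArrayList<>();  // nodes in chronological order

  // Greedy placement: position newNode relative to an already-placed
  // neighbor according to their constraint, without ever revisiting the
  // positions of previously placed nodes. 'rel' is newNode's relation
  // with respect to placedNode.
  void place(String placedNode, String newNode, AllenRelation rel) {
    int i = order.indexOf(placedNode);  // placedNode is always present
    switch (rel) {
      case BEFORE -> order.add(i, newNode);      // newNode precedes it
      case AFTER  -> order.add(i + 1, newNode);  // newNode follows it
      default     -> order.add(i + 1, newNode);  // overlap: order arbitrary
    }
  }
}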

Having done that, we simply output the flattened array.

Results

Our results are not precisely quantifiable, because (a) during the course of this project several problems arose that made some of the results difficult to obtain; and (b) it is difficult to define appropriate metrics for these results in general. However, here is an example of the output that was generated (taken from the composite of events in three articles):
--> [0]: 600 sorties over the previous 24 hours
--> [1]: the measure to Clinton by week 's end
--> [2]: 240 bombing runs made
--> [3]: its effort
--> [4]: That would let him wage his fight later
--> [5]: Belgrade embassy attacked prgrph
--> [6]: the first air raid on the Yugoslav capital in three days
--> [7]: President Boris Yeltsin said Russia could quit its role in
--> [8]: Russia
--> [9]: House members argued that
--> [10]: The Chinese embassy in Belgrade was damaged
--> [11]: a NATO airstrike late Friday
--> [12]: Fish and Wildlife officials
--> [13]: blocked a species from being listed as
--> [14]: it  '' prgrph
--> [15]: The embassy was damaged
--> [16]: Kosovo
--> [17]: our bombing campaign goes on
--> [18]: our demands are met
--> [19]: NATO has gone through in
--> [20]: one is going to come
--> [21]: you and
--> [22]: UN administration for the province
--> [23]: Friday 's bombing
--> [24]: were made only hours
--> [25]: the House approved $13 billion last week
--> [26]: the NATO airstrikes
--> [27]: House Republicans balked
--> [28]: anti-violence programs in schools went
--> [29]: more might be added
--> [30]: the measure is finished
--> [31]: a crackdown in February
--> [32]: special police it had deployed in Kosovo
--> [33]: commercial fishing in the waters off Glacier Bay National Park
--> [34]: courts decide the issue
--> [35]: a framework for self-rule within Yugoslavia can be developed
--> [36]: is that
--> [37]: YELTSIN MADE his remarks
--> [38]: a meeting of his Security Council held hours after
--> [39]: the United Nations -- not
--> [40]: Chinese demand UN meeting
Some problems are immediately apparent, and are discussed below.

Problems. It turns out that this particular approach to the problem does not generate the best results, for a number of reasons:

  1. Event naming similarities, or lack thereof. In many situations, especially in the news articles that served as our corpus, the same object is referred to in different ways; for example, there is the "Kosovo conflict" or the "Kosovo massacre" or the "conflict in Kosovo". There are a number of ways of addressing this; what particularly appealed to us at first was to establish equivalency lists (see the sketch after this list). However, such lists have to be created and maintained manually; worse, they grow very large.
    Possible solution: Approaches that might alleviate this problem include using some sort of learning methodology to automatically determine the associations, or using a technique like precision and recall to derive association rules; however, we did not have time to implement these.
  2. Problems with the constraints themselves. Unfortunately, "before" does not always imply that a constraint is present between two objects. Consider the sentence
    Before, Kosovars and Serbs lived in harmony for many years.
    The question that must be asked is: before WHAT? Unfortunately, the answer comes from the context of the situation and assumes the reader's prior knowledge. The temporal constraint network does not alleviate this particular situation. Similar problems arise with words like while and meanwhile:
    Meanwhile, a bomb fell.
    Again, the same problem is encountered.
    Possible solutions: Include context concepts, so that the system is aware of previous knowledge; this is a substantial task in itself.
  3. Lack of constraint terms in general. Even worse is the apparent scarcity of these words in news sources. In our informal corpus, temporal constraints were found in only 30-50% of the articles we extracted, which represent a random sampling of coverage of the war in Kosovo. Words like "before" and "after" are frequently omitted, leaving the ordering implicit rather than explicit. This problem required us to manually filter through articles and pull out the ones that actually contain constraints.
    Possible solution: Use intelligent filtering and retrieval software to select articles that mention constraints explicitly.
  4. Lack of cohesiveness. Articles often touch upon many different subjects, and as a result some of the constraint results look unusual, although they are often correct.
    Possible solution: With larger corpora, at least enough related events should be present.
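
To make the equivalency-list idea from problem 1 concrete, here is a minimal sketch, assuming a hand-maintained map from surface forms to a canonical event name; all names and entries are illustrative. Since every new surface form must be added by hand, such lists grow quickly.

import java.util.Map;

class Equivalencies {
  // Hand-maintained equivalency list: each surface form an article might
  // use maps to one canonical event name. Entries here are illustrative.
  static final Map<String, String> CANONICAL = Map.of(
      "Kosovo conflict",    "kosovo-war",
      "Kosovo massacre",    "kosovo-war",
      "conflict in Kosovo", "kosovo-war"
  );

  static String canonicalize(String phrase) {
    return CANONICAL.getOrDefault(phrase, phrase);
  }
}
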
It's worth noting that, despite the problems, the system does fairly well on listing the constraints in articles.

Conclusions

There are several conclusions we can draw about the system. Allen's temporal networks can usefully be extended to natural-language text: given articles that state their constraints explicitly, the pipeline extracts, propagates, and orders temporal relations with fair success. At the same time, the experiment exposes real limits: inconsistent naming of events, temporal words whose referents are left to context, and the scarcity of explicit constraint words in news prose. Overcoming these will require the learning, context-modeling, and retrieval techniques outlined above.