David Vespe
December 13th, 2004

A Comparison of Papers on Part-of-Speech Taggers

Part-of-speech tagging techniques progressed significantly in the mid-1990s, moving from a focus on statistical techniques, toward rule-based machine learning techniques, and then to combinations of the two. I discuss the following papers below:

Doug Cutting, Julian Kupiec, Jan Pedersen, Penelope Sibun; 1992. A Practical Part-of-Speech Tagger
Hinrich Schütze, Yoram Singer; 1994. Part-of-Speech Tagging Using a Variable Memory Markov Model
Eric Brill; 1995. Transformation-Based Error-Driven Learning and Natural Language Processing: A Case Study in Part-of-Speech Tagging
Eric Brill; 1997. Unsupervised Learning of Disambiguation Rules for Part of Speech Tagging
Eric Brill, J. Wu; 1998. Classifier Combination for Improved Lexical Disambiguation

The Standard Markov Model

(Cutting 1992) presents a basic fixed-length Markov model. Their system uses a fixed-length window of one tag, and uses the frequency of the window preceding a word to predict that word's tag. The system allows for hand-tuned tricks to push the iterative probability assignment in the right direction. I include this paper mainly as a "no-frills" reference point.

The model was repeatedly trained on half the Brown corpus, and the resulting accuracy on the other half of the Brown corpus is 96%. This number is somewhat hard to believe given the inherent problems of tagging and the limits of a single-tag-based tagger. Furthermore, the paper does not discuss the specific types of errors that occurred, which makes it harder to have much confidence in the results.

Variable-Length Markov Models

The contribution of the Schütze paper is to introduce a variable-length Markov model. The idea is to allow the size of the n-gram used for part-of-speech prediction to be small most of the time, but to grow larger when looking at more words increases accuracy.
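As a concrete reference point for both models, here is a sketch of my own (not code from either paper — Cutting et al. actually train their model iteratively on untagged text) of a fixed-length bigram tagger: count tag-to-tag transitions and per-tag word emissions, then decode with Viterbi. The toy corpus, tag names, and the crude add-one smoothing are all illustrative assumptions.

```python
from collections import defaultdict

def train(tagged_sentences):
    """Count tag-bigram transitions and per-tag word emissions."""
    trans = defaultdict(lambda: defaultdict(int))  # trans[prev_tag][tag]
    emit = defaultdict(lambda: defaultdict(int))   # emit[tag][word]
    for sent in tagged_sentences:
        prev = "<s>"
        for word, tag in sent:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            prev = tag
    return trans, emit

def viterbi(words, trans, emit, tags):
    """Most likely tag sequence under the bigram model (crude add-one smoothing)."""
    def p_t(prev, tag):  # transition probability P(tag | prev)
        row = trans[prev]
        return (row[tag] + 1) / (sum(row.values()) + len(tags))
    def p_e(tag, word):  # emission probability P(word | tag)
        row = emit[tag]
        return (row[word] + 1) / (sum(row.values()) + 1000)
    # best[i][t] = (probability of best path ending in tag t, back-pointer)
    best = [{t: (p_t("<s>", t) * p_e(t, words[0]), None) for t in tags}]
    for w in words[1:]:
        layer = {}
        for t in tags:
            prob, back = max(
                (best[-1][p][0] * p_t(p, t) * p_e(t, w), p) for p in tags)
            layer[t] = (prob, back)
        best.append(layer)
    tag = max(tags, key=lambda t: best[-1][t][0])  # best final tag
    path = [tag]
    for layer in reversed(best[1:]):               # follow back-pointers
        tag = layer[tag][1]
        path.append(tag)
    return path[::-1]

corpus = [[("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
          [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")]]
trans, emit = train(corpus)
print(viterbi(["the", "cat", "barks"], trans, emit, ["DET", "NOUN", "VERB"]))
# → ['DET', 'NOUN', 'VERB']
```

The fixed window of one preceding tag is exactly the limitation the variable-length model is designed to relax.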
The goal of this approach is to provide high accuracy without the higher costs associated with models where every n-gram must be the same size. Since the number of cases that must be stored grows very large as the size of the n-gram increases, and since large n-grams result in sparsity and overfitting (because many very long sequences of words are legal but occur infrequently), small n-grams are desirable. However, there will be some cases where looking at many preceding words is necessary to determine part of speech. With this thinking, it follows that a hybrid approach should be more accurate than a fixed n-gram approach.

To choose which longer part-of-speech sequences are worth using in place of shorter ones, Schütze builds trees of part-of-speech sequences. A tree is initially a single node corresponding to one part of speech. When a two-tag sequence yields a significantly different outcome for some set of tags than the single tag does, and when the likelihood of that sequence occurring is "significant", the two-tag sequence is added to the tree. Note that by requiring that the impact of any tree addition be significant in these two ways, the authors are trading accuracy for performance; a more accurate model might be achieved by accepting a greater number of tag sequences while still examining only a small proportion of the overall set of sequence possibilities.

Schütze's results, as with many other papers, include a large number of qualifications that make them hard to compare. The results use sequences of length no more than two, and suggest that there are only 5 two-tag sequences that are "significant" for predicting tags. The accuracy quoted in this paper is 95.81%, just slightly below the 95.97% quoted for a Markov model that always uses two-word sequences for prediction.
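The two-part growth criterion can be sketched as follows. This is my own simplification: I assume a KL-divergence threshold and a raw frequency cutoff stand in for the paper's two "significance" conditions, and I grow contexts only from one tag to two.

```python
import math
from collections import Counter

def next_tag_dist(tag_seq, context):
    """Empirical P(next tag | context) over one long tag sequence."""
    k = len(context)
    counts = Counter(tag_seq[i + k] for i in range(len(tag_seq) - k)
                     if tuple(tag_seq[i:i + k]) == context)
    total = sum(counts.values())
    return ({t: c / total for t, c in counts.items()} if total else {}), total

def grow_contexts(tag_seq, tagset, min_count=5, min_kl=0.1):
    """Keep a two-tag context (t2, t1) only when it is frequent enough AND
    predicts the next tag 'significantly' differently (here: KL divergence)
    than the one-tag context t1 alone."""
    kept = []
    for t1 in tagset:
        p1, n1 = next_tag_dist(tag_seq, (t1,))
        if not n1:
            continue
        for t2 in tagset:
            p2, n2 = next_tag_dist(tag_seq, (t2, t1))
            if n2 < min_count:
                continue  # too rare: extending the context would overfit
            kl = sum(p * math.log(p / p1.get(t, 1e-9)) for t, p in p2.items())
            if kl > min_kl:
                kept.append((t2, t1))
    return kept

# After (A, B) the next tag is always X; after (C, B) it is always Y.
# The one-tag context B alone is ambiguous, so both extensions are kept.
seq = ["A", "B", "X", "C", "B", "Y"] * 10
print(grow_contexts(seq, ["A", "B", "C", "X", "Y"]))
# → [('A', 'B'), ('C', 'B')]
```

Every other two-tag context in this toy sequence predicts exactly what its one-tag suffix already predicts, so none of them pass the divergence test.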
In general, the Schütze approach aims to yield benefits in the "practical" areas of memory and time while using a Markov model to perform tagging. While this contribution may have been highly appropriate for the less powerful computing resources of the early 90's, it is unfortunate that the authors restricted themselves to combinations of no more than two tags in their learning. Had they instead changed their definition of "significant" to allow four- and five-tag combinations to be used in prediction, these might have resolved some common multi-word combinations that show up as errors with the approach they used. The fact that five-word sequences cannot be used in fixed-order Markov models should not have precluded their use in the more efficient variable-length model. (It is possible that the authors investigated this, arrived at a dead end, and did not include that information in their paper.)

Transformation-Based Learning

(Brill 1995) presents a technique for learning transformation rules to tag a corpus. This approach aims to be significantly cheaper to execute than a Markov approach while achieving higher accuracy. The basic idea of the Brill system is to iteratively improve tagging accuracy by developing a set of transformations. The system first uses an initial annotator to tag a training corpus (without using the labels in the corpus). It then looks at the labels to learn general transformation rules (rules are based upon the preceding and following two words and their parts of speech) that can be applied to improve the tag accuracy. The list of transformations is expanded as long as there is an available legal transformation that increases the accuracy on the training set. To tag an unlabeled corpus, the initial annotator is applied first, and then the rules are applied in order to produce the final labeled data. When learning transformations, new transformations are added as long as each new transformation reduces the number of errors.
A greedy approach is used: if there are multiple transformations that all decrease the error rate, the best is selected and applied; the expansions that would result from applying the other transformations at the current level are not investigated. (Note that the other transformations are re-evaluated on the output of applying the best transformation.) When applying transformations, the order of application is important; a later transform may rely on the action of a previous transform to achieve the best result, so changing the order could have a seriously negative impact on accuracy.

Brill compares his approach to decision trees, which generate a list of questions about an entity in order to tag it. He shows that an important difference between his approach and decision trees is that decision trees can perform only one operation in response to the answer to a question, whereas his tagger can perform multiple operations. The decision tree approach also partitions the data into classes and has rules on a per-class basis; this partitioning can create a data sparsity problem.

Brill includes both non-lexicalized and lexicalized versions of his tagger. The difference is that the lexicalized tagger is allowed to form transformation rules out of actual words, whereas the non-lexicalized version may form rules out of parts of speech only. While there is considerable reasoning suggesting the lexicalized tagger should handle many cases better, in the end the lexicalized tagger reduces the error rate by only 6.7% (accuracy goes from 97.0% to 97.2%). Brill suggests that word-class information like that available from WordNet might reduce the data sparseness and thereby improve tagging accuracy.

Brill outlines several other interesting features of this system. For one, the system can perform not just part-of-speech tagging but also bracketing of trees in the corpus.
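The greedy learning loop described above can be sketched in a few lines. This is my own illustration, not Brill's code: I use a single hypothetical rule template ("change tag A to B when the preceding tag is C") rather than his full template set over the surrounding two words and tags, and a toy lexicon as the initial annotator.

```python
def initial_tag(words, lexicon, default="NOUN"):
    """Baseline annotator: each word gets its most frequent tag."""
    return [lexicon.get(w, default) for w in words]

def apply_rule(tags, rule):
    """rule = (from_tag, to_tag, prev_tag): retag wherever the preceding tag matches."""
    frm, to, prev = rule
    out = tags[:]
    for i in range(1, len(out)):
        if out[i] == frm and out[i - 1] == prev:
            out[i] = to
    return out

def learn_rules(words, gold, lexicon, max_rules=10):
    """Greedy loop: repeatedly pick the rule that fixes the most remaining
    errors against the training labels, apply it, and repeat."""
    tags = initial_tag(words, lexicon)
    rules = []
    while len(rules) < max_rules:
        errors = sum(a != b for a, b in zip(tags, gold))
        best, best_gain = None, 0
        for i in range(1, len(words)):
            if tags[i] == gold[i]:
                continue  # propose one candidate rule per remaining error
            cand = (tags[i], gold[i], tags[i - 1])
            gain = errors - sum(a != b for a, b in zip(apply_rule(tags, cand), gold))
            if gain > best_gain:
                best, best_gain = cand, gain
        if best is None:
            break  # no rule reduces the error any further
        rules.append(best)
        tags = apply_rule(tags, best)
    return rules, tags

words = ["the", "can", "rusts", "I", "can", "swim"]
gold = ["DET", "NOUN", "VERB", "PRON", "MODAL", "VERB"]
lexicon = {"the": "DET", "can": "MODAL", "rusts": "VERB", "I": "PRON", "swim": "VERB"}
rules, tags = learn_rules(words, gold, lexicon)
print(rules)  # → [('MODAL', 'NOUN', 'DET')]
print(tags == gold)  # → True
```

Note how the learned rule fixes "can" after "the" without disturbing the correctly tagged modal "can" after "I" — the context condition is what keeps each greedy step from undoing earlier gains.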
Also, the system will not tag a word as a particular part of speech if the word appears in the training corpus but never takes the target part of speech in the training data. Another way the Brill system is robust is that it examines the first and last four letters of a word to learn part-of-speech clues that can be applied to unknown words.

The results that Brill presents show considerable improvement over previous approaches. The overall accuracy -- including all the features and making minimal assumptions -- is 96.6%. The results also show that the accuracy of the transformation model asymptotically approaches a number just below 97%.

One outstanding question unaddressed by this paper is the use of the greedy algorithm when choosing a transformation path. It is conceivable that a particular transformation at an early stage might yield an intermediate result that does not minimize tagging errors, but that would allow a later rule to yield a significantly higher benefit. A full state search of a transformation tree with a depth of ~400 and a branching factor significantly greater than 2 would be computationally prohibitive. But one reasonably cheap alternative would be, once a set of transformations has been chosen, to try reordering adjacent or related transformations to see whether a better overall result is achieved. Perhaps better still would be, for a given set of transformations found with the greedy approach, to find pairs of them that combine well, and to build and evaluate alternate orderings from these.

Unsupervised Learning of Disambiguation Rules

(Brill 1997) builds on the transformation-based learning approach to tagging. It introduces an approach that uses a dictionary of allowable parts of speech for each word but does not need a tagged corpus. The system uses an untagged corpus to build rules, in essence bootstrapping off words which have only one part of speech to form transformations for words with many parts of speech.
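Returning for a moment to the unknown-word handling above: affix-clue learning can be sketched as a frequency table over word prefixes and suffixes of up to four letters. This is my own illustration under loose assumptions — Brill actually learns such clues as transformation rules, not as the lookup table shown here, and the counts and cutoffs are invented.

```python
from collections import Counter, defaultdict

def learn_affix_clues(lexicon_counts, max_len=4, min_count=3):
    """Map word prefixes/suffixes (up to four letters) to their most common
    tag, so that unknown words can be guessed from their shape.
    lexicon_counts: {(word, tag): frequency}."""
    clues = defaultdict(Counter)
    for (word, tag), n in lexicon_counts.items():
        for k in range(1, min(max_len, len(word)) + 1):
            clues[("suffix", word[-k:])][tag] += n
            clues[("prefix", word[:k])][tag] += n
    # Keep only affixes seen often enough to be trustworthy.
    return {key: counts.most_common(1)[0][0]
            for key, counts in clues.items()
            if sum(counts.values()) >= min_count}

def guess_tag(word, clues, default="NOUN"):
    """Prefer the longest matching suffix clue, then prefix, then a default."""
    for k in range(4, 0, -1):
        hit = clues.get(("suffix", word[-k:]))
        if hit:
            return hit
    for k in range(4, 0, -1):
        hit = clues.get(("prefix", word[:k]))
        if hit:
            return hit
    return default

clues = learn_affix_clues({("running", "VERB"): 5, ("jumping", "VERB"): 4,
                           ("walking", "VERB"): 6, ("king", "NOUN"): 2})
print(guess_tag("flying", clues))  # → VERB (via the "-ing" suffix clue)
```

The same idea generalizes to capitalization and hyphenation cues, which Brill's rule templates also cover.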
The tagger works by initially iterating over the corpus, assigning to each word all of its valid parts of speech. The tagger must then choose one tag for each word with multiple tags. By composing rules as in (Brill 1995), and applying only those rules which reduce the error, a set of transformations is found. The notion of reducing error is not straightforward, since the corpus is untagged and there is thus no gold standard against which to decide which tags are correct. Instead, Brill defines a metric which positively weights a given tag based on unambiguous occurrences of that tag in similar contexts in the training data, and negatively weights the frequency with which the second-most-likely tag occurs in that context.

The results of this approach are accuracies in the range of 95%-96%. The main benefit is that the approach does not require hand-labeling of the corpus, so it is easy to get many orders of magnitude more training data. In addition, Brill suggests that this approach can be used with a large untagged corpus and a comparatively small tagged corpus together to yield even better results.

Classifier Combinations

(Brill 1998) compares four taggers: unigram, trigram, transformation-based, and maximum-entropy. In particular, the authors examine the occasions when some taggers perform well and some do not. They note that when taggers disagree, some are more consistently correct than others. They explore algorithms for combining tagger output, including a majority-rule voting method as well as a more complicated contextual learning approach. One instance of the contextual learning system learns to estimate the most likely tag given the suggested tags from each tagger; the other uses context to choose which tagger to listen to. These approaches decrease the error rate by up to 10.4%, which suggests that they produce better tagging than any one of the approaches on its own.
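The simplest of these combination schemes, majority voting, is only a few lines. The sketch below is my own; the tagger names and the optional per-tagger weights (a crude stand-in for "some taggers are more consistently correct than others") are illustrative assumptions.

```python
from collections import Counter

def combine_by_vote(tagger_outputs, weights=None):
    """For each token, pick the tag proposed by the most taggers;
    optional weights let more reliable taggers count for more."""
    weights = weights or [1] * len(tagger_outputs)
    combined = []
    for proposals in zip(*tagger_outputs):  # one tuple of candidate tags per token
        votes = Counter()
        for tag, w in zip(proposals, weights):
            votes[tag] += w
        combined.append(votes.most_common(1)[0][0])
    return combined

unigram = ["DET", "NOUN", "VERB"]
trigram = ["DET", "VERB", "VERB"]
brill   = ["DET", "NOUN", "NOUN"]
print(combine_by_vote([unigram, trigram, brill]))
# → ['DET', 'NOUN', 'VERB']
```

The contextual-learning combiners in the paper go further, conditioning the choice on the surrounding words rather than on raw vote counts alone.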
The suggested future work is to add more taggers and to improve the tagger-selection learning algorithm to decrease the error further. This method of corpus tagging might be extended by looking at the several most likely tags from each tagger, rather than taking only the best one. By looking at multiple tags, cases where one tagger eliminated a tag from consideration might be applied to other taggers in a form of negative voting (e.g. "which of the candidate tags can tagger X eliminate?").

Conclusions

A simple most-common-tag approach can yield an accuracy of roughly 93% (Brill 1998). Thus the challenge for any tagger is to improve upon the 7% of cases where the most common tag is not correct. Furthermore, any solution's benefits must be weighed against its costs; a tagger which requires extensive training or a lot of computing time and delivers only a slight improvement over the most-common-tag approach is probably not worth using.

The Markov technique reduces this error significantly by looking at a relatively small amount of context. However, since Markov models are generally constrained to consider only one or two tags of context, and only tags occurring before the given word, they still frequently mislabel words. Markov models also do not scale to larger contexts, because of computational complexity and sparseness of data, so in the end their application is limited.

The Brill approaches make more efficient use of context, and can therefore consider more of it. In addition, the transformation approach is conceptually simpler, because a single Brill transformation can encompass the knowledge of a huge number of Markov probabilities in a single rule. By considering more context, Brill's tagger is able to produce tagged text with fewer errors than a more computationally expensive Markov model.
Brill's final idea of combining taggers shows that since some taggers function more accurately in specific situations, machine learning can be used to improve overall accuracy by looking at the suggestions of multiple taggers.
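For concreteness, the roughly-93% most-common-tag baseline that every system above must beat can be sketched in a few lines (my own sketch; the toy corpus is illustrative):

```python
from collections import Counter, defaultdict

def train_baseline(tagged_sentences):
    """Most-common-tag baseline: remember each word's most frequent tag,
    falling back to the corpus-wide most frequent tag for unknown words."""
    counts = defaultdict(Counter)
    overall = Counter()
    for sent in tagged_sentences:
        for word, tag in sent:
            counts[word][tag] += 1
            overall[tag] += 1
    lexicon = {w: c.most_common(1)[0][0] for w, c in counts.items()}
    return lexicon, overall.most_common(1)[0][0]

def tag_baseline(words, lexicon, default):
    """Tag each word with its stored most common tag, or the default."""
    return [lexicon.get(w, default) for w in words]

corpus = [[("the", "DET"), ("can", "NOUN")],
          [("can", "MODAL"), ("can", "NOUN")]]
lexicon, default = train_baseline(corpus)
print(tag_baseline(["the", "can", "zebra"], lexicon, default))
# → ['DET', 'NOUN', 'NOUN']
```

Everything the surveyed papers add — Markov context, learned transformations, classifier combination — is an attempt to recover the residual errors this trivial lookup leaves behind.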