David Vespe
December 13th, 2004

A Comparison of Papers on Part-of-Speech Taggers

Part-of-speech tagging techniques progressed significantly in the mid-1990s, moving from a focus on statistical techniques, toward rule-based machine learning techniques, and then to combinations of the two. I discuss the following papers below:

Doug Cutting, Julian Kupiec, Jan Pedersen, Penelope Sibun; 1992. A Practical Part-of-Speech Tagger
Hinrich Schütze, Yoram Singer; 1994. Part-of-Speech Tagging Using a Variable Memory Markov Model
Eric Brill; 1995. Transformation-Based Error-Driven Learning and Natural Language Processing: A Case Study in Part-of-Speech Tagging
Eric Brill; 1997. Unsupervised Learning of Disambiguation Rules for Part of Speech Tagging
Eric Brill, J. Wu; 1998. Classifier Combination for Improved Lexical Disambiguation

The Standard Markov Model

(Cutting 1992) presents a basic fixed-length Markov model. Their system uses a fixed-length window of one tag, and uses the frequency of the window preceding a word to predict that word's tag. The system allows for hand-tuned tricks to push the iterative probability assignment in the right direction. I include this paper mainly as a "no-frills" reference point.

The model was repeatedly trained on half the Brown corpus, and the resulting accuracy on the other half of the Brown corpus is 96%. This number is somewhat hard to believe given the inherent problems of tagging and the limits of a single-tag-based tagger. Furthermore, the paper does not discuss the specific types of errors that occurred, which makes it harder to have much confidence in the results.

Variable-Length Markov Models

The contribution of the Schütze paper is to introduce a variable-length Markov model. The idea is to allow the size of the n-gram used for part-of-speech prediction to be small most of the time, but to grow larger when looking at more words increases accuracy.
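As a concrete reference point for both models, here is a sketch of my own (not code from either paper — Cutting et al. actually train their model iteratively on untagged text) of a fixed-length bigram tagger: count tag-to-tag transitions and per-tag word emissions, then decode with Viterbi. The toy corpus, tag names, and the crude add-one smoothing are all illustrative assumptions.

```python
from collections import defaultdict

def train(tagged_sentences):
    """Count tag-bigram transitions and per-tag word emissions."""
    trans = defaultdict(lambda: defaultdict(int))  # trans[prev_tag][tag]
    emit = defaultdict(lambda: defaultdict(int))   # emit[tag][word]
    for sent in tagged_sentences:
        prev = "<s>"
        for word, tag in sent:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            prev = tag
    return trans, emit

def viterbi(words, trans, emit, tags):
    """Most likely tag sequence under the bigram model (crude add-one smoothing)."""
    def p_t(prev, tag):  # transition probability P(tag | prev)
        row = trans[prev]
        return (row[tag] + 1) / (sum(row.values()) + len(tags))
    def p_e(tag, word):  # emission probability P(word | tag)
        row = emit[tag]
        return (row[word] + 1) / (sum(row.values()) + 1000)
    # best[i][t] = (probability of best path ending in tag t, back-pointer)
    best = [{t: (p_t("<s>", t) * p_e(t, words[0]), None) for t in tags}]
    for w in words[1:]:
        layer = {}
        for t in tags:
            prob, back = max(
                (best[-1][p][0] * p_t(p, t) * p_e(t, w), p) for p in tags)
            layer[t] = (prob, back)
        best.append(layer)
    tag = max(tags, key=lambda t: best[-1][t][0])  # best final tag
    path = [tag]
    for layer in reversed(best[1:]):               # follow back-pointers
        tag = layer[tag][1]
        path.append(tag)
    return path[::-1]

corpus = [[("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
          [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")]]
trans, emit = train(corpus)
print(viterbi(["the", "cat", "barks"], trans, emit, ["DET", "NOUN", "VERB"]))
# → ['DET', 'NOUN', 'VERB']
```

The fixed window of one preceding tag is exactly the limitation the variable-length model is designed to relax.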
The goal of this approach is to provide high accuracy without the higher costs associated with models where every n-gram must be the same size. Since the number of cases that must be stored grows very large as the size of the n-gram increases, and since large n-grams result in sparsity and overfitting (because many very long sequences of words are legal but occur infrequently), small n-grams are desirable. However, there will be some cases where looking at many preceding words is necessary to determine part of speech. With this thinking, it follows that a hybrid approach should be more accurate than a fixed n-gram approach.

To choose which longer part-of-speech sequences are worth using in place of shorter ones, Schütze builds trees of part-of-speech sequences. A tree is initially a single node corresponding to one part of speech. When a two-tag sequence yields a significantly different outcome for some set of tags than the single tag does, and when the likelihood of that sequence occurring is "significant", the two-tag sequence is added to the tree. Note that by requiring that the impact of any tree addition be significant in these two ways, the authors are trading accuracy for performance; a more accurate model might be achieved by accepting a greater number of tag sequences while still examining only a small proportion of the overall set of sequence possibilities.

Schütze's results, as with many other papers, include a large number of qualifications that make them hard to compare. The results use sequences of length no more than two, and suggest that there are only 5 two-tag sequences that are "significant" for predicting tags. The accuracy quoted in this paper is 95.81%, just slightly below the 95.97% quoted for a Markov model that always uses two-word sequences for prediction.
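The two-part growth criterion can be sketched as follows. This is my own simplification: I assume a KL-divergence threshold and a raw frequency cutoff stand in for the paper's two "significance" conditions, and I grow contexts only from one tag to two.

```python
import math
from collections import Counter

def next_tag_dist(tag_seq, context):
    """Empirical P(next tag | context) over one long tag sequence."""
    k = len(context)
    counts = Counter(tag_seq[i + k] for i in range(len(tag_seq) - k)
                     if tuple(tag_seq[i:i + k]) == context)
    total = sum(counts.values())
    return ({t: c / total for t, c in counts.items()} if total else {}), total

def grow_contexts(tag_seq, tagset, min_count=5, min_kl=0.1):
    """Keep a two-tag context (t2, t1) only when it is frequent enough AND
    predicts the next tag 'significantly' differently (here: KL divergence)
    than the one-tag context t1 alone."""
    kept = []
    for t1 in tagset:
        p1, n1 = next_tag_dist(tag_seq, (t1,))
        if not n1:
            continue
        for t2 in tagset:
            p2, n2 = next_tag_dist(tag_seq, (t2, t1))
            if n2 < min_count:
                continue  # too rare: extending the context would overfit
            kl = sum(p * math.log(p / p1.get(t, 1e-9)) for t, p in p2.items())
            if kl > min_kl:
                kept.append((t2, t1))
    return kept

# After (A, B) the next tag is always X; after (C, B) it is always Y.
# The one-tag context B alone is ambiguous, so both extensions are kept.
seq = ["A", "B", "X", "C", "B", "Y"] * 10
print(grow_contexts(seq, ["A", "B", "C", "X", "Y"]))
# → [('A', 'B'), ('C', 'B')]
```

Every other two-tag context in this toy sequence predicts exactly what its one-tag suffix already predicts, so none of them pass the divergence test.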
In general, the Schütze approach aims to yield benefits in the "practical" areas of memory and time while using a Markov model to perform tagging. While this contribution may have been highly appropriate for the less powerful computing resources of the early 90's, it is unfortunate that the authors restricted themselves to combinations of no more than two tags in their learning. Had they instead changed their definition of "significant" to allow four- and five-tag combinations to be used in prediction, these might have resolved some common multi-word combinations that show up as errors with the approach they used. The fact that five-word sequences cannot be used in fixed-order Markov models should not have precluded their use in the more efficient variable-length model. (It is possible that the authors investigated this, arrived at a dead end, and did not include that information in their paper.)

Transformation-Based Learning

(Brill 1995) presents a technique for learning transformation rules to tag a corpus. This approach aims to be significantly cheaper to execute than a Markov approach while achieving higher accuracy. The basic idea of the Brill system is to iteratively improve tagging accuracy by developing a set of transformations. The system first uses an initial annotator to tag a training corpus (without using the labels in the corpus). It then looks at the labels to learn general transformation rules (rules are based upon the preceding and following two words and their parts of speech) that can be applied to improve the tag accuracy. The list of transformations is expanded as long as there is an available legal transformation that increases the accuracy on the training set. To tag an unlabeled corpus, the initial annotator is applied first, and then the rules are applied in order to produce the final labeled data. When learning transformations, new transformations are added as long as each new transformation reduces the number of errors.
A greedy approach is used: if there are multiple transformations that all decrease the error rate, the best is selected and applied; the expansions that would result from applying the other transformations at the current level are not investigated. (Note that the other transformations are re-evaluated on the output of applying the best transformation.) When applying transformations, the order of application is important; a later transform may rely on the action of a previous transform to achieve the best result, so changing the order could have a seriously negative impact on accuracy.

Brill compares his approach to decision trees, which generate a list of questions about an entity in order to tag it. He shows that an important difference between his approach and decision trees is that decision trees can perform only one operation in response to the answer to a question, whereas his tagger can perform multiple operations. The decision tree approach also partitions the data into classes and has rules on a per-class basis; this partitioning can create a data sparsity problem.

Brill includes both non-lexicalized and lexicalized versions of his tagger. The difference is that the lexicalized tagger is allowed to form transformation rules out of actual words, whereas the non-lexicalized version may form rules out of parts of speech only. While there is considerable reasoning suggesting the lexicalized tagger should handle many cases better, in the end the lexicalized tagger reduces the error rate by only 6.7% (accuracy goes from 97.0% to 97.2%). Brill suggests that word-class information like that available from WordNet might reduce the data sparseness and thereby improve tagging accuracy.

Brill outlines several other interesting features of this system. For one, the system can perform not just part-of-speech tagging but also bracketing of trees in the corpus.
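The greedy learning loop described above can be sketched in a few lines. This is my own illustration, not Brill's code: I use a single hypothetical rule template ("change tag A to B when the preceding tag is C") rather than his full template set over the surrounding two words and tags, and a toy lexicon as the initial annotator.

```python
def initial_tag(words, lexicon, default="NOUN"):
    """Baseline annotator: each word gets its most frequent tag."""
    return [lexicon.get(w, default) for w in words]

def apply_rule(tags, rule):
    """rule = (from_tag, to_tag, prev_tag): retag wherever the preceding tag matches."""
    frm, to, prev = rule
    out = tags[:]
    for i in range(1, len(out)):
        if out[i] == frm and out[i - 1] == prev:
            out[i] = to
    return out

def learn_rules(words, gold, lexicon, max_rules=10):
    """Greedy loop: repeatedly pick the rule that fixes the most remaining
    errors against the training labels, apply it, and repeat."""
    tags = initial_tag(words, lexicon)
    rules = []
    while len(rules) < max_rules:
        errors = sum(a != b for a, b in zip(tags, gold))
        best, best_gain = None, 0
        for i in range(1, len(words)):
            if tags[i] == gold[i]:
                continue  # propose one candidate rule per remaining error
            cand = (tags[i], gold[i], tags[i - 1])
            gain = errors - sum(a != b for a, b in zip(apply_rule(tags, cand), gold))
            if gain > best_gain:
                best, best_gain = cand, gain
        if best is None:
            break  # no rule reduces the error any further
        rules.append(best)
        tags = apply_rule(tags, best)
    return rules, tags

words = ["the", "can", "rusts", "I", "can", "swim"]
gold = ["DET", "NOUN", "VERB", "PRON", "MODAL", "VERB"]
lexicon = {"the": "DET", "can": "MODAL", "rusts": "VERB", "I": "PRON", "swim": "VERB"}
rules, tags = learn_rules(words, gold, lexicon)
print(rules)  # → [('MODAL', 'NOUN', 'DET')]
print(tags == gold)  # → True
```

Note how the learned rule fixes "can" after "the" without disturbing the correctly tagged modal "can" after "I" — the context condition is what keeps each greedy step from undoing earlier gains.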
Also, the system will not tag a word as a particular part of speech if the word appears in the training corpus but never takes the target part of speech in the training data. Another way the Brill system is robust is that it examines the first and last four letters of a word to learn part-of-speech clues that can be applied to unknown words.

The results that Brill presents show considerable improvement over previous approaches. The overall accuracy -- including all the features and making minimal assumptions -- is 96.6%. The results also show that the accuracy of the transformation model asymptotically approaches a number just below 97%.

One outstanding question unaddressed by this paper is the use of the greedy algorithm when choosing a transformation path. It is conceivable that a particular transformation at an early stage might yield an intermediate result that does not minimize tagging errors, but that would allow a later rule to yield a significantly higher benefit. A full state search of a transformation tree with a depth of ~400 and a branching factor significantly greater than 2 would be computationally prohibitive. But one reasonably cheap alternative would be, once a set of transformations has been chosen, to try reordering adjacent or related transformations to see whether a better overall result is achieved. Perhaps better still would be, for a given set of transformations found with the greedy approach, to find pairs of them that combine well, and to build and evaluate alternate orderings from these.

Unsupervised Learning of Disambiguation Rules

(Brill 1997) builds on the transformation-based learning approach to tagging. It introduces an approach that uses a dictionary of allowable parts of speech for each word but does not need a tagged corpus. The system uses an untagged corpus to build rules, in essence bootstrapping off words which have only one part of speech to form transformations for words with many parts of speech.
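Returning for a moment to the unknown-word handling above: affix-clue learning can be sketched as a frequency table over word prefixes and suffixes of up to four letters. This is my own illustration under loose assumptions — Brill actually learns such clues as transformation rules, not as the lookup table shown here, and the counts and cutoffs are invented.

```python
from collections import Counter, defaultdict

def learn_affix_clues(lexicon_counts, max_len=4, min_count=3):
    """Map word prefixes/suffixes (up to four letters) to their most common
    tag, so that unknown words can be guessed from their shape.
    lexicon_counts: {(word, tag): frequency}."""
    clues = defaultdict(Counter)
    for (word, tag), n in lexicon_counts.items():
        for k in range(1, min(max_len, len(word)) + 1):
            clues[("suffix", word[-k:])][tag] += n
            clues[("prefix", word[:k])][tag] += n
    # Keep only affixes seen often enough to be trustworthy.
    return {key: counts.most_common(1)[0][0]
            for key, counts in clues.items()
            if sum(counts.values()) >= min_count}

def guess_tag(word, clues, default="NOUN"):
    """Prefer the longest matching suffix clue, then prefix, then a default."""
    for k in range(4, 0, -1):
        hit = clues.get(("suffix", word[-k:]))
        if hit:
            return hit
    for k in range(4, 0, -1):
        hit = clues.get(("prefix", word[:k]))
        if hit:
            return hit
    return default

clues = learn_affix_clues({("running", "VERB"): 5, ("jumping", "VERB"): 4,
                           ("walking", "VERB"): 6, ("king", "NOUN"): 2})
print(guess_tag("flying", clues))  # → VERB (via the "-ing" suffix clue)
```

The same idea generalizes to capitalization and hyphenation cues, which Brill's rule templates also cover.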
The tagger works by initially iterating over the corpus, assigning to each word all of its valid parts of speech. The tagger must then choose one tag for each word with multiple tags. By composing rules as in (Brill 1995), and applying only those rules which reduce the error, a set of transformations is found. The notion of reducing error is not straightforward, since the corpus is untagged and there is thus no gold standard against which to decide which tags are correct. Instead, Brill defines a metric which positively weights a given tag based on unambiguous occurrences of that tag in similar contexts in the training data, and negatively weights the frequency with which the second-most-likely tag occurs in that context.

The results of this approach are accuracies in the range of 95%-96%. The main benefit is that the approach does not require hand-labeling of the corpus, so it is easy to get many orders of magnitude more training data. In addition, Brill suggests that this approach can be used with a large untagged corpus and a comparatively small tagged corpus together to yield even better results.

Classifier Combinations

(Brill 1998) compares four taggers: unigram, trigram, transformation-based, and maximum-entropy. In particular, the authors examine the occasions when some taggers perform well and some do not. They note that when taggers disagree, some are more consistently correct than others. They explore algorithms for combining tagger output, including a majority-rule voting method as well as a more complicated contextual learning approach. One instance of the contextual learning system learns to estimate the most likely tag given the suggested tags from each tagger; the other uses context to choose which tagger to listen to. These approaches decrease the error rate by up to 10.4%, which suggests that they produce better tagging than any one of the approaches on its own.
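The simplest of these combination schemes, majority voting, is only a few lines. The sketch below is my own; the tagger names and the optional per-tagger weights (a crude stand-in for "some taggers are more consistently correct than others") are illustrative assumptions.

```python
from collections import Counter

def combine_by_vote(tagger_outputs, weights=None):
    """For each token, pick the tag proposed by the most taggers;
    optional weights let more reliable taggers count for more."""
    weights = weights or [1] * len(tagger_outputs)
    combined = []
    for proposals in zip(*tagger_outputs):  # one tuple of candidate tags per token
        votes = Counter()
        for tag, w in zip(proposals, weights):
            votes[tag] += w
        combined.append(votes.most_common(1)[0][0])
    return combined

unigram = ["DET", "NOUN", "VERB"]
trigram = ["DET", "VERB", "VERB"]
brill   = ["DET", "NOUN", "NOUN"]
print(combine_by_vote([unigram, trigram, brill]))
# → ['DET', 'NOUN', 'VERB']
```

The contextual-learning combiners in the paper go further, conditioning the choice on the surrounding words rather than on raw vote counts alone.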
The suggested future work is to add more taggers and to improve the tagger-selection learning algorithm to decrease the error further. This method of corpus tagging might be extended by looking at the several most likely tags from each tagger, rather than taking only the best one. By looking at multiple tags, cases where one tagger eliminated a tag from consideration might be applied to other taggers in a form of negative voting (e.g. "which of the candidate tags can tagger X eliminate?").

Conclusions

A simple most-common-tag approach can yield an accuracy of roughly 93% (Brill 1998). Thus the challenge for any tagger is to improve upon the 7% of cases where the most common tag is not correct. Furthermore, any solution's benefits must be weighed against its costs; a tagger which requires extensive training or a lot of computing time and delivers only a slight improvement over the most-common-tag approach is probably not worth using.

The Markov technique reduces this error significantly by looking at a relatively small amount of context. However, since Markov models are generally constrained to consider only one or two tags of context, and only tags occurring before the given word, they still frequently mislabel words. Markov models also do not scale to larger contexts, because of computational complexity and sparseness of data, so in the end their application is limited.

The Brill approaches make more efficient use of context, and can therefore consider more of it. In addition, the transformation approach is conceptually simpler, because a single Brill transformation can encompass the knowledge of a huge number of Markov probabilities in a single rule. By considering more context, Brill's tagger is able to produce tagged text with fewer errors than a more computationally expensive Markov model.
Brill's final idea of combining taggers shows that since some taggers function more accurately in specific situations, machine learning can be used to improve overall accuracy by looking at the suggestions of multiple taggers.
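For concreteness, the roughly-93% most-common-tag baseline that every system above must beat can be sketched in a few lines (my own sketch; the toy corpus is illustrative):

```python
from collections import Counter, defaultdict

def train_baseline(tagged_sentences):
    """Most-common-tag baseline: remember each word's most frequent tag,
    falling back to the corpus-wide most frequent tag for unknown words."""
    counts = defaultdict(Counter)
    overall = Counter()
    for sent in tagged_sentences:
        for word, tag in sent:
            counts[word][tag] += 1
            overall[tag] += 1
    lexicon = {w: c.most_common(1)[0][0] for w, c in counts.items()}
    return lexicon, overall.most_common(1)[0][0]

def tag_baseline(words, lexicon, default):
    """Tag each word with its stored most common tag, or the default."""
    return [lexicon.get(w, default) for w in words]

corpus = [[("the", "DET"), ("can", "NOUN")],
          [("can", "MODAL"), ("can", "NOUN")]]
lexicon, default = train_baseline(corpus)
print(tag_baseline(["the", "can", "zebra"], lexicon, default))
# → ['DET', 'NOUN', 'NOUN']
```

Everything the surveyed papers add — Markov context, learned transformations, classifier combination — is an attempt to recover the residual errors this trivial lookup leaves behind.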