Homework 2: Corpus Statistics
Due: 30 Oct 2003
Points: 133
For this homework, you will clean up and collect various statistics on
a corpus in preparation for running a machine learning experiment on
topic classification in Homework 3. As you work through this homework,
think about what features of the texts might provide useful cues for
distinguishing texts on topic A from texts on topic B (e.g.
characteristic bigrams, document length, etc.).
The corpus you will be working with is a subset of the TDT (Topic
Detection and Tracking) corpus. This part of the corpus consists of
broadcast news transcripts collected from CNN in 1994-1995. The
corpus is a single text file formatted in SGML. Within the corpus
file, each document is wrapped in a document tag and has three
fields: a document ID, a topic, and the text. The ID field (TDTID)
gives the document number, and the other two are self-explanatory.
Here is an example (shown with the SGML tags omitted):
TDT000001
Topic1
Blah blah.
Blah.

TDT000210
Topic2
Blah blah.
Blah.
The corpus has been divided into three subsets: a training set, a
development test set, and a validation test set. For this homework,
you will have access only to the training set:
/proj/nlp/users/cs4705/train.sgml
*************************************************************************
1. (54 pts.) Corpus clean-up and End of Sentence (EOS) detection.
   This is important for getting good corpus statistics, so that you
   do not, e.g., count 'today' and 'today.' as different word types,
   and so that 'case-study' and 'case study' are treated the same
   when counting ngrams.
a. EOS Detection: Mark questions (sentences ending in '?') with
_SQ_, statements (sentences ending in '.', ';' or ':') with
_SS_, and exclamations (!) with _SE_. For example:
'Hi. How are you? I am fine.'
would expand to:
'Hi _SS_ How are you _SQ_ I am fine _SS_'
For this task you will have to distinguish true sentence endings
from things that may look like them, such as abbreviations.
Note that abbreviations may appear at the end of a sentence as
well as sentence-internally, e.g. 'He lives at 113 W. 15th St.'
i. List the counts of each of the three sentence types in the
training corpus.
ii. List all of the abbreviations in the corpus that end with
a period.
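The marking in (a) can be sketched as a single pass over whitespace
tokens. A minimal sketch follows; the abbreviation set is a toy
placeholder for the list you compile in (ii), and the rule for an
abbreviation that also ends a sentence (here: only at the very end of
the text) is deliberately simplistic:

```python
# Sketch of EOS marking (part 1.a). ABBREVS is a toy placeholder;
# in practice, use the abbreviation list you extract for 1.a.ii.
ABBREVS = {"mr.", "dr.", "st.", "w.", "u.s."}

MARKERS = {".": "_SS_", ";": "_SS_", ":": "_SS_", "?": "_SQ_", "!": "_SE_"}

def mark_eos(text):
    out = []
    tokens = text.split()
    for i, tok in enumerate(tokens):
        last = tok[-1]
        if last in MARKERS:
            # A period after a known abbreviation is not an EOS, unless
            # the token ends the whole text (a crude stand-in for a
            # real sentence-final abbreviation rule).
            if last == "." and tok.lower() in ABBREVS and i != len(tokens) - 1:
                out.append(tok)
            else:
                out.append(tok[:-1])
                out.append(MARKERS[last])
        else:
            out.append(tok)
    return " ".join(out)
```

On the example above, mark_eos("Hi. How are you? I am fine.") yields
'Hi _SS_ How are you _SQ_ I am fine _SS_'.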
b. Remove all word-internal punctuation except for periods
   associated with abbreviations (e.g. replace '-' with ' '; remove
   [",'()[]{}/<>|\] and any other punctuation).
c. Lower-case all words (e.g. replace 'Street' with 'street', 'ABC'
with 'abc'). This is important for (2) and (3).
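Steps (b) and (c) together amount to a per-token normalization. A
minimal sketch, assuming a small placeholder abbreviation set standing
in for the list built in (a):

```python
# Sketch of steps 1.b and 1.c: lower-case and strip word-internal
# punctuation, keeping the trailing period of known abbreviations.
PUNCT = set("\"'(),[]{}/<>|\\.;:?!")

def clean_token(tok, abbrevs=frozenset({"st.", "w."})):
    tok = tok.lower()
    keep_dot = tok in abbrevs          # preserve abbreviation periods
    core = tok[:-1] if keep_dot else tok
    core = core.replace("-", " ")      # hyphenated words become two words
    core = "".join(ch for ch in core if ch not in PUNCT)
    return core + ("." if keep_dot else "")
```

For instance, clean_token("case-study") gives 'case study' and
clean_token("ABC") gives 'abc'.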
2. (25 pts.) General Corpus Statistics. List the following for the
corpus:
a. The number of documents.
b. The topics (as indicated by each document's topic field) as
   well as the number of documents for each topic.
c. The TDTID (document id number) of the longest and shortest
documents (measured in words) along with the number of words
they contain (Here, just use white space to delineate words,
since you will have already eliminated hyphenated words, e.g.,
in 1.b.).
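Assuming the file has been parsed into (TDTID, topic, text) triples
(the parsing itself depends on the exact SGML tags), the statistics in
(a)-(c) reduce to a few lines:

```python
from collections import Counter

def corpus_stats(docs):
    """docs: list of (tdtid, topic, text) triples parsed from the SGML
    file. Returns the document count, per-topic counts, and the longest
    and shortest documents as (word_count, tdtid) pairs, where words
    are whitespace-delimited."""
    topic_counts = Counter(topic for _, topic, _ in docs)
    lengths = [(len(text.split()), tdtid) for tdtid, _, text in docs]
    return len(docs), topic_counts, max(lengths), min(lengths)
```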
3. (54 pts.) Ngram statistics. Collect unigram and bigram frequencies
for the corpus (no smoothing). For unigrams this would be c(x)/N
and for bigrams c(x1,x2)/N where c(x) is the count of token x,
c(x1,x2) is the count of the bigram 'x1 x2', and N is the total
number of tokens. These should be calculated using the corpus you
have 'cleaned up' in (1), with all words lower-cased and
word-internal punctuation removed.
a. List the number of unigram and bigram types in the corpus.
b. List the number of unigram and bigram tokens in the corpus.
c. Calculate unigram and bigram frequencies over the entire
corpus. List the 50 most frequent unigrams and bigrams in the
corpus excluding the function words in:
/proj/nlp/users/cs4705/stoplist.txt
Also print out their frequencies. Present in order of
descending frequency.
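A sketch of the relative-frequency computation and the
stoplist-filtered top-50. One open choice is made here, and it is an
assumption rather than something the assignment specifies: a bigram is
excluded if either of its words is a stop word.

```python
from collections import Counter

def ngram_freqs(tokens):
    """Unsmoothed relative frequencies: c(x)/N for unigrams and
    c(x1,x2)/N for bigrams, with N the total number of tokens."""
    n = len(tokens)
    uni = {w: c / n for w, c in Counter(tokens).items()}
    bi = {b: c / n for b, c in Counter(zip(tokens, tokens[1:])).items()}
    return uni, bi

def top_k(freqs, stoplist, k=50):
    """Top-k ngrams by descending frequency, excluding any ngram
    containing a stop word (one reasonable reading of the exclusion)."""
    def keep(ngram):
        words = ngram if isinstance(ngram, tuple) else (ngram,)
        return not any(w in stoplist for w in words)
    kept = [(f, ng) for ng, f in freqs.items() if keep(ng)]
    return sorted(kept, reverse=True)[:k]
```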
d. Do the following for each topic. Calculate the unigram and
bigram frequencies. Take the top 50 unigrams and bigrams,
excluding those in the stop list.
Now that you have these lists, for each topic list those
unigrams and bigrams that do not occur in any of the other
topics' top 50 lists.
Also provide two frequency counts for each unigram and bigram
as computed: (i) over the whole corpus and (ii) over a subset
of the corpus that corresponds to the topic.
e. Produce a T x T matrix (where T is the number of topics in
   the corpus) for unigrams and bigrams. Each entry in the
   matrix should list the number of top-50 ngrams that the two
   corresponding topics have in common. Briefly discuss which
   topics are most similar and most dissimilar by this measure.
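Given the per-topic top-50 sets from (d), the matrix in (e) is just
pairwise set-intersection sizes. A minimal sketch:

```python
def overlap_matrix(top_lists):
    """top_lists: dict mapping each topic to the set of its top-50
    ngrams. Returns (matrix, topics) where matrix[i][j] is the number
    of ngrams that topics[i] and topics[j] have in common."""
    topics = sorted(top_lists)
    matrix = [[len(top_lists[a] & top_lists[b]) for b in topics]
              for a in topics]
    return matrix, topics
```

The diagonal is trivially the size of each topic's own list; the
off-diagonal entries are what the similarity discussion should
focus on.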