Homework 2: Corpus Statistics
Due: 30 Oct 2003
Points: 133

For this homework, you will clean up and collect various statistics on a corpus in preparation for running a machine learning experiment on topic classification in Homework 3. As you do this homework, think about which features of the texts might provide useful cues for distinguishing texts on topic A from texts on topic B (e.g. characteristic bigrams, document length, etc.).

The corpus you will be working with is a subset of the TDT (Topic Detection and Tracking) corpus. This part of the corpus consists of broadcast news transcripts collected from CNN in 1994-1995. The corpus is a single text file in SGML format. Within the corpus file, each document is surrounded by <DOC> tags and has three fields: <TDTID>, <TOPIC>, and <TEXT>. <TDTID> is the document number, and the other two are self-explanatory. Here is an example:

<DOC>
<TDTID>TDT000001</TDTID>
<TOPIC>Topic1</TOPIC>
<TEXT>Blah blah. Blah.</TEXT>
</DOC>
<DOC>
<TDTID>TDT000210</TDTID>
<TOPIC>Topic2</TOPIC>
<TEXT>Blah blah. Blah.</TEXT>
</DOC>

The corpus has been divided into three subsets: a training set, a development test set, and a validation test set. For this homework, you will have access only to the training set:

    /proj/nlp/users/cs4705/train.sgml

*************************************************************************

1. (54 pts.) Corpus clean-up and End of Sentence (EOS) detection.

This is important for getting good corpus statistics, so that you do not, e.g., count 'today' and 'today.' as different word types, and so that you treat the elements of 'case-study' the same as 'case study' when counting ngrams. (A minimal sketch of steps a-c appears after section 2 below.)

a. EOS detection: Mark questions (sentences ending in '?') with _SQ_, statements (sentences ending in '.', ';' or ':') with _SS_, and exclamations (sentences ending in '!') with _SE_. For example, 'Hi. How are you? I am fine.' would become 'Hi _SS_ How are you _SQ_ I am fine _SS_'. For this task you will have to distinguish true ends of sentences from things that merely look like them, such as abbreviations. Note that abbreviations may appear at the end of a sentence as well as sentence-internally, e.g. 'He lives at 113 W. 15th St.'

   i. List the counts of each of the three sentence types in the training corpus.
   ii. List all of the abbreviations in the corpus that end with a period.

b. Remove all word-internal punctuation except for periods associated with abbreviations (e.g. replace '-' with ' '; remove the characters ",'()[]{}/<>|\ and any other punctuation).

c. Lower-case all words (e.g. replace 'Street' with 'street' and 'ABC' with 'abc'). This is important for (2) and (3).

2. (25 pts.) General Corpus Statistics.

List the following for the corpus (a sketch appears after this section):

a. The number of documents.

b. The topics (as indicated by the <TOPIC> tag), as well as the number of documents for each topic.

c. The TDTID (document id number) of the longest and the shortest document (measured in words), along with the number of words each contains. (Here, just use white space to delimit words, since you will already have eliminated hyphenated words, e.g., in 1.b.)
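As a starting point for part 1, here is a minimal sketch of the clean-up and EOS marking, assuming clean-up runs before EOS marking and using a tiny, made-up abbreviation list; a real solution needs a much fuller list (see 1.a.ii) and must also handle abbreviations that genuinely end a sentence.

    import re

    # A tiny, assumed abbreviation list; the real corpus needs a much
    # fuller one (compare 1.a.ii).
    ABBREVIATIONS = {"mr.", "mrs.", "dr.", "st.", "w.", "jan.", "u.s."}

    EOS_MARKS = {".": "_SS_", ";": "_SS_", ":": "_SS_", "?": "_SQ_", "!": "_SE_"}

    def clean(text):
        """Lower-case and strip word-internal punctuation (1.b-c)."""
        text = text.lower().replace("-", " ")   # 'case-study' -> 'case study'
        return re.sub(r"[\",'()\[\]{}/<>|\\]", "", text)

    def mark_eos(text):
        """Replace true end-of-sentence punctuation with _SS_/_SQ_/_SE_ (1.a)."""
        out = []
        for token in text.split():
            if token[-1] in EOS_MARKS and token.lower() not in ABBREVIATIONS:
                out.extend([token[:-1], EOS_MARKS[token[-1]]])
            else:
                # Abbreviations keep their period; this naive check misses
                # abbreviations that really do end a sentence.
                out.append(token)
        return " ".join(out)

    print(mark_eos(clean("Hi. How are you? I am fine.")))
    # -> hi _SS_ how are you _SQ_ i am fine _SS_

Running clean() first means the EOS markers are inserted into already lower-cased text; if you prefer the capitalization shown in the example in 1.a, reverse the two steps.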
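Likewise, a sketch of the document-level statistics in part 2. The regex encodes the tag layout shown in the example at the top of this handout; if the real markup differs (attributes, extra whitespace, different tag names), it will need adjusting.

    import re
    from collections import Counter

    # Regex over the assumed <DOC>/<TDTID>/<TOPIC>/<TEXT> markup.
    DOC_RE = re.compile(
        r"<DOC>\s*<TDTID>(.*?)</TDTID>\s*"
        r"<TOPIC>(.*?)</TOPIC>\s*<TEXT>(.*?)</TEXT>",
        re.DOTALL)

    with open("/proj/nlp/users/cs4705/train.sgml") as f:
        docs = DOC_RE.findall(f.read())    # [(tdtid, topic, text), ...]

    print("documents:", len(docs))                                 # 2.a

    for topic, n in Counter(t for _, t, _ in docs).most_common():  # 2.b
        print(topic, n)

    # 2.c: word counts by white space, after the part-1 clean-up.
    lengths = sorted((len(text.split()), tdtid) for tdtid, _, text in docs)
    print("shortest:", lengths[0], "longest:", lengths[-1])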
3. (54 pts.) Ngram Statistics.

Collect unigram and bigram frequencies for the corpus (no smoothing). For unigrams this is c(x)/N, and for bigrams c(x1,x2)/N, where c(x) is the count of token x, c(x1,x2) is the count of the bigram 'x1 x2', and N is the total number of tokens. These should be calculated over the corpus you 'cleaned up' in (1), with all words lower-cased and sentence-internal punctuation removed. (Sketches of parts a-c and d-e appear at the end of this section.)

a. List the number of unigram and bigram types in the corpus.

b. List the number of unigram and bigram tokens in the corpus.

c. Calculate unigram and bigram frequencies over the entire corpus. List the 50 most frequent unigrams and bigrams in the corpus, excluding the function words in:

    /proj/nlp/users/cs4705/stoplist.txt

Also print out their frequencies. Present them in order of descending frequency.

d. Do the following for each topic. Calculate the unigram and bigram frequencies. Take the top 50 unigrams and bigrams, excluding those in the stop list. Then, for each topic, list the unigrams and bigrams that do not occur in any of the other topics' top-50 lists. Also provide two frequency counts for each such unigram and bigram: (i) over the whole corpus, and (ii) over the subset of the corpus that corresponds to the topic.

e. Produce a T x T matrix (where T is the number of topics in the corpus) for unigrams and another for bigrams. For each entry in the matrix, list the number of top-50 ngrams that the two topics have in common. Briefly discuss which topics are most similar and most dissimilar by this measure.
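For parts a-c, a sketch that assumes `docs` from the part-2 sketch (with each text already run through the part-1 clean-up) and a stop list with one word per line; both the stop-list format and the decision to drop a bigram when either of its words is stop-listed are assumptions.

    from collections import Counter

    with open("/proj/nlp/users/cs4705/stoplist.txt") as f:
        stoplist = {line.strip() for line in f if line.strip()}

    unigrams, bigrams = Counter(), Counter()
    for _, _, text in docs:
        tokens = text.split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    N = sum(unigrams.values())
    print("types:", len(unigrams), len(bigrams))    # 3.a
    print("tokens:", N, sum(bigrams.values()))      # 3.b

    def top50(counter, is_stop):
        """Top 50 entries by count as c/N frequencies, skipping stop words."""
        return [(x, c / N) for x, c in counter.most_common() if not is_stop(x)][:50]

    # 3.c: one reading of 'excluding the function words' is to drop a
    # bigram if either of its words is stop-listed.
    for w, freq in top50(unigrams, lambda w: w in stoplist):
        print(w, freq)
    for (w1, w2), freq in top50(bigrams, lambda b: b[0] in stoplist or b[1] in stoplist):
        print(w1, w2, freq)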
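And for parts d-e, a unigram-only sketch (bigrams are handled analogously) that reuses `docs`, `stoplist`, `unigrams`, and `top50` from the previous sketch.

    from collections import Counter, defaultdict

    # Per-topic unigram counts.
    by_topic = defaultdict(Counter)
    for _, topic, text in docs:
        by_topic[topic].update(text.split())

    # Top-50 unigram set per topic, stop words excluded.
    tops = {t: {w for w, _ in top50(c, lambda w: w in stoplist)}
            for t, c in by_topic.items()}

    # 3.d: unigrams in this topic's top 50 and no other topic's, with
    # whole-corpus and topic-internal counts.
    for topic in tops:
        others = set().union(*(tops[t] for t in tops if t != topic))
        for w in sorted(tops[topic] - others):
            print(topic, w, unigrams[w], by_topic[topic][w])

    # 3.e: T x T matrix of top-50 overlaps between topics.
    topics = sorted(tops)
    print("\t" + "\t".join(topics))
    for a in topics:
        print(a + "\t" + "\t".join(str(len(tops[a] & tops[b])) for b in topics))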