Homework 2: Corpus Statistics
Due: 30 Oct 2003
Points: 133

For this homework, you will clean up and collect various statistics on a corpus in preparation for running a machine learning experiment on topic classification in Homework 3. As you do this homework, think about which features of the texts might provide useful cues for distinguishing texts on topic A from texts on topic B (e.g. characteristic bigrams, document length, etc.).

The corpus you will be working with is a subset of the TDT (Topic Detection and Tracking) corpus. This part of the corpus consists of broadcast news transcripts collected from CNN in 1994-1995. The corpus is a single text file in SGML format. Within the corpus file, each document is surrounded by <DOC> tags and has three fields: <TDTID>, <TOPIC>, and <TEXT>. <TDTID> is the document number, and the other two are self-explanatory. Here is an example:

<DOC>
<TDTID>TDT000001</TDTID>
<TOPIC>Topic1</TOPIC>
<TEXT>Blah blah. Blah.</TEXT>
</DOC>
<DOC>
<TDTID>TDT000210</TDTID>
<TOPIC>Topic2</TOPIC>
<TEXT>Blah blah. Blah.</TEXT>
</DOC>

The corpus has been divided into three subsets: a training set, a development test set, and a validation test set. For this homework, you will have access only to the training set:

    /proj/nlp/users/cs4705/train.sgml

*************************************************************************

1. (54 pts.) Corpus clean-up and End of Sentence (EOS) detection.

This is important for getting good corpus statistics, so that you do not, e.g., count 'today' and 'today.' as different word types, and so that you treat the elements of 'case-study' the same as 'case study' when counting ngrams. (A minimal sketch of steps a-c appears after section 2 below.)

a. EOS detection: Mark questions (sentences ending in '?') with _SQ_, statements (sentences ending in '.', ';' or ':') with _SS_, and exclamations (sentences ending in '!') with _SE_. For example, 'Hi. How are you? I am fine.' would become 'Hi _SS_ How are you _SQ_ I am fine _SS_'. For this task you will have to distinguish true ends of sentences from things that merely look like them, such as abbreviations. Note that abbreviations may appear at the end of a sentence as well as sentence-internally, e.g. 'He lives at 113 W. 15th St.'

   i. List the counts of each of the three sentence types in the training corpus.
   ii. List all of the abbreviations in the corpus that end with a period.

b. Remove all word-internal punctuation except for periods associated with abbreviations (e.g. replace '-' with ' '; remove the characters ",'()[]{}/<>|\ and any other punctuation).

c. Lower-case all words (e.g. replace 'Street' with 'street' and 'ABC' with 'abc'). This is important for (2) and (3).

2. (25 pts.) General Corpus Statistics.

List the following for the corpus (a sketch appears after this section):

a. The number of documents.

b. The topics (as indicated by the <TOPIC> tag), as well as the number of documents for each topic.

c. The TDTID (document id number) of the longest and the shortest document (measured in words), along with the number of words each contains. (Here, just use white space to delimit words, since you will already have eliminated hyphenated words, e.g., in 1.b.)
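As a starting point for part 1, here is a minimal sketch of the clean-up and EOS marking, assuming clean-up runs before EOS marking and using a tiny, made-up abbreviation list; a real solution needs a much fuller list (see 1.a.ii) and must also handle abbreviations that genuinely end a sentence.

    import re

    # A tiny, assumed abbreviation list; the real corpus needs a much
    # fuller one (compare 1.a.ii).
    ABBREVIATIONS = {"mr.", "mrs.", "dr.", "st.", "w.", "jan.", "u.s."}

    EOS_MARKS = {".": "_SS_", ";": "_SS_", ":": "_SS_", "?": "_SQ_", "!": "_SE_"}

    def clean(text):
        """Lower-case and strip word-internal punctuation (1.b-c)."""
        text = text.lower().replace("-", " ")   # 'case-study' -> 'case study'
        return re.sub(r"[\",'()\[\]{}/<>|\\]", "", text)

    def mark_eos(text):
        """Replace true end-of-sentence punctuation with _SS_/_SQ_/_SE_ (1.a)."""
        out = []
        for token in text.split():
            if token[-1] in EOS_MARKS and token.lower() not in ABBREVIATIONS:
                out.extend([token[:-1], EOS_MARKS[token[-1]]])
            else:
                # Abbreviations keep their period; this naive check misses
                # abbreviations that really do end a sentence.
                out.append(token)
        return " ".join(out)

    print(mark_eos(clean("Hi. How are you? I am fine.")))
    # -> hi _SS_ how are you _SQ_ i am fine _SS_

Running clean() first means the EOS markers are inserted into already lower-cased text; if you prefer the capitalization shown in the example in 1.a, reverse the two steps.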
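Likewise, a sketch of the document-level statistics in part 2. The regex encodes the tag layout shown in the example at the top of this handout; if the real markup differs (attributes, extra whitespace, different tag names), it will need adjusting.

    import re
    from collections import Counter

    # Regex over the assumed <DOC>/<TDTID>/<TOPIC>/<TEXT> markup.
    DOC_RE = re.compile(
        r"<DOC>\s*<TDTID>(.*?)</TDTID>\s*"
        r"<TOPIC>(.*?)</TOPIC>\s*<TEXT>(.*?)</TEXT>",
        re.DOTALL)

    with open("/proj/nlp/users/cs4705/train.sgml") as f:
        docs = DOC_RE.findall(f.read())    # [(tdtid, topic, text), ...]

    print("documents:", len(docs))                                 # 2.a

    for topic, n in Counter(t for _, t, _ in docs).most_common():  # 2.b
        print(topic, n)

    # 2.c: word counts by white space, after the part-1 clean-up.
    lengths = sorted((len(text.split()), tdtid) for tdtid, _, text in docs)
    print("shortest:", lengths[0], "longest:", lengths[-1])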
3. (54 pts.) Ngram Statistics.

Collect unigram and bigram frequencies for the corpus (no smoothing). For unigrams this is c(x)/N, and for bigrams c(x1,x2)/N, where c(x) is the count of token x, c(x1,x2) is the count of the bigram 'x1 x2', and N is the total number of tokens. These should be calculated over the corpus you 'cleaned up' in (1), with all words lower-cased and sentence-internal punctuation removed. (Sketches of parts a-c and d-e appear at the end of this section.)

a. List the number of unigram and bigram types in the corpus.

b. List the number of unigram and bigram tokens in the corpus.

c. Calculate unigram and bigram frequencies over the entire corpus. List the 50 most frequent unigrams and bigrams in the corpus, excluding the function words in:

    /proj/nlp/users/cs4705/stoplist.txt

Also print out their frequencies. Present them in order of descending frequency.

d. Do the following for each topic. Calculate the unigram and bigram frequencies. Take the top 50 unigrams and bigrams, excluding those in the stop list. Then, for each topic, list the unigrams and bigrams that do not occur in any of the other topics' top-50 lists. Also provide two frequency counts for each such unigram and bigram: (i) over the whole corpus, and (ii) over the subset of the corpus that corresponds to the topic.

e. Produce a T x T matrix (where T is the number of topics in the corpus) for unigrams and another for bigrams. For each entry in the matrix, list the number of top-50 ngrams that the two topics have in common. Briefly discuss which topics are most similar and most dissimilar by this measure.
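For parts a-c, a sketch that assumes `docs` from the part-2 sketch (with each text already run through the part-1 clean-up) and a stop list with one word per line; both the stop-list format and the decision to drop a bigram when either of its words is stop-listed are assumptions.

    from collections import Counter

    with open("/proj/nlp/users/cs4705/stoplist.txt") as f:
        stoplist = {line.strip() for line in f if line.strip()}

    unigrams, bigrams = Counter(), Counter()
    for _, _, text in docs:
        tokens = text.split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    N = sum(unigrams.values())
    print("types:", len(unigrams), len(bigrams))    # 3.a
    print("tokens:", N, sum(bigrams.values()))      # 3.b

    def top50(counter, is_stop):
        """Top 50 entries by count as c/N frequencies, skipping stop words."""
        return [(x, c / N) for x, c in counter.most_common() if not is_stop(x)][:50]

    # 3.c: one reading of 'excluding the function words' is to drop a
    # bigram if either of its words is stop-listed.
    for w, freq in top50(unigrams, lambda w: w in stoplist):
        print(w, freq)
    for (w1, w2), freq in top50(bigrams, lambda b: b[0] in stoplist or b[1] in stoplist):
        print(w1, w2, freq)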
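And for parts d-e, a unigram-only sketch (bigrams are handled analogously) that reuses `docs`, `stoplist`, `unigrams`, and `top50` from the previous sketch.

    from collections import Counter, defaultdict

    # Per-topic unigram counts.
    by_topic = defaultdict(Counter)
    for _, topic, text in docs:
        by_topic[topic].update(text.split())

    # Top-50 unigram set per topic, stop words excluded.
    tops = {t: {w for w, _ in top50(c, lambda w: w in stoplist)}
            for t, c in by_topic.items()}

    # 3.d: unigrams in this topic's top 50 and no other topic's, with
    # whole-corpus and topic-internal counts.
    for topic in tops:
        others = set().union(*(tops[t] for t in tops if t != topic))
        for w in sorted(tops[topic] - others):
            print(topic, w, unigrams[w], by_topic[topic][w])

    # 3.e: T x T matrix of top-50 overlaps between topics.
    topics = sorted(tops)
    print("\t" + "\t".join(topics))
    for a in topics:
        print(a + "\t" + "\t".join(str(len(tops[a] & tops[b])) for b in topics))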