********************SAMPLE MIDTERM ONLY*****************************

1) Short Answers (provide a 2-3 sentence answer for 5 of the following 8): (5 points each = 25 points)

a) What is the Penn Treebank and why is it important?
b) Explain how you might decide whether 'child' is a noun or an adjective in "a child seat".
c) What is the difference between an adjunct and an argument in a Dependency Tree Grammar? Give examples of each.
d) Give two examples each of mass and count nouns, and give one example of a noun that is both.
e) Distinguish between phones, phonemes, and allophones, and explain why the distinction is important.
f) Place of articulation
g) Viterbi algorithm
h) Explain the difference between entropy and perplexity.

2) Exercises (Do 3 of the following 5): (15 points each = 45 points)

a) List as many ways as you can think of to convey the time expression "3:45 p.m." in English. Draw a finite-state automaton (or state-transition table) that recognizes these. Draw a finite-state transducer that translates them into 24-hour time.
b) Calculate the MED between the source word "brang" and each of two candidate corrections, "strange" and "blanch". Show the MED table for each. Use the Levenshtein distance with a cost of 2 for substitutions and 1 for insertions or deletions.
c) Draw an FSA that represents the language /ba+!/. Create a finite-state transducer that translates it into the language /mo+?/.
d) Write a grammar that covers the following fragment of English and construct a left-corner chart for this grammar:
      John gave a book to Mary.
      Give them a book.
      A book was given to John.
      They gave a very expensive book.
      The book was very expensive.
e) Calculate the bigram probability of the sentence "I want a British lunch", given the following unigram and bigram counts:

   Bigram counts:
                <s>     I   want     a  British  lunch
      <s>         0   859    357   452       22     57
      I           0     8   1087     0        0     10
      want        0     3      0    60       75    120
      a           0     0      0     2       55    205
      British     0     0     10     0        3     72
      lunch       0     4      0     0        4      0

   Unigram counts:
      <s> 10,000   I 3437   want 1215   a 6342   British 359   lunch 2768

3) Short answer (Provide ~1 paragraph on 2 of the following 4): (15 points each = 30 points)

a) Describe the training process for Brill's TBL part-of-speech tagger. Will sentences like "Time flies with a stopwatch" and "Time flies like an arrow" in the training data be a problem for Brill's approach? Why or why not?
b) You have just been hired to work on the pronunciation module for the Really Dumb Text-to-Speech System. The RD TTS system in particular can't pronounce proper names like "Anton Chekov", "Infiniti", or "antidisestablishmentarianism". What techniques can you recommend to improve this system?
c) Describe the major features of the Earley algorithm. What are its strengths and weaknesses?
d) Compare and contrast two of the smoothing procedures described in Jurafsky and Martin.
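
For study purposes, here is a minimal sketch of the dynamic-programming setup that Exercise 2b asks for: minimum edit distance with the stated costs (substitution 2, insertion/deletion 1). This is not part of the exam, and the function and variable names are my own choices, not anything specified in the question; the exam additionally asks you to show the full MED table, which this sketch only builds internally.

    # Minimum edit distance (Levenshtein) with substitution cost 2 and
    # insertion/deletion cost 1, as specified in Exercise 2b.
    def min_edit_distance(source, target, sub_cost=2, ins_cost=1, del_cost=1):
        n, m = len(source), len(target)
        # D[i][j] = cost of transforming source[:i] into target[:j]
        D = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):           # delete all of source[:i]
            D[i][0] = i * del_cost
        for j in range(1, m + 1):           # insert all of target[:j]
            D[0][j] = j * ins_cost
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                sub = 0 if source[i - 1] == target[j - 1] else sub_cost
                D[i][j] = min(D[i - 1][j] + del_cost,      # deletion
                              D[i][j - 1] + ins_cost,      # insertion
                              D[i - 1][j - 1] + sub)       # substitution / match
        return D[n][m]

    print(min_edit_distance("brang", "strange"))   # first candidate correction
    print(min_edit_distance("brang", "blanch"))    # second candidate correction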
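Exercise 2c asks for drawings, but the transducer it describes can also be written out as an explicit state-transition table, which may help when checking your diagram. The sketch below is one possible encoding, assuming a character-level transducer with made-up state names (q0-q3); it accepts strings in /ba+!/ and emits the corresponding string in /mo+?/.

    # (state, input char) -> (next state, output char)
    transitions = {
        ("q0", "b"): ("q1", "m"),
        ("q1", "a"): ("q2", "o"),
        ("q2", "a"): ("q2", "o"),   # self-loop handles "one or more a's"
        ("q2", "!"): ("q3", "?"),
    }
    accepting = {"q3"}

    def transduce(s):
        state, out = "q0", []
        for ch in s:
            if (state, ch) not in transitions:
                return None                  # reject strings outside /ba+!/
            state, o = transitions[(state, ch)]
            out.append(o)
        return "".join(out) if state in accepting else None

    print(transduce("baaa!"))   # -> "mooo?"
    print(transduce("ba"))      # -> None (rejected)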
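For Exercise 2e, the computation is the maximum-likelihood bigram estimate P(w_i | w_(i-1)) = C(w_(i-1) w_i) / C(w_(i-1)), multiplied across the sentence. The sketch below just mechanizes that arithmetic with the counts from the tables above; note that reading the unlabeled first row/column of the bigram table and the unlabeled 10,000 unigram count as a sentence-start symbol <s> is my interpretation of the original (garbled) table, not something the exam states explicitly.

    # Counts copied from the tables above; <s> labeling is an assumption.
    bigram_counts = {
        ("<s>", "I"): 859,
        ("I", "want"): 1087,
        ("want", "a"): 60,
        ("a", "British"): 55,
        ("British", "lunch"): 72,
    }
    unigram_counts = {"<s>": 10000, "I": 3437, "want": 1215,
                      "a": 6342, "British": 359, "lunch": 2768}

    sentence = ["<s>", "I", "want", "a", "British", "lunch"]
    prob = 1.0
    for prev, word in zip(sentence, sentence[1:]):
        # P(word | prev) = C(prev word) / C(prev)
        prob *= bigram_counts[(prev, word)] / unigram_counts[prev]
    print(prob)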