CS4706 - Spring 2008
Homework 5 - "Building a speech understanding system"

Due: Monday, May 5, 2008, by 2:40pm.




In this homework, you will design and build your own speech understanding system for reserving train tickets. The input is a single grammatical, formal English utterance (one sentence only) that indicates the departure and destination cities and the departure day and time.


The system will consist of two main components:

a)      An Automatic Speech Recognizer (ASR): we provide you with a script that builds the ASR using HTK (the Hidden Markov Model Toolkit). The ASR acoustic models will be trained on the TIMIT, BDC, and Columbia Games corpora. The input to this component is a wav file (audio format: mono, sampling rate: 8 kHz), and the output will be the automatic transcript in MLF file format (see an example below)

b)      An Understanding Component: The input to this component is the ASR transcript from (a), and the output will be a table containing the following concepts, extracted automatically from the ASR transcript.

1.      Departure city: New York, Boston, Baltimore, Newark, Jersey City, Washington, Albany, Poughkeepsie, Pittsburg, Columbia  

2.      Destination city: (same as above)

3.      Departure day: Sunday, Monday, …, Saturday

4.      Departure Time: Morning, Noon, Afternoon, Evening, Night, Anytime  


Here are two examples. Given the following utterances:


1)      I would like a ticket from Boston to New York on Friday morning (LINK TO JULIA’s SPEECH1)


The output of your system should be:

Departure city	Boston
Destination city	New York
Departure day	Friday
Departure time	Morning

2)      I need to go to Baltimore from Washington on Monday evening (LINK TO JULIA’s SPEECH2)


The output of your system should be:

Departure city	Washington
Destination city	Baltimore
Departure day	Monday
Departure time	Evening

You are required to create a grammar that covers as many phrasings as you can think of, to make your system flexible. Your grammar must remain limited, however; otherwise the perplexity of the ASR would be very high (which would result in a high word error rate). Part of the homework is deciding what to cover and how much (precision vs. recall).


Here is an example of a grammar that covers the above two examples.
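The grammar file itself is linked above rather than reproduced here. As a rough illustration, a grammar covering the two example sentences might look like the following in HTK's HParse word-network notation; the variable names and word spellings (e.g., NEW_YORK) are assumptions and must match the entries in your pronunciation dictionary:

```
$city = NEW_YORK | BOSTON | BALTIMORE | NEWARK | JERSEY_CITY |
        WASHINGTON | ALBANY | POUGHKEEPSIE | PITTSBURG | COLUMBIA;
$day  = SUNDAY | MONDAY | TUESDAY | WEDNESDAY | THURSDAY | FRIDAY | SATURDAY;
$time = MORNING | NOON | AFTERNOON | EVENING | NIGHT | ANYTIME;

( SENT-START
  ( I WOULD LIKE A TICKET FROM $city TO $city |
    I NEED TO GO TO $city FROM $city )
  ON $day $time
  SENT-END )
```

Each sentence pattern you add makes the system more flexible but raises perplexity, which is exactly the precision vs. recall trade-off described above.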


I.      ASR component:

You should run the following commands to build your speech recognizer.

1. cd /proj/speech/users/cs4706/asrhw

2. mkdir USERNAME (e.g., fb2175)


# The following command will take a long time (around 2 hours). While it is running, please read chapters 1, 2, and 3 of the HTK Book to get a general idea of what the script is going to do. You can find the details of the individual steps in chapter 3.

3. /proj/speech/users/cs4706/tools/htk/htk/asr/train-asr.sh USERNAME

# At this point your speech recognizer is ready. The acoustic models (monophones and triphones) are trained on the TIMIT, BDC, and Games corpora.


Test your ASR:

1. mkdir /proj/speech/users/cs4706/asrhw/USERNAME/test/

2. Record two wav files (8 kHz, mono) in Praat containing the two utterances above. For best performance, leave ~1 second of silence at the beginning and ~1 second at the end of each file (while recording). Name your files test1.wav and test2.wav, and save them to /proj/speech/users/cs4706/asrhw/USERNAME/test/

Save the grammar from here as a file named gram (not gram.txt) in /proj/speech/users/cs4706/asrhw/USERNAME

3. cd /proj/speech/users/cs4706/asrhw/USERNAME

4. /proj/speech/users/cs4706/tools/htk/htk/asr/recognizePath.sh USERNAME ./test

# The above script (in 4) takes a path as an argument and runs the recognizer on all the wav files in that path. Feel free to copy and change it so that it takes a single filename instead (recommended).

5. more out.mlf #to see the output of the recognizer


Now you have a speech recognizer that takes a speech wav file (or a folder containing your speech files) and generates the transcript in MLF file format (example)
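For reference, a word-level MLF for the first test utterance might look roughly like the fragment below; HVite's actual output typically also carries start and end times (in 100 ns units) and an acoustic score on each word line, so check your own out.mlf for the exact layout:

```
#!MLF!#
"*/test1.rec"
I
WOULD
LIKE
A
TICKET
FROM
BOSTON
TO
NEW_YORK
ON
FRIDAY
MORNING
.
```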


II.      Write a program (Java is preferred) that takes an MLF file and generates the concept table (see examples 1 and 2 above). Put a tab between the field name and its value and a newline after each pair.



                  java -jar extractConcepts.jar out.mlf
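As a starting point, here is a minimal sketch of such an extractor in plain Java. The class name, the token spellings (e.g., NEW_YORK), and the assumed MLF line layout are illustrative; adapt them to whatever your grammar and out.mlf actually produce.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ExtractConcepts {

    // Concept vocabularies; the word forms are assumptions -- use the
    // exact tokens your grammar and dictionary emit.
    static final List<String> CITIES = Arrays.asList(
            "NEW_YORK", "BOSTON", "BALTIMORE", "NEWARK", "JERSEY_CITY",
            "WASHINGTON", "ALBANY", "POUGHKEEPSIE", "PITTSBURG", "COLUMBIA");
    static final List<String> DAYS = Arrays.asList(
            "SUNDAY", "MONDAY", "TUESDAY", "WEDNESDAY",
            "THURSDAY", "FRIDAY", "SATURDAY");
    static final List<String> TIMES = Arrays.asList(
            "MORNING", "NOON", "AFTERNOON", "EVENING", "NIGHT", "ANYTIME");

    // Map a recognized word sequence to the four concepts.
    public static Map<String, String> extract(List<String> words) {
        String dep = "", dest = "", day = "", time = "";
        List<String> cities = new ArrayList<>();
        for (int i = 0; i < words.size(); i++) {
            String w = words.get(i);
            if (CITIES.contains(w)) {
                cities.add(w);
                String prev = i > 0 ? words.get(i - 1) : "";
                if (prev.equals("FROM")) dep = w;
                else if (prev.equals("TO")) dest = w;
            } else if (DAYS.contains(w)) {
                day = w;
            } else if (TIMES.contains(w)) {
                time = w;
            }
        }
        // Heuristic from the assignment: if the ASR dropped "TO" (or
        // "FROM"), the surviving marker pins down one slot, and the other
        // city of the pair fills the remaining slot.
        if (cities.size() == 2) {
            if (dep.isEmpty() && !dest.isEmpty())
                dep = cities.get(0).equals(dest) ? cities.get(1) : cities.get(0);
            else if (dest.isEmpty() && !dep.isEmpty())
                dest = cities.get(0).equals(dep) ? cities.get(1) : cities.get(0);
        }
        Map<String, String> table = new LinkedHashMap<>();
        table.put("Departure city", dep);
        table.put("Destination city", dest);
        table.put("Departure day", day);
        table.put("Departure time", time);
        return table;
    }

    public static void main(String[] args) throws IOException {
        List<String> words = new ArrayList<>();
        for (String line : Files.readAllLines(Paths.get(args[0]))) {
            line = line.trim();
            // Skip the MLF header, quoted label-file names, and the "."
            // that terminates each transcription.
            if (line.isEmpty() || line.startsWith("#!MLF!#")
                    || line.startsWith("\"") || line.equals(".")) continue;
            // Time-aligned lines look like "start end WORD [score]";
            // otherwise the line is assumed to be just the word itself.
            String[] f = line.split("\\s+");
            words.add(f.length >= 3 ? f[2] : f[0]);
        }
        // One "field<TAB>value" pair per line, as the assignment requires.
        for (Map.Entry<String, String> e : extract(words).entrySet())
            System.out.println(e.getKey() + "\t" + e.getValue());
    }
}
```

Compile with `javac ExtractConcepts.java` and run with `java ExtractConcepts out.mlf`; packaging it as extractConcepts.jar is then just a matter of adding a manifest.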



III.      Write a shell script that takes a wav file as input and generates the concept table:



                        RecognizeConcepts.sh ~/test/test2.wav


Output example:

Departure city	Washington
Destination city	Baltimore
Departure day	Monday
Departure time	Evening

You will be graded on concept error rate plus task accomplishment (similar to the PARADISE metric you learned in class) on unseen data recorded by Prof. Julia Hirschberg (you do not know how she will phrase the utterances). Hopefully your system will be flexible enough to capture unseen utterances (grammatical, formal English sentences). You should think about applying some heuristics to improve your model; for example, if the ASR dropped "to", the city following "from" still identifies the departure city, so the other city must be the destination. Try to live with this noisy source. The student with the best system (lowest concept error rate) will get an A+ (110).



Submission:      (50%)

1)      Your program (in II) extractConcepts.jar (extractConcepts.java, or extractConcepts.pl, ...) (15 points)

2)      Your shell script (recognizeConcepts.sh) (15 points)

3)      make.sh (that compiles your code) (10 points)

4)      readme.txt: (10 points)

        i. how to run your programs

        ii. two examples showing how to run parts II and III


      *** Upload these files in one zip file USERNAME.zip (e.g., fb2175.zip) to courseworks



                   (50%) Quality of your system