CS4706 - Spring 2008
Homework 5 - "Building a speech understanding system"

Due: Monday, May 5, 2008, by 2:40pm.




In this homework, you will design and build your own speech understanding system for reserving train tickets.  You should assume as input a grammatical English utterance (one sentence only) that indicates the departure and destination cities, and the departure day and time.  We will provide training data to help you test your system, but you should also record your own utterances for further testing.


The system will consist of two main components:

(a) An Automatic Speech Recognition (ASR) System: We provide a script that builds the ASR component using HTK (an HMM toolkit). The ASR acoustic models will be trained on the TIMIT, BDC, and Columbia Games corpora. The input to this component is a wav file (audio format: mono, sampling rate: 8 kHz), and the output will be the automatic transcript in MLF file format (see an example below).

(b) An Understanding Component: The input to this component is the ASR transcript from (a), and the output will be a table containing the following concepts, which you will be expected to extract automatically from the ASR transcript.

Departure city: New York, Boston, Baltimore, Newark, Jersey City, Washington, Albany, Poughkeepsie, Pittsburgh, Columbia

Destination city: (same set of cities as above)

Departure day: Sunday, Monday, …, Saturday

Departure time: Morning, Noon, Afternoon, Evening, Night, Anytime
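To make the concept table concrete, here is a minimal keyword-spotting sketch, in Python, of one way the understanding component could fill these slots. The function name and the FROM/TO cue-word heuristic are our own illustration, not a required design; your submitted component may use any approach you like.

```python
CITIES = ["NEW YORK", "BOSTON", "BALTIMORE", "NEWARK", "JERSEY CITY",
          "WASHINGTON", "ALBANY", "POUGHKEEPSIE", "PITTSBURGH", "COLUMBIA"]
DAYS = ["SUNDAY", "MONDAY", "TUESDAY", "WEDNESDAY", "THURSDAY",
        "FRIDAY", "SATURDAY"]
TIMES = ["MORNING", "NOON", "AFTERNOON", "EVENING", "NIGHT", "ANYTIME"]

def extract_concepts(transcript):
    """Map a one-sentence transcript to the four concept slots."""
    text = transcript.upper()
    concepts = {"Departure city": None, "Destination city": None,
                "Departure day": None, "Departure time": None}
    # Naive cue-word heuristic: a city preceded by FROM is the departure
    # city, one preceded by TO is the destination city.
    for city in CITIES:
        idx = text.find(city)
        while idx != -1:
            before = text[:idx].rstrip()
            if before.endswith("FROM"):
                concepts["Departure city"] = city.title()
            elif before.endswith("TO"):
                concepts["Destination city"] = city.title()
            idx = text.find(city, idx + 1)
    for day in DAYS:
        if day in text:
            concepts["Departure day"] = day.title()
    # Later, longer matches overwrite earlier ones, so AFTERNOON wins
    # over its substring NOON.
    for t in TIMES:
        if t in text:
            concepts["Departure time"] = t.title()
    return concepts
```

For the two sample utterances below, this sketch fills all four slots correctly, but it is deliberately simplistic: a real grammar will produce more word-order variation than a FROM/TO test can handle.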


Here are two examples. Given the following utterances:


I would like a ticket from Boston to New York on Friday morning


The output of your system should be:

Departure city	Boston

Destination city	New York

Departure day	Friday

Departure time	Morning

I need to go to Baltimore from Washington on Monday evening


The output of your system should be:

Departure city	Washington

Destination city	Baltimore

Departure day	Monday

Departure time	Evening

You should create a grammar that covers as many different ways of expressing these sorts of requests as you can think of, to make your system flexible.  However, your grammar should be limited enough that the ASR perplexity is not so high as to hurt performance.  You must experiment to find the trade-off between flexibility and performance that works best.  Part of the assignment is determining how much coverage you can achieve while maintaining reasonable performance.  Note that your success will be judged on concept accuracy, not transcription accuracy.


Here is an example of a grammar that covers the two examples above.
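As an illustration only (this is our sketch, not the course-provided grammar), a grammar covering the two utterances above could look like the following in HTK's HParse word-network notation; a real submission should cover far more phrasings:

```
$city = NEW YORK | BOSTON | BALTIMORE | NEWARK | JERSEY CITY |
        WASHINGTON | ALBANY | POUGHKEEPSIE | PITTSBURGH | COLUMBIA;
$day  = SUNDAY | MONDAY | TUESDAY | WEDNESDAY | THURSDAY | FRIDAY | SATURDAY;
$time = MORNING | NOON | AFTERNOON | EVENING | NIGHT | ANYTIME;

( SENT-START
  ( I WOULD LIKE A TICKET FROM $city TO $city ON $day $time |
    I NEED TO GO TO $city FROM $city ON $day $time )
  SENT-END )
```

Each added alternation (e.g., more carrier phrases, optional words) increases coverage but also raises perplexity, which is exactly the trade-off described above.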


To build the ASR component,  run the following commands:


1. cd /proj/speech/users/cs4706/asrhw

2. mkdir USERNAME (e.g., fb2175)


# The following command will take about 2 hours! While it is running, you should read Chapters 1, 2, and 3 of the HTK Book to understand what the script is doing.

3. /proj/speech/users/cs4706/tools/htk/htk/asr/train-asr.sh USERNAME

# When this command completes, your speech recognizer is ready. The acoustic models (monophones and triphones) are trained on the TIMIT, BDC, and Games corpora.


Next, test your ASR system:


1. mkdir /proj/speech/users/cs4706/asrhw/USERNAME/test/  

2. Record the two utterances above as wav files (8 kHz, mono) in Praat. For best recognition performance, leave ~1 second of silence at the beginning and ~1 second at the end of the file when recording. Call your files test1.wav and test2.wav and save them in /proj/speech/users/cs4706/asrhw/USERNAME/test/.

Save your grammar to a file named gram (not gram.txt) in /proj/speech/users/cs4706/asrhw/USERNAME.

3. cd /proj/speech/users/cs4706/asrhw/USERNAME

4. Run /proj/speech/users/cs4706/tools/htk/htk/asr/recognizePath.sh USERNAME ./test

# Note that the script in step 4 takes a path as its argument and runs the recognizer on all the wav files in that directory. Feel free to change the script to accept a filename as its argument so that you get the output for each utterance in a separate file.

5. Check the output of your recognizer in /proj/speech/users/cs4706/asrhw/USERNAME/out.mlf.


Now you have a speech recognizer that takes a speech wav file (or a directory containing a set of wav files) and generates the transcript in MLF file format.
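For reference, a recognizer output for test1.wav might look roughly like the following. This is a sketch of HTK's Master Label File (MLF) layout only; depending on the recognizer configuration, each label line may also carry start/end times and acoustic scores.

```
#!MLF!#
"*/test1.rec"
I
WOULD
LIKE
A
TICKET
FROM
BOSTON
TO
NEW
YORK
ON
FRIDAY
MORNING
.
```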


Next, you must write a program that takes a wav file, runs the ASR system, and generates the concept table shown in examples 1 and 2 above. Add a tab between the field name and the value, and a newline after each concept/value pair.  (Your scripts must be able to run on Speech Lab machines, so be sure that there are no version conflicts or other issues with your scripts.  Test them before submission on a lab machine such as vox, voix, veux, fluffy, ….)



                        RecognizeConcepts.sh ~/test/test2.wav


Output example:

Departure city	Washington

Destination city	Baltimore

Departure day	Monday

Departure time	Evening
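The pipeline can be tied together in one wrapper script. The sketch below is a hypothetical way to do it in Python (a shell script is equally acceptable): the recognizer path and the out.mlf location follow the commands earlier in this assignment, USERNAME is a placeholder, and extract_concepts stands in for your own understanding component (b).

```python
#!/usr/bin/env python
import os
import subprocess
import sys

# Paths as given earlier in the assignment; USERNAME is a placeholder.
RECOGNIZER = "/proj/speech/users/cs4706/tools/htk/htk/asr/recognizePath.sh"
USERNAME = "fb2175"

def read_mlf_words(mlf_path):
    # Collect the recognized words from an HTK MLF file: skip the
    # #!MLF!# header and quoted filename lines; "." ends each record.
    # Label lines may carry extra fields (times, scores); here we assume
    # the word is the last whitespace-separated field.
    words = []
    with open(mlf_path) as f:
        for line in f:
            line = line.strip()
            if not line or line == "#!MLF!#" or line == "." or line.startswith('"'):
                continue
            words.append(line.split()[-1])
    return " ".join(words)

def extract_concepts(transcript):
    # Your understanding component (b) goes here; it should return the
    # four concept/value pairs for this transcript.
    raise NotImplementedError

def main(wav_file):
    # recognizePath.sh runs on every wav file in the given directory.
    wav_dir = os.path.dirname(os.path.abspath(wav_file))
    subprocess.check_call([RECOGNIZER, USERNAME, wav_dir])
    out_mlf = os.path.join("/proj/speech/users/cs4706/asrhw", USERNAME, "out.mlf")
    transcript = read_mlf_words(out_mlf)
    for field, value in extract_concepts(transcript).items():
        # Tab between field and value, newline after each pair.
        print("%s\t%s" % (field, value))

if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1])
```

Invoked as, e.g., RecognizeConcepts ~/test/test2.wav, this would print the four concept/value lines in the format required above.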

We will test your system on the training data provided above as well as a test set in the train reservation domain spoken by the same speaker.  You will be graded based on grammar coverage and concept accuracy on the training and test data. Note that your system should be flexible enough to recognize the test utterances, which will be grammatical English sentences.



Submission:      You should submit 3 files (possible points for each in parentheses):

A readme.txt file that explains how to run your program with a command line example.  This file should also briefly explain  the coverage of your grammar and any heuristics you employed or other interesting aspects of your approach. (10 points)

gram: a file containing your grammar in the format specified above (20 points)

Your program that runs the ASR and extracts concepts (see components (a) and (b) above) (20 points)

 (Include a make.sh file to compile your code if necessary.)


      *** Upload these files in one zip file USERNAME.zip (e.g., fb2175.zip) to CourseWorks



                   The remaining 50 points will be based on your system’s concept accuracy on the training and test data.