My main area of research is computational linguistics, specifically the
relationship between intonation and discourse. My current interests include
emotional speech (including deceptive and charismatic speech); intonation
variation in spoken dialogue systems; speech synthesis; speech search and summarization over large
corpora of broadcast news and voicemail; and interfaces to speech corpora.
Below are some of my papers. A complete list of publications can be found in my
resume; if you have trouble finding any papers, please
send me email. Slides from a tutorial with a bibliography are
also available for download. Also, I have written several general surveys
of work on intonational meaning and text-to-speech synthesis, including "Communication and Prosody" in Speech
Communication 36 (2002), an article in the Handbook of Pragmatics, "Pragmatics
and Intonation" (2003), and a section of the second edition of the Encyclopedia
of Language and Linguistics on “Speech Synthesis, Prosody” (final draft).
for a fuller listing of publications from the Speech Lab. For a list of current and past collaborators
Jennifer Venditti, Jackson Liscombe, Agustin Gravano, and I have been
looking at various methods of eliciting both subjective and objective judgments
and of correlating judgments of single tokens on multiple emotion scales --
i.e., if subjects rate a token high for frustration, which other
emotional states do they also rate it high or low for ("Classifying Subjective Ratings of Emotional
Speech," Eurospeech 2003). We have also conducted eye-tracking experiments
that allow us to compare subjective judgments to more objective cues to the
decision process. We are also working with colleagues at the
Agus Gravano, Stefan Benus, and I have been collecting and
analyzing a large corpus of spontaneous dialogues, produced by subjects playing
a computer game we created. We collected
this data to test several theories of the way speakers produce ‘given’
(as opposed to ‘new’) information.
We are currently labeling this corpus in ToBI and have identified
different turn-taking behaviors, cue phrases, questions (identified as to form
and function) and other aspects of the corpus.
This is joint work with Gregory Ward and colleagues at
Diane Litman, Marc Swerts and I have been working on the prosodic consequences of recognition errors in Spoken Dialogue Systems. We are studying whether prosodic features of user utterances can tell us (a) whether a speech recognition error has occurred, as a user reacts to it (e.g. System: "Did you say you want to go to Baltimore?" User: "NO!"), or (b) whether a user is in fact correcting such a recognition error (e.g. User: "I want to go to BOSTON!"). We have already found that prosodic features predict recognition errors directly with considerable accuracy in the TOOT train information corpus dialogues. Using machine learning techniques, we have found that, in combination with information already available to the recognizer, such as acoustic confidence scores, grammar, and recognized string, prosodic information can distinguish speaker turns that are misrecognized far better than traditional methods for ASR rejection using acoustic confidence scores alone. See Julia Hirschberg, Diane Litman and Marc Swerts, “Prosodic and Other Cues to Speech Recognition Failures,” Speech Communication 2004. We have also studied user corrections of system errors in the TOOT corpus, finding significant prosodic differences between corrections and non-corrections that can be used, with some success, to predict when a user is correcting the system. In addition, we find interesting and useful correlations between system strategies and types of user corrections, as well as evidence for which types of corrections are more successful (see "Corrections in Spoken Dialogue Systems," presented at ICSLP-00, and "Identifying User Corrections Automatically in Spoken Dialogue Systems," presented at NAACL-01).
Work on our SCAN (Spoken Content-Based Audio Navigation) browsing and retrieval system is summarized in John Choi et al., "Spoken Content-Based Audio Navigation (SCAN)," ICPhS-99. This project combines ASR and IR technology to enable search of large audio databases, such as broadcast news archives or voicemail. It started life as `AudioGrep'. Current collaborators include Steve Abney, Brian Amento, Michiel Bacchiani, Phil Isenhour, Diane Litman, Larry Stead, and Steve Whittaker. My particular interests lie in the use of acoustic information to segment audio (Julia Hirschberg and Christine Nakatani, "Acoustic Indicators of Topic Segmentation," ICSLP-98) and the study of how people browse and search audio databases such as broadcast news collections (Steve Whittaker et al., "SCAN: Designing and Evaluating User Interfaces to Support Retrieval from Speech Archives," SIGIR-99) and voicemail (Steve Whittaker, Julia Hirschberg and Christine Nakatani, "Play it again: a study of the factors underlying speech browsing behavior," and Steve Whittaker, Julia Hirschberg and Christine Nakatani, "All talk and all action: strategies for managing voicemail messages," both presented at CHI-98). We have also studied how differences in ASR accuracy (comparing transcripts of 100%, 84%, 69%, and 50% accuracy) affect users' ability to perform tasks, finding effects of transcript accuracy on time to solution, amount of speech played, likelihood of subjects abandoning the transcript, and various subjective measures; however, our results hold only when we collapse our four categories into two; i.e., there are no differences between perfect and 84% accurate transcripts or between 69% and 50% accurate ones (Litza Stark, Steve Whittaker, and Julia Hirschberg, "ASR Satisficing: The effects of ASR accuracy on speech retrieval," ICSLP-00).
Currently, in a new voicemail application, SCANMail, now in friendly trial, we have ported SCAN technology to the voicemail domain: users are able to browse and retrieve their voicemail by content. See J. Hirschberg et al., "SCANMail: Browsing and Searching Speech Data by Content Domain" and A. Rosenberg et al., "Caller Identification for the SCANMail Voicemail Browser" (both presented at Eurospeech 2001). Meredith Ringel and I have also worked on ranking voicemail messages as to urgency and distinguishing personal from business messages, using machine learning techniques ("Automated Message Prioritization: Making Voicemail Retrieval More Efficient," presented at CHI 2002).
Some results of a long collaboration with Barbara Grosz and Christine Nakatani on the intonational correlates of discourse structure in read and spontaneous speech are described in "A Prosodic Analysis of Discourse Segments in Direction-Giving Monologues" (ACL-96). The BDC corpus (with ToBI labels) is available here. Results of earlier studies of read speech are described in "Some Intonational Characteristics of Discourse Structure" (a reformatted version of ICSLP-92).
Empirical studies comparing the way native speakers of different languages employ intonational variation to disambiguate are described in Julia Hirschberg and Cinzia Avesani, "The Role of Prosody in Disambiguating Potentially Ambiguous Utterances in English and Italian," ESCA Tutorial and Research Workshop on Intonation, Athens, 1997.
Christine Nakatani and Julia Hirschberg, "A Corpus-based study of repair cues in spontaneous speech," JASA, 1994, describes studies of the acoustic/prosodic characteristics of self-repairs.
Work on cue phrases, or discourse markers, is described in Julia Hirschberg and Diane Litman, "Empirical Studies on the Disambiguation of Cue Phrases," Computational Linguistics, 1992 (some figures are missing in this version). More recently, Agus Gravano, Stefan Benus and I have been looking at cue phrase production in the Games corpus (see above).
I collaborated most recently on two
projects in concept-to-speech generation (generating speech from an abstract
representation of the concepts to be conveyed). One, with Shimei Pan and Kathy McKeown of Columbia
University, seeks to assign prosody appropriately for a multimodal medical
Some results are documented in Shimei Pan, Kathy McKeown and Julia Hirschberg, "Semantic Abnormality and its Realization in
Spoken Language," Proceedings of Eurospeech 2001,
Philipp Koehn, Steven Abney, Julia Hirschberg, and Michael Collins, "Improving Intonational Phrasing with Syntactic Information," to appear in ICASSP-00. Julia Hirschberg and Pilar Prieto, "Training intonational phrasing rules automatically for English and Spanish Text-to-Speech," Speech Communication, 1996. Michelle Wang and Julia Hirschberg, "Automatic Classification of Intonational Phrase Boundaries," Computer Speech and Language, 1992.
See Julia Hirschberg, "Pitch Accent in Context: Predicting Intonational Prominence from Text," Artificial Intelligence, 1993.
I have been an active participant in the development of the ToBI Labeling Standard for the
prosodic labeling of Standard American English (see the ToBI conventions for a quick overview). This
standard was developed by a number of researchers from industry and academia
and has been extended for other dialects of English and for other languages,
including Italian, German, Spanish, Japanese and more. Interlabeler reliability
ratings (see John Pitrelli, Mary Beckman, and Julia Hirschberg, "Evaluation
of Prosodic Transcription Labeling Reliability in the ToBI Framework,"
Proceedings of the Third International Conference on Spoken Language
Processing, Yokohama, September, 1994, pp. 123-126) are quite good and there
are tools and training
materials available with pdf and html versions and praat files there. There
is also a Wavesurfer version and another Praat version with cardinal
examples done by Agus Gravano
and available from the Columbia