Research

Speaker State

Emotional Speech

Jennifer Venditti, Jackson Liscombe, and I have looked at methods of eliciting both subjective and objective judgments of emotional speech, and of correlating judgments of single tokens on multiple emotion scales -- i.e., if subjects rate a token high for frustration, what other emotional states do they also rate it high, or low, for ("Classifying Subjective Ratings of Emotional Speech," Eurospeech 2003). We conducted eye-tracking experiments that allow us to compare subjective judgments with more objective cues to the decision process. We have also worked with colleagues at the University of Pittsburgh to study speaker state in student speech in a tutorial system, examining emotional states such as anger, frustration, confidence, and uncertainty ("Detecting Certainness in Spoken Tutorial Dialogues," INTERSPEECH 2005). We have also studied question form and function in this domain and performed machine learning experiments to identify question-bearing turns, as well as question form and function, automatically ("Detecting question-bearing turns in spoken tutorial dialogues" and "Intonational cues to student questions in tutoring dialogs," INTERSPEECH 2006). Agus Gravano, Elisa Sneed, Gregory Ward, and I have also looked at intonational contour and syntactic construction in the conveyance of speaker certainty ("The effect of contour type and epistemic modality on the assessment of speaker certainty," Speech Prosody 2008), and Frank Enos and I have proposed a new methodology for eliciting emotional speech ("A framework for eliciting emotional speech: Capitalizing on the actor's process," LREC Workshop on Emotional Corpora).

Deceptive Speech

Frank Enos, Stefan Benus, and I are working with colleagues at SRI/ICSI and the University of Colorado on automatic methods of distinguishing deceptive from non-deceptive speech ("Distinguishing Deceptive from Non-Deceptive Speech," INTERSPEECH 2005; "Detecting deception using critical segments," INTERSPEECH 2007). For this work we collected and annotated a large corpus of deceptive and non-deceptive speech, the CSC Deception Corpus. We have also looked at the role of pausing in deception ("Pauses in deceptive speech," Speech Prosody 2006) and examined the role of personality in the accuracy of human judges of deception ("Personality factors in human deception detection: Comparing human to machine performance," INTERSPEECH 2006).

Charismatic Speech

Andrew Rosenberg, Fadi Biadsy, and I have studied the acoustic, prosodic, and lexical cues to charismatic speech in American English ("Acoustic/Prosodic and Lexical Correlates of Charismatic Speech," INTERSPEECH 2005). With Fadi Biadsy we have extended this effort to include research on Palestinian Arabic, and with Rolf Carlson (KTH) and Eva Strangert (Umeå) we have investigated cross-cultural perceptions of charisma and their acoustic, prosodic, and lexical features ("A cross-cultural comparison of American, Palestinian, and Swedish perception of charismatic speech," Speech Prosody 2008).

Speech Summarization and Distillation

With Sameer Maskey, Andrew Rosenberg, and Fadi Biadsy, I have worked on speech summarization, exploring new techniques that take advantage of prosodic and acoustic information, in addition to lexical and structural cues, to 'gist' news broadcasts ("Automatic speech summarization of broadcast news using structural features," EUROSPEECH 2003; "Comparing Lexical, Acoustic/Prosodic, Structural and Discourse Features for Speech Summarization," INTERSPEECH 2005; "Summarizing Speech without Text Using Hidden Markov Models," HLT/NAACL 2006; and "Intonational Phrases for Speech Summarization," INTERSPEECH 2008). We have also looked at the segmentation of news broadcasts into stories ("Story Segmentation of Broadcast News in English, Mandarin and Arabic," HLT/NAACL 2006), the determination of speaker roles such as anchor, reporter, and interviewee (see R. Barzilay et al., "Identification of Speaker Role in Radio Broadcasts," AAAI 2000, for earlier work), and the extraction of soundbites (spoken 'quotes' included in a show) from broadcasts and the identification of their speakers. Related work includes "An unsupervised approach to biography production using Wikipedia," ACL/NAACL 2008. Elena Filatova, Martin Jansche, Mehrbod Sharifi, and Wisam Dakka are also co-authors of some of this work.

Spoken Dialogue Systems

The Columbia Games Corpus

Agus Gravano, Stefan Benus, and I have been collecting and analyzing a large corpus of spontaneous dialogues produced by subjects playing a computer game we created. We collected this data to test several theories of the way speakers produce 'given' (as opposed to 'new') information. We are currently labeling this corpus for intonation in the ToBI framework; we have also labeled turn-taking behaviors, cue phrases, questions (identified as to form and function), and other aspects of the corpus. This is joint work with Gregory Ward and Elisa Sneed at Northwestern University. Michael Mulley was also an active participant in the design of the corpus.

Cue Phrases

Work on cue phrases, or discourse markers, is described in Julia Hirschberg and Diane Litman, "Empirical Studies on the Disambiguation of Cue Phrases," Computational Linguistics, 1992 (some figures are missing in this version). More recently, Agus Gravano, Stefan Benus, Lauren Wilcox, Hector Chavez, Shira Mitchell, Ilia Vovsha, and I have been looking at cue phrase production and detection in the Games Corpus ("On the role of context and prosody in the interpretation of okay," ACL 2007; "Classification of discourse functions of affirmative words in spoken dialogue," Interspeech 2007; "The prosody of backchannels in American English," ICPhS 2007).

Speaker Entrainment

Ani Nenkova, Agus Gravano and I are looking at various types of speaker entrainment in the Games Corpus (“High frequency word entrainment in spoken dialogue”, ACL 2008).  We are also examining acoustic/prosodic entrainment.

The Given/New Distinction

Agus Gravano, Ani Nenkova, Gregory Ward, Elisa Sneed and I have studied the different ways speakers produce ‘given’ vs. ‘new’ information in “Effect of genre, speaker, and word class on the realization of given and new information”, INTERSPEECH 2006 and “Intonational overload: Uses of the H* !H* L- L% contour in read and spontaneous speech”, Laboratory Phonology 9.

Misrecognitions, Corrections, and Error Awareness

Diane Litman, Marc Swerts, and I have studied the prosodic consequences of recognition errors in spoken dialogue systems. We are studying whether prosodic features of user utterances can tell us (a) whether a speech recognition error has occurred, as a user reacts to it (e.g., System: "Did you say you want to go to Baltimore?" User: "NO!"), or (b) whether a user is in fact correcting such a recognition error (e.g., User: "I want to go to BOSTON!"). We have already found that prosodic features predict recognition errors directly with considerable accuracy in the TOOT train information corpus dialogues. Using machine learning techniques, we have found that, in combination with information already available to the recognizer, such as acoustic confidence scores, grammar, and the recognized string, prosodic information can distinguish misrecognized speaker turns far better than traditional methods for ASR rejection using acoustic confidence scores alone. See Julia Hirschberg, Diane Litman, and Marc Swerts, "Prosodic and Other Cues to Speech Recognition Failures," Speech Communication, 2004. We have also studied user corrections of system errors in the TOOT corpus, finding significant prosodic differences between corrections and non-corrections that can be used to predict, with some success, when a user is correcting the system. In addition, we find interesting and useful correlations between system strategies and types of user corrections, as well as evidence for which types of corrections are more successful (see "Corrections in Spoken Dialogue Systems," ICSLP-00; "Identifying User Corrections Automatically in Spoken Dialogue Systems," NAACL-01; and "Characterizing and Predicting Corrections in Spoken Dialogue," Computational Linguistics 32, 2006).

Predicting Prosodic Events

Intonational Variation in Synthetic Speech

Most of my early work on predicting intonational phrase boundaries and prominences was done in the Text-to-Speech synthesis group at Bell Labs. Some papers describing that work are Philipp Koehn, Steven Abney, Julia Hirschberg, and Michael Collins, "Improving Intonational Phrasing with Syntactic Information," ICASSP-00; Julia Hirschberg and Pilar Prieto, "Training intonational phrasing rules automatically for English and Spanish Text-to-Speech," Speech Communication, 1996; Julia Hirschberg, "Pitch Accent in Context: Predicting Intonational Prominence from Text," Artificial Intelligence, 1993; and Michelle Wang and Julia Hirschberg, "Automatic Classification of Intonational Phrase Boundaries," Computer Speech and Language, 1992. These methods were used to assign intonational variation automatically in the Bell Labs Text-to-Speech System. I also collaborated on two projects in concept-to-speech generation (generating speech from an abstract representation of the concepts to be conveyed). One, with Shimei Pan and Kathy McKeown of Columbia University, sought to assign prosody appropriately for a multimodal medical application, MAGIC. Some results are documented in Shimei Pan, Kathy McKeown, and Julia Hirschberg, "Semantic Abnormality and its Realization in Spoken Language," Proceedings of Eurospeech 2001, Aalborg. The other, with Srinivas Bangalore, Owen Rambow, and Marilyn Walker (AT&T Labs -- Research), involved prosodic assignment in the DARPA Communicator travel information domain. Some results appear in Julia Hirschberg and Owen Rambow, "Learning Prosodic Features using a Tree Representation," Proceedings of Eurospeech 2001, Aalborg.

Detecting Prosodic Events

More recent work on prosodic event detection has been done with Andrew Rosenberg, who has developed new ways to combine energy-based features with other acoustic and lexical features to achieve very high detection accuracy. Papers documenting this work include "On the correlation between energy and pitch accent in read English speech," INTERSPEECH 2006, and "Detecting pitch accent using pitch-corrected energy-based predictors," INTERSPEECH 2007.

Audio Browsing and Retrieval

Work on our SCAN (Spoken Content-Based Audio Navigation) browsing and retrieval system is summarized in John Choi et al., "Spoken Content-Based Audio Navigation (SCAN)," ICPhS-99. This project combines ASR and IR technology to enable search of large audio databases, such as broadcast news archives or voicemail; it started life as 'AudioGrep'. Current collaborators include Steve Abney, Brian Amento, Michiel Bacchiani, Phil Isenhour, Diane Litman, Larry Stead, and Steve Whittaker. My particular interests lie in the use of acoustic information to segment audio (Julia Hirschberg and Christine Nakatani, "Acoustic Indicators of Topic Segmentation," ICSLP-98) and the study of how people browse and search audio databases such as broadcast news collections (Steve Whittaker et al., "SCAN: Designing and Evaluating User Interfaces to Support Retrieval from Speech Archives," SIGIR-99) and voicemail (Steve Whittaker, Julia Hirschberg, and Christine Nakatani, "Play it again: a study of the factors underlying speech browsing behavior," and Steve Whittaker, Julia Hirschberg, and Christine Nakatani, "All talk and all action: strategies for managing voicemail messages," both presented at CHI-98). We have also studied how differences in ASR accuracy (comparing 100%, 84%, 69%, and 50% accurate transcripts) affect users' ability to perform tasks, finding effects of transcript accuracy on time to solution, amount of speech played, likelihood of subjects abandoning the transcript, and various subjective measures. However, our results hold only when we collapse our four categories into two; i.e., there are no differences between perfect and 84% accurate transcripts, or between 69% and 50% accurate ones (Litza Stark, Steve Whittaker, and Julia Hirschberg, "ASR Satisficing: The effects of ASR accuracy on speech retrieval," ICSLP-00).
Currently, in a new voicemail application, SCANMail, now in friendly trial, we have ported SCAN technology to the voicemail domain: users are able to browse and retrieve their voicemail by content. See J. Hirschberg et al., "SCANMail: Browsing and Searching Speech Data by Content," and A. Rosenberg et al., "Caller Identification for the SCANMail Voicemail Browser" (both presented at Eurospeech 2001). Meredith Ringel and I have also worked on ranking voicemail messages by urgency and distinguishing personal from business messages using machine learning techniques ("Automated Message Prioritization: Making Voicemail Retrieval More Efficient," presented at CHI 2002).

Intonation and Discourse Structure

Some results of a long collaboration with Barbara Grosz and Christine Nakatani on the intonational correlates of discourse structure in read and spontaneous speech are described in "A Prosodic Analysis of Discourse Segments in Direction-Giving Monologues" (ACL-96). The BDC corpus (with ToBI labels) is available here. Results of earlier studies of read speech are described in "Some Intonational Characteristics of Discourse Structure" (a reformatted version of the ICSLP-92 paper).

Intonational Disambiguation

Empirical studies comparing the way native speakers of different languages employ intonational variation to disambiguate potentially ambiguous utterances are described in Julia Hirschberg and Cinzia Avesani, "The Role of Prosody in Disambiguating Potentially Ambiguous Utterances in English and Italian," ESCA Tutorial and Research Workshop on Intonation, Athens, 1997.

Disfluencies in Spontaneous Speech

Christine Nakatani and Julia Hirschberg, "A Corpus-based study of repair cues in spontaneous speech," JASA, 1994, describes studies of the acoustic/prosodic characteristics of self-repairs.

Labeling Conventions and Labeled Corpora

I have been an active participant in the development of the ToBI Labeling Standard for the prosodic labeling of Standard American English (see the ToBI conventions for a quick overview). This standard was developed by a number of researchers from industry and academia and has been extended to other dialects of English and to other languages, including Italian, German, Spanish, Japanese, and more. Interlabeler reliability ratings are quite good (see John Pitrelli, Mary Beckman, and Julia Hirschberg, "Evaluation of Prosodic Transcription Labeling Reliability in the ToBI Framework," Proceedings of the Third International Conference on Spoken Language Processing, Yokohama, September 1994, pp. 123-126), and tools and training materials, with pdf and html versions and Praat files, are available there. There is also a Wavesurfer version, and another Praat version with cardinal examples created by Agus Gravano, available from the Columbia ToBI site. The Boston Directions Corpus (with ToBI labels) is available here.

Julia Hirschberg
Professor, Computer Science

Columbia University
Department of Computer Science
1214 Amsterdam Avenue
M/C 0401
450 CS Building
New York, NY 10027

email: julia@cs.columbia.edu
phone: (212) 939-7114
