\documentstyle{icphsart}
\begin{document}
\bibliographystyle{IEEEbib}
\pagestyle{empty}

\title{Intonation and discourse processing}
\author{  \textbf{Jennifer J.~Venditti} and  \textbf{Julia Hirschberg} 
\\[0.5em]
	Columbia University\\
	{\tt \{jjv,julia\}@cs.columbia.edu}\\
          }

\maketitle

\begin{abstract}
This paper describes intonational cues to discourse structure, and the
role that intonation plays in spoken discourse processing. We begin by
discussing two main structures in discourse that one must consider
when doing research on discourse processing: {\it segmentation} and
{\it information status}. We then review a number of key studies from
the phonetics literature which have investigated the intonational
marking of these structures. Next, we discuss in detail the
psycholinguistic research to date which has examined the role that
intonation can play in facilitating or inhibiting the processing of
discourse in English and other related languages. We conclude by
outlining directions for future research in the area of
intonation-discourse processing.
\end{abstract}

%=================================================================
\vspace*{-.2cm}
\section{Introduction}
\label{sec:intro}

Intonation is an integral part of every spoken language utterance. It
can provide cues to the linguistic structure of the speaker's message,
her emotional state, or her communicative intent. Despite this wealth
of information available in the signal, surprisingly little is known
about how listeners might go about integrating this information into
their interpretation of an utterance. The goal of this special session
is to discuss the contributions of intonation to spoken language
processing. In their paper, Speer et al.\ \cite{speer-etal03} describe
how intonation can be used in the parsing of syntactic structure. Our
paper will concern the processing of another type of linguistic
structure, namely the discourse structure.\footnote{For issues
concerning the parsing of intonational structure itself, see Beckman
\cite{beck96}.  Readers are also referred to Cutler et al.'s
\cite{cutler-etal97} extensive review of prior literature on
intonation in spoken language understanding, including discourse
understanding.}

In order to talk about the role of intonation in discourse processing,
we must first clarify what we intend by the terms {\it discourse} and
{\it processing}. By {\it discourse}, we mean not only aspects of
linguistic structure above the sentence level, such as paragraph or
topic structure and dialog turns, but also the dynamic shifts in
information status, including salience, focus of attention, and the
{\it given/new} distinction.  In Section~\ref{sec:disc} we will
outline a model of discourse structure, proposed by Grosz \& Sidner
\cite{grosz-sid86}, which incorporates these two aspects into a
unified account of discourse structuring. The term {\it processing}
can mean a variety of things, from off-line comprehension of and
judgments about an utterance's structure and meaning, to the on-line
moment-by-moment interpretation of that utterance, to the implicit
`workload' that is associated with such interpretation. We will
attempt to address all of these aspects of processing in our review of
the literature. In Sections~\ref{sec:segment} and \ref{sec:salience}
we will discuss key studies which have investigated the
intonation-discourse interface. We will include selected works from
the vast phonetics literature on this topic, and integrate them with the
small but growing literature from the psycholinguistics community on
the role of intonation in discourse processing. We will conclude in
Section~\ref{sec:future} by suggesting a few areas that we find
particularly intriguing and fruitful for future research on intonation
and discourse processing.

%=================================================================
\vspace*{-.2cm}
\section{Discourse Structuring}
\label{sec:disc}

Research on the intonation-syntax interface has shown that intonation
can play an important role in cueing structures such as clause
boundary location, PP and relative clause attachment, and NP
bracketing (e.g.\ {\it old men and women}). In the discourse domain,
what structures are important in processing, where do the potential
ambiguities lie, and what intonational cues might disambiguate them?
In this section we will discuss two main aspects of
discourse structuring: (i) {\it segmentation} and {\it hierarchy}, and
(ii) {\it information status}, including salience, focus of attention,
and {\it given/new} information. While there are many theoretical
accounts of these phenomena, we will focus here on how these are
represented in the theory of discourse structuring and coherence
proposed by Grosz \& Sidner \cite{grosz-sid86}, which has been widely
used in both computational and experimental research (e.g.\
\cite{naka97DISS,gordon-etal93,dzmura-tanen98}).

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Segmentation}
\label{subsec:seg}

Most researchers now assume that a spoken (or written) discourse is
more than just a string of utterances, and that the individual utterances
of a discourse are grouped into higher-level units.  In order to
characterize these units, previous studies have used written cues such
as `paragraphs', or labelers' or researchers' intuitive notions of
topic structure in the data under analysis.  Often, the discourse
structure description is study-specific and cannot be generalized to
other data. What is generally lacking is an independently-motivated
theory of discourse structuring under which the structure of a
discourse can be empirically determined (by trained labelers) in a
reliable manner.

One such independent theory is the intention-based proposal by Grosz
\& Sidner \cite{grosz-sid86}. Under this proposal, utterances are
grouped into cohesive units known as {\it discourse segments} (DS),
which serve as the building blocks that make up the
discourse. Utterances grouped in a DS share a common property: they
all contribute to the overall {\it purpose} or {\it intention} that a
speaker has for producing that particular segment. The purposes of the
segments ({\it discourse segment purposes} or `DSPs') then contribute
to the overall purpose of the discourse (the {\it discourse purpose}
or `DP'). In other words, a speaker generally has a reason for
producing a discourse.  Individual utterances contribute to the DSPs
of the segments to which they belong, which in turn contribute to the
overall DP.  In addition, in this theory a DS is related to other DSs
in one of two ways: by {\it dominance} (a hierarchical relationship in
which the purpose of the dominated segment contributes to the purpose
of the dominating segment) or by {\it satisfaction-precedence} (a
linear relationship in which one DSP must be satisfied {\bf before}
another) \cite{grosz-sid86,naka-etal95}.

This theory of discourse organization has been put to the test by
asking human labelers to segment speech corpora using the segmentation
guidelines developed by Nakatani et al.\ \cite{naka-etal95}. Such studies have
demonstrated a high degree of inter-labeler reliability
\cite{grosz-hirsch92,hirsch-naka96,naka97DISS}. Research using this
method of segmentation has shown that DS boundaries may be marked by
linguistic means such as specific lexical items known as {\it cue
phrases} (e.g.\ {\it so}, {\it next}, {\it finally}, etc.)
\cite{hirsch-lit87}, or shifts in tense, but that such cues may not
always be present. In Section~\ref{sec:segment} we will summarize
recent research investigating intonational cues to such
intention-based segmentation.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Information Status}
\label{subsec:sal}

As with discourse segmentation, numerous theoretical constructs have
been proposed in the literature to capture the notion that discourse
entities change their status over the course of a discourse, from new
information to old (or {\it given}) information, from focus to
background.  These changes in information status are closely related
to the {\it accessibility} of individual entities as they are referred
to by discourse participants. The notions of {\it given} vs.\ {\it
new} (see e.g.\
\cite{halliday67a,Chafe74,chafe76,clark-havi77,prince81}), whereby
entities newly introduced into the discourse are considered {\it new},
and those already in the discourse context are {\it given}, while
widely invoked in both theoretical and psycholinguistic research, are
notoriously difficult to define. What does it mean for a {\it given}
entity to be `already in the discourse context'? A number of solutions
to this problem have been proposed, including Halliday's notion of
`recoverability' \cite{halliday67a}, Chafe's definition with respect
to the listener's `consciousness' \cite{Chafe74,chafe76}, and Prince's
multi-dimensional taxonomy \cite{prince81,prince92}. Here, we choose
to focus on how {\it given/new} might be represented in Grosz \&
Sidner's theory of intention-based discourse structuring.

As mentioned in the previous section, a discourse is composed of a
number of discourse segments (DS), each with its own purpose.  This
linguistic structure interfaces with another important structure,
namely the {\it attentional state} of the discourse. According to
Grosz \& Sidner's proposal, the onset of each DS opens up a new {\it
global focus space} in the ever-evolving record of speaker and
hearer's attentional state, to which discourse entities may be added
as they are referred to. For example, if a speaker utters {\it Now I
will build a house}\footnote{See discussion of house-building
experiments by Terken \cite{terk84} and Swerts \& Geluykens
\cite{swerts-geluy94} in Sections~\ref{sec:segment} and
\ref{subsec:prod}, respectively.}, the cue word {\it now} signals that
a new DS has begun, and a corresponding focus space is added to the
representation of the attentional state, into which the discourse
entity representing {\it house} will be inserted. Under this approach,
when an entity such as {\it house} is first added to the discourse's focus
space, we might say that it is also considered {\it new}
information. Once introduced, a now {\it given} entity may remain {\it
salient} and {\it accessible} in the discourse, or it may lose its
salience and accessibility as the discourse proceeds
\cite{grosz-sid86,naka97DISS,ven00}. Thus, we may view Grosz \&
Sidner's notion of {\it attentional focus} as modeling two kinds of
information status: both the {\it given/new} distinction and the
notion of {\it salience} or {\it accessibility}.  In this way, {\it
given} information is no longer defined as `mentioned within the last
{\it N} utterances', but rather is directly related to an entity's
status with respect to the dynamic representation of attentional
state.  In Section~\ref{sec:salience} we will describe how this
approach to modeling information status can account for the
distribution of intonational prominences in spoken discourses.

%=================================================================
\vspace*{-.3cm}
\section{Intonation in Discourse Segmentation}
\label{sec:segment}

There is considerable evidence in the literature that intonational
features can signal the structuring of utterances into larger
discourse segments.\footnote{In this paper we will focus on the use of
fundamental frequency (F0) in cueing discourse structure. Readers are
referred to the literature for descriptions of other acoustic cues
that may be used. Also, our focus will primarily be on English and
languages with similar intonational systems (e.g.\ Dutch). Studies of
the intonation-discourse interface in other languages (e.g.\ Japanese,
see \cite{ven00,ven-swerts96}) have reported similar findings, though the
intonational means used to cue discourse structures are different.} In
an early study, Lehiste used written `paragraphs' as the discourse
unit of interest \cite{lehiste75}. She found that English utterances
with high F0 peaks are perceived by listeners as being paragraph
initial.  Paragraph-medial and final utterances tend to have lower F0
peaks. In another study, Silverman found that manipulating the pitch
range of intonation phrases in English using resynthesis can cause
listeners to segment discourses with ambiguous structures differently:
phrases with an expanded pitch range are likely to be judged as
paragraph initial, while final lowering can cue paragraph finality
\cite{silver87}.

Other studies have defined discourse units in terms of `topic'
structure: stretches of speech in which the speaker is mainly
discussing a single entity. Yule suggested that intonation can be used
to mark the boundaries of topic units in English spontaneous speech:
a structure which he termed the `paratone' \cite{yule80}. Swerts \&
Geluykens examined topic units in Dutch, defined in their
house-building task as a stretch of speech in which a specific piece
of the house is being described \cite{swerts-geluy94}. They also found that
F0 is high at the beginning of such units, and gradually declines to
the unit end.

In task-oriented discourses and similar genres, the notion of `topic
units' may be equated to the intention-based discourse segments
defined by Grosz \& Sidner \cite{grosz-sid86,naka-etal95}. For
example, in Swerts \& Geluykens's \cite{swerts-geluy94} house-building task, the topic
unit which describes the construction of the {\it front door} is
likely to be the same as the intention-based DSP `Tell the listener
how to construct the front door' in the Grosz \& Sidner framework.  An
advantage of Grosz \& Sidner's intention-based discourse segmentation
scheme is that it can describe other discourse genres as well. The
relation of intonation to discourse segmentation within this framework
has been studied extensively by Hirschberg and colleagues (e.g.\
\cite{HP86,grosz-hirsch92,hirsch-naka96,naka97DISS}), in various
discourse genres. Hirschberg and colleagues found that increased F0 values (both
maximum and mean F0) are characteristic of (intermediate) prosodic
phrases which labelers agree to be discourse segment initial, relative
to other phrases in the database. Likewise, lower F0 values relate to
DS-medial and DS-final judgments
\cite{grosz-hirsch92,hirsch-naka96}.\footnote{These studies documented
a number of other acoustic-prosodic features which are also reliably
related to DS position, such as amplitude, speaking rate, and pause
durations.}

Given that a number of phonetic studies, using a variety of
theoretical constructs, have found that discourse segmentation can be
cued by intonational means, the question of most relevance to
psycholinguists is to what extent this segmentation can occur on-line,
as the listener processes the incoming spoken discourse. Finding that,
for example, DS-initial phrases are uttered with a higher F0 in
comparison with other phrases can be a tremendous aid in speech
synthesis, where such phrases can be systematically distinguished from
an otherwise `default' F0 topline (see e.g.\ \cite{HP86}). However,
such a finding does not necessarily shed light on whether listeners
(or real-time automatic recognition systems) can use this same
intonational information in on-line discourse segmentation. In order
to address this concern, Hirschberg \& Nakatani examined the {\it
relative change} in F0 (and other parameters) over a local window of
two consecutive phrases \cite{hirsch-naka96}. They found that the
previously-reported effects of DS position are true even at this more
local level. That is, there is a significant increase in F0 change
from one phrase to the next when the second phrase is DS-initial, in
comparison to when that phrase is medial or final. In addition,
DS-medial phrases are marked by an increased F0 change in comparison
with final phrases. These findings suggest that discourse segmentation
could in many cases be accomplished on-line as the discourse unfolds,
by examining the local change in overall phrasal pitch range from one
intonation phrase to the next.
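
One way to make this local measure concrete is sketched below. The
notation is introduced here purely for illustration and is an
assumption on our part; in particular, whether the change is best
expressed as a difference or as a ratio is left open. Let $f_i$ be a
summary F0 measure (e.g.\ the maximum or mean F0) of the $i$-th phrase
in a discourse, and let the local change be
\[
\Delta_i = f_i - f_{i-1}.
\]
The findings summarized above then correspond, on average, to the
ordering
\[
\Delta_i^{\rm initial} > \Delta_i^{\rm medial} > \Delta_i^{\rm final},
\]
where the superscript gives the DS position of phrase $i$.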

%=================================================================
\vspace*{-.2cm}
\section{Intonation and Information Status}
\label{sec:salience}

There has also been considerable research on the role intonation can
play in cueing information status. Speakers can indicate the salience
or accessibility of a discourse referent by varying the intonational
prominence of referring expressions. This is accomplished by {\it
pitch accents} in languages such as English and Dutch, though other
languages may use different means. (For example, see research by Kang
\cite{kang96} on Korean and Venditti and colleagues
\cite{ven00,ven-swerts96} on Japanese suggesting that local {\it pitch
range} and/or {\it phrasing} variations can cue information status in
these languages.)

\subsection{Intonational Marking}
\label{subsec:prod}

The {\it given/new} distinction often cited in the discourse
literature was defined by Halliday directly in terms of the speaker's
choice of intonational form \cite{halliday67a}. For Halliday, {\it
new} information is focal information which ``the speaker presents
...\ as not being recoverable from the preceding discourse''
(regardless of whether or not it had been mentioned before)
\cite[p.~204]{halliday67a}, and is marked in English by a `tonic' or
`nuclear' pitch accent. While Halliday's claim is that the {\it
given/new} distinction is defined solely by the speaker's choice of
intonational grouping and prominence, subsequent studies have
attempted to relate the intonational phenomena to independent
text-based characterizations of {\it given} vs.\ {\it new}
information. For example, Brown \cite{brown83} used Prince's
\cite{prince81} taxonomy of discourse givenness to describe variations
in intonational prominence in English task-oriented speech. She found
that speakers tend to place pitch accents on {\it new} information,
while marking {\it given} information by deaccenting. However, Brown
also points out an instance in her data in which a {\it given} entity
is re-introduced into the discourse after some digressions, and is
marked by a pitch accent.\footnote{The accenting of re-introduced
entities has also been observed by Hirschberg \cite{hirsch93}.} Using
only Prince's taxonomy of givenness, along with a direct mapping of
these categories to intonational prominence markings, Brown cannot
account for such accenting of re-introduced entities. However,
approaches using the notion of a cache/buffer of a fixed number of
utterances may be able to capture such phenomena. That is, the entity
is no longer salient if the number of utterances defined by the cache
size has intervened. But what cache size is appropriate?  And is the
same size appropriate for all discourse
situations?\footnote{Cf.\ Cahn's \cite{Cahn98} recent work on
memory-based salience.}

The use of topic-based discourse segmentation is one way to better
define what it means for an entity to be {\it given}. Terken
\cite{terk84} examined accent distribution in Dutch house-building
monologues, using a topic unit defined as a stretch of speech in which
a specific piece of the house is being described (see also Swerts \&
Geluykens \cite{swerts-geluy94} mentioned above). He found that both
topics and non-topics are newly introduced using accented full NPs
(97\% and 81\%, respectively). This is consistent with the accenting
of {\it new} entities. However, Terken observed that the realization
of later mentions (within the topic unit) depends on the topic status
of the entity: topics are mainly realized by unaccented pronouns
(51\%), but accented and unaccented full NPs are also found (33\% and
5\%, respectively). Later mentions of non-topics, on the other hand,
are primarily realized by accented full forms (74\%), though
unaccented full forms exist as well (18\%). These results suggest that
while there is a general relationship between {\it given/new} (as
defined by topic unit segmentation) and pitch accenting, there are
additional factors which also affect accent distribution. We will
return to this issue below.

Hirschberg \& Pierrehumbert \cite{HP86} suggest that the notions of
{\it given} and {\it new}, and their relation to pitch accenting, can
be explained by a model of global attentional salience such as that
proposed by Grosz \& Sidner \cite{grosz-sid86}. Working within this
framework, Nakatani \cite{naka97DISS,naka97PISS} observed that
entities which are first introduced into the current global focus
space (which models the current intention-based DS) tend to be
realized with accented referring expressions, while those entities
already existing in the space (and hence globally salient) tend to be
realized with unaccented expressions. Nakatani also notes that
``references to entities that are either in a neighboring focus space
on the focus stack, or in the most recently popped focus space, [also]
do not require accentual prominence'' \cite[p.~149]{naka97PISS}.

To the extent that intention-based discourse segmentation may in many
cases correspond closely to the topic-based segmentation in Terken's
house-building discourses, Terken's results can be directly
interpreted in terms of this new approach. In addition, Nakatani's
observations using Grosz \& Sidner's model can account for two of
Terken's `exceptions': reference to the entity {\it house} using a
non-prominent expression, and deaccenting of some referents when the
antecedent is in the previous topic unit. In the first case, Terken
notes that ``expressions referring to the house itself are often
deaccented, even though the house has not been mentioned over long
stretches of discourse'' \cite[p.~280]{terk84}. One possible
explanation for this is that the entity {\it house} could reside in
the global focus space in the representation of the discourse's
attentional state, for example, due to its mention in a superordinate
DS whose purpose is to {\it Explain how to assemble the house}. If
this is the case in Terken's data, then Grosz \& Sidner's model of
global focus would characterize this entity as being in non-immediate
global focus, and this would license the use of an unaccented
expression to refer to the {\it house} in subsequent embedded
segments. In the second case of `exceptions', Terken observes
deaccenting of some referents across topic unit boundaries. Although
Terken does not describe these exceptions in full detail, it seems
that they could be an instance of the same phenomenon observed by
Nakatani \cite{naka97DISS} (and independently by Grosz \& Sidner
\cite{grosz-sid86} and Davis \& Hirschberg \cite{davis-hirsch88}) ---
namely, that an entity in a just-completed (or `popped') DS can
still be salient and thus does not need to be accented when mentioned in
the next sister DS. This model of salience and accessibility could
also explain the apparent `exception' noted independently by Brown
\cite{brown83} and by Hirschberg \cite{hirsch93}, that {\it given}
entities which are re-introduced into the discourse are marked by
pitch accents. If the previous mention of the entity occurred in a
non-adjacent and non-embedding DS, this would warrant re-introduction
using a pitch accent under this account. Thus, Grosz \& Sidner's
dynamic model of global focus driven by intention-based discourse
structuring can provide a rich architecture in which to examine
patterns of accentuation in both naturally-occurring and experimental
data.
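
The account just described can be summarized schematically. The
notation below is introduced here only as an illustrative sketch; it
is an assumption on our part and is not part of Grosz \& Sidner's or
Nakatani's own formulations. Let $FS_1, \ldots, FS_n$ be the focus
spaces currently on the focus stack, where $FS_n$ is the space of the
current DS and $FS_1, \ldots, FS_{n-1}$ are the spaces of the segments
that embed it, and let $FS_{\rm pop}$ be the most recently popped
space. The set of entities treated as salient is then
\[
A = FS_1 \cup \cdots \cup FS_n \cup FS_{\rm pop}.
\]
Under this sketch, a referring expression for an entity $e$ tends to
be unaccented when $e \in A$, and accented when $e \notin A$, as in
the case of an entity whose previous mention occurred in a
non-adjacent and non-embedding DS. These are tendencies rather than
categorical rules.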

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Processing}
\label{subsec:proc}

While information status has been shown in a number of studies to
strongly influence pitch accent distribution in languages such as
English and Dutch, to understand the role of intonation in spoken
discourse processing, we must also investigate whether listeners are
in fact sensitive to such markings. In this section we discuss studies
which suggest that accentuation does indeed play a role in processing.

An early study by Most \& Saltz \cite{most-saltz79} asked listeners to
choose which of two {\it wh}-questions an intoned target answer would
be an appropriate reply to. They found that listeners' choice of
matching questions was related to the accentuation in the target
answer. For example, an answer such as {\it The MECHANIC fixed the
car} was taken to be the answer to {\it Who fixed the car?}, rather
than to {\it What did the mechanic fix?}. Birch \& Clifton
\cite{birch-clift95} also examined the effect of accentuation in
processing question-answer pairs. They used both `makes sense'
judgments (i.e.\ listeners provided speeded judgments of whether an
answer made sense given the question) and prosodic appropriateness
ratings of answers with varied accentuation patterns. They found that
answers in which {\it new} information was accented and {\it given}
information deaccented were not only rated as more appropriate by
listeners, but were understood more quickly in speeded judgments.

In another study, Bock \& Mazzella \cite{bock-mazz83} used a
comprehension time paradigm to determine the effect of appropriate
accentuation on processing of so-called denial-counterassertion pairs
such as {\it Arnold didn't fix the radio. Doris fixed the radio}. They
found that comprehension times of the target counterassertions were
shorter when focal (i.e.\ {\it new}) information was accented than
when non-focal information was accented, suggesting that appropriate
accentuation facilitates comprehension in these utterances. More
recently, Davidson \cite{davidson01} used a phoneme-monitoring
paradigm to demonstrate that listeners use accentuation patterns in
denials to direct attention to alternatives presented in the
counterassertion, consistent with Bock \& Mazzella's findings.

% [ADD HERE SOMEWHERE IN MORE DETAIL -- RIGHT NOW IN FOOTNOTE]
% Nooteboom \& Kruyt \cite{noote-kruyt87}: accenting of given 
% info is more acceptable than deaccenting of new. 
% Also Nooteboom \& Terken %\cite{noote-terk82}

Terken \& Nooteboom \cite{terk-noote87} have also demonstrated that
listeners expect {\it new} information to be pitch accented and {\it
given} information to be deaccented (see also
\cite{noote-kruyt87}). Inappropriate accentuation on target words
slowed verification latencies in these experiments: that is, accenting
{\it given} information slowed reaction times, as did deaccenting {\it
new} information. In their experiments, Terken \& Nooteboom defined
{\it given} information as an NP whose antecedent was in the same
grammatical role either (a) in the immediately preceding utterance only,
or (b) in a number of preceding utterances. The effect of accentuation
on the processing of {\it given} information was the same for either
definition of {\it given}. Under the model of attentional focus
described in Sections~\ref{subsec:sal} and \ref{subsec:prod} above,
both definitions would predict that the {\it given} NP is in global
focus, consistent with the findings of production studies using this
framework.

Many of the processing studies examining the relationship between
accentuation and information status have used experimental paradigms
that probe so-called `off-line' comprehension. A very recent study by
Dahan et al.\ \cite{dahan-etal02} suggests that accentuation also
affects referential interpretation at very early stages of processing,
even before the entire target word has been heard. Dahan et al.\ used
eye-tracking to monitor listeners' fixations on pictured entities as
they heard simple pre-recorded instructions to manipulate the entities
on a computer screen, as in {\it Put the candle/candy below the
triangle. Now put the CANDLE/candle above the square}. Eye-tracking
has become a popular methodology for investigating real-time spoken
language processing, since it allows experimenters to monitor (visual)
attention to referents without interrupting the speech stream (unlike
the gating paradigm, for example), and because eye movements have been
found to be closely time-locked to the auditory input in such tasks
(see \cite{tanen-etal95} for a brief introduction to the eye-tracking
paradigm). In Dahan et al.'s study, the visual scene contained various
objects, two of which shared the same primary-stressed first syllable
(e.g.\ {\it candle} and {\it candy}). The first part of the auditory
instruction introduced either the {\it candle} or the {\it candy} into
the discourse context (see example above), establishing it as {\it
given}. The second part of the instruction then referred to the {\it
CANDLE/candle} using either an intonationally prominent or
non-prominent surface form.\footnote{In Dahan et al.'s stimuli,
prominent expressions were marked by H* or L+H* pitch accents, and
non-prominent (or `deaccented') expressions were marked by downstepped
H+!H* (and not by total deaccenting). Pierrehumbert \& Hirschberg
\cite{PH90} and Ayers \cite{ayers96} have also observed that
downstepped accents may sometimes be functionally similar to
deaccenting.} Note that the target noun in the second instruction is
(crucially) temporarily ambiguous during the first syllable [k\ae n],
and thus both {\it candle} and {\it candy} are potential referents at
this stage. Dahan et al.'s results showed clear effects of
accentuation on reference resolution. They found that, while listeners
eventually did fixate on the target noun (i.e.\ {\it candle}), which
was uniquely identifiable in its full form, fixations on the
competitor (i.e.\ {\it candy}) differed significantly depending on
accentuation: when {\it candle} had been mentioned in the first
instruction, there were more fixations on {\it candy} when [k\ae n]
was accented. Likewise, when {\it candy} was previously mentioned,
there were more fixations on {\it candy} when [k\ae n] was
deaccented. That is, listeners took accented [k\ae n] to refer to {\it
new} information, and deaccented [k\ae n] to refer to {\it given}
information, even at the earliest stages of lexical access. This
confirms that accentuation can indeed be reliably used by listeners to
process discourse representations, both in global (off-line)
comprehension and on-line, as a discourse is unfolding.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Property-sharing Constraints}

A discussion of the relation between information status and
accentuation would not be complete without mention of one factor which
has come to the forefront in recent studies: the fact that {\it
given/new} interacts with {\it property-sharing constraints} in the
distribution of accents in discourse. In many previous studies, this
factor has either been overlooked or implicitly controlled for (e.g.\
by placing target and antecedent in the same grammatical role). In a
few studies, this factor has been systematically varied, with
revealing results.

Terken \& Hirschberg \cite{terk-hirsch94} examined the distribution of
accents in elicited spontaneous descriptions, and found that prior
mention (even in the immediately preceding utterance within the same
discourse segment) is not a sufficient predictor of deaccenting. The
target and antecedent must in addition share the same grammatical role
to warrant deaccenting (see also \cite{noote-terk82}). This importance
of property-sharing (here, grammatical role) was also demonstrated by
one of the experiments reported by Dahan et al.\
\cite{dahan-etal02}. Using the same eye-tracking paradigm and
experimental task as their study reported above, they examined
instruction sequences in which the target and antecedent did not share
the same grammatical/thematic role, as in {\it Put the necklace below
the candle. Now put the CANDLE above the square}. Analysis of eye
fixations revealed that there were no competitor effects in this
condition, in contrast to the effects observed in the conditions in
which the target and
antecedent shared the same grammatical role. In other words, the
accented [k\ae n] was immediately interpreted as referring {\sc not}
to {\it new} information (which would have led to fixations on the
competitor {\it candy}), but to {\it given} information which was
realized in a {\bf different} (non-focused) grammatical position.

These and other studies show that the distribution of accents in
discourse depends on more than just the {\it given/new}
distinction. Not only are the notions of {\it given/new} notoriously
tricky to define, but even in the clearest cases of {\it given}
referring to information mentioned in the immediately previous
utterance and {\it new} referring to unmentioned information, there
are other constraints such as sharing of grammatical/thematic role or
surface position that complicate matters. Another factor which Terken
\& Hirschberg briefly touch upon but reserve for future research is
the possibility that this observed persistence of grammatical role
``may arise only due to the syntactic parallelism of successive
utterances in [their] context and target utterances''
\cite[p.~142]{terk-hirsch94} --- perhaps a more restrictive notion of
`property-sharing'.

In a series of recent eye-tracking studies, Venditti et al.\
\cite{ven-etal02,ven-etal03} demonstrated that syntactic parallelism
has a significant effect on the interpretation of (ambiguous)
nuclear-accented pronouns. They found that while accented pronouns
serve to shift reference in parallel constructions, such as in the now
infamous {\it John hit Bill and then HE hit George}, listeners had
difficulty interpreting accented pronouns in non-parallel sequences
(e.g.\ {\it John hit Bill and then HE ran away}). Moreover, even in
parallel constructions, a significant preference for switched reference
(as indicated by eye fixations) emerged only after listeners had heard
the (identical) verb, which provided strong evidence for syntactic
parallelism. Since the auditory stimuli in Dahan et al.'s
\cite{dahan-etal02} study (see examples above), and the spontaneous
productions in Terken \& Hirschberg's \cite{terk-hirsch94} study
(e.g.\ {\it The ball touches the diamond. The ball touches the star.})
involved not only sharing of grammatical role but also syntactic
parallelism, more research is needed to clarify which of these factors
(or whether both) is responsible for the observed patterns and effects of
accentuation, as Terken \& Hirschberg point out.

%=================================================================
\vspace*{-.2cm}
\section{Future Directions}
\label{sec:future}

In this paper, we have summarized the current state of knowledge of
the intonation-discourse interface, and the role intonation can play
in discourse processing. There is much more work to be done. In this
section we outline a number of intriguing directions open for future
research in this area.

{\bf Processing in dynamic models of attention and discourse
salience.} Much of the discourse processing literature has focused on
the {\it given/new} distinction as defined by adjacent utterance
pairs. We have briefly described Grosz \& Sidner's \cite{grosz-sid86}
model of intentional structuring and attentional focus which has
been used in a number of production studies examining the
intonation-discourse interface. Such a model will allow future studies
to investigate the more dynamic aspects of processing across extended
stretches of discourse.

{\bf Property-sharing, parallelism, and other constraints on
accentuation.} Although a number of studies have observed that {\it
given} information often does bear intonational prominence, the extent
to which factors such as property-sharing or parallelism (however
defined) can explain accent distribution has yet to be fully
investigated in either production or processing studies. Other factors
that must also be addressed with respect to this issue include the
asymmetry between nuclear and pre-nuclear accents in marking {\it
given} information, and the functional similarity of downstepped
accents and deaccenting (see e.g.\ \cite{dahan-etal02,PH90,ayers96}).

{\bf The time course of integration of intonational information.} Most
of the intonation-discourse processing studies to date have involved
off-line comprehension or appropriateness judgments. Notable
exceptions are Dahan et al.'s \cite{dahan-etal02} and Venditti et
al.'s \cite{ven-etal02,ven-etal03} experiments using eye-tracking,
which were able to probe on-line integration of intonational
information as a discourse unfolds in real time. More studies are
needed to investigate the exact time course of such information
integration. A number of previous studies have suggested that
intonational information occurring even prior to the event of interest
can be used by listeners in processing. For example, Cutler
\cite{Cutler76} showed that an intonation contour leading up to a
target word which was consistent with that word being accented
(although the word itself was a neutral version spliced in) resulted
in a phoneme-monitoring response time advantage. Terken \&
Nooteboom \cite{terk-noote87}, in their study described above, suggest
that an effect of grammatical role on observed reaction times might be
due to facilitation by the preceding intonation contour. Since their
predicate target NPs were all preceded by a falling pitch movement
(which in Dutch typically signals that the remaining portion of the
utterance contains non-focal information), they suggest that listeners
were likely able to identify the predicate NP as deaccented (thus {\it
given}) {\bf before} the NP itself was even uttered. These and other
studies (see e.g.\ Bock \& Mazzella's \cite{bock-mazz83} and
Davidson's \cite{davidson01} discussions of accentuation facilitating
comprehension of subsequent information in denials) underscore the
importance of more research on the time course of integration of
anticipatory and other intonational cues in real-time discourse
processing.

{\bf Cross-linguistic perspectives on intonation and discourse
processing.} Our discussion here has focused primarily on the role of
intonation in processing English and Dutch discourses. Research shows
that in these languages, variations in pitch range can aid in
discourse segmentation, and pitch accent distribution has a strong
influence on the processing of information status. Much work needs to
be done to investigate how these and other discourse structures are
cued in languages which do not have pitch accent systems like those of
English and Dutch. For example, Kang \cite{kang96} has observed that
speakers can use pitch range and accentual phrasing (among other
acoustic features) to mark information status in Korean
discourses. Venditti and colleagues \cite{ven-swerts96,ven00} also
found that systematic variation in pitch range can mark
intention-based structuring, information status, and topic transitions
in Japanese. The next step is to extend these production results to
studies of spoken discourse processing in a range of languages with
varied intonation systems.

{\bf The role of ambiguity `awareness' and experimental design.} The
many experimental studies reviewed here have demonstrated that
intonation can play a major role in discourse processing, both in
discourse segmentation and in the processing of information
status. However, most of these studies have involved highly structured
laboratory experiments or have examined the speech of trained
speakers --- two factors which may have inadvertently inflated the
influence of intonation. In their paper in this special session, Speer
et al.\ \cite{speer-etal03} describe recent studies on syntactic
processing which have questioned the extent to which intonational cues
are reliably produced or used by na\"{\i}ve speakers/listeners. For
example, Allbritton et al.\ found that speakers could only reliably
produce prosodic cues to disambiguate certain syntactic structures
when the ambiguity was pointed out \cite{allbrit-etal96}. Snedeker \&
Trueswell also found that speakers only provided reliable intonational
cues when they were aware of the ambiguity, but did not produce
disambiguating cues when the two competing structures were manipulated
in a between-subjects design (that is, a given subject only
encountered one version of the structure) \cite{sned-trues03}. In
contrast, Speer et al.\ report that na\"{\i}ve subjects could indeed
produce reliable cues in their experiment, even when the structure was
unambiguous \cite{speer-etal03}. In the intonation-discourse domain,
the effects of speaker/listener `awareness' or experimental design
(e.g.\ using between- vs.\ within-subjects designs, including
sufficient distractor trials, etc.) on the production/perception of
intonational cues have yet to be formally examined.\footnote{Much work
{\bf has} been done on intonational correlates of discourse in large
speech corpora of untrained speakers (e.g.\
\cite{Shriberg98,naka97DISS}), but this research has in general
focused on the prediction of intonational features from text rather
than on discourse processing per se.}  Clearly then, more research is
needed to determine the role of intonation in the processing of
naturally occurring discourse by na\"{\i}ve speakers and listeners,
using experimental designs which do not highlight potential
ambiguities.

%=================================================================
\bibliography{/home/jennifer/jenrefs}

\end{document}

