Lecture 13: Corpus Linguistics

From Knowledge-Based to Corpus-Based Linguistics

A paradigm shift begins in the 1980s
Seeds planted in the 1950s (Harris, Firth)
Cut off by Chomsky
Renewal due to
Interest in practical applications (ASR, MT, ...)
Availability at major industrial labs of powerful machines and large amounts of storage
Increasing availability of large online texts and speech data
Crossover efforts with the ASR community, fostered by DARPA
For many practical tasks, statistical methods perform better
Less knowledge required by researchers
Next Word Prediction

An ostensibly artificial task: predicting the next word in a sequence.
From a NY Times story...
Stocks plunged this ...
Stocks plunged this morning, despite a cut in interest rates
Stocks plunged this morning, despite a cut in interest rates by the Federal Reserve, as Wall ...
Stocks plunged this morning, despite a cut in interest rates by the Federal Reserve, as Wall Street began
Stocks plunged this morning, despite a cut in interest rates by the Federal Reserve, as Wall Street began trading for the first time since last ...
Stocks plunged this morning, despite a cut in interest rates by the Federal Reserve, as Wall Street began trading for the first time since last Tuesday's terrorist attacks.

Human Word Prediction

Clearly, at least some of us have the ability to predict future words in an utterance.
How?
Domain knowledge
Syntactic knowledge
Lexical knowledge
Claim

A useful part of the knowledge needed to allow word prediction (guessing the next word) can be captured using simple statistical techniques.
In particular, we'll rely on the notion of the probability of a sequence (e.g., a sentence) and the likelihood of words co-occurring
Why would we want to do this?

Why would anyone want to predict a word?
If you can predict the next word, you can rank the likelihood of sequences containing various alternative words, i.e., alternative hypotheses
You can assess the likelihood/goodness of a hypothesis
Many NLP problems can be modeled as mapping from one string of symbols to another.
In statistical language applications, knowledge of the source (e.g., a statistical model of word sequences) is referred to as a Language Model or a Grammar
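As a sketch, ranking alternative hypotheses with a language model might look like this. The "model" here is a made-up table of sequence probabilities, purely illustrative:

```python
# A toy sketch of ranking alternative hypotheses with a language model.
# The "model" is a made-up table of sequence probabilities, not
# estimates from any corpus.

def score(hypothesis, model):
    """Return the model's probability for a candidate word sequence."""
    return model.get(hypothesis, 0.0)

toy_model = {
    "I have a gun": 1e-6,   # plausible sequence
    "I have a gum": 1e-7,   # less plausible
    "I have a gub": 0.0,    # a sequence the model assigns no probability
}

# Rank the alternative hypotheses, most likely first.
ranked = sorted(toy_model, key=lambda h: score(h, toy_model), reverse=True)
print(ranked[0])  # -> I have a gun
```

A real system would compute these scores from n-gram statistics rather than look them up in a hand-built table.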
Why is this useful?

Example applications that employ language models:
Speech recognition
Handwriting recognition
Spelling correction
Machine translation systems
Optical character recognizers

Real-Word Spelling Errors

They are leaving in about fifteen minuets to go to her house.
The study was conducted mainly be John Black.
The design an construction of the system will take more than a year.
Hopefully, all with continue smoothly in my absence.
Can they lave him my messages?
I need to notified the bank of...
He is trying to fine out.
Handwriting Recognition

Assume a note is given to a bank teller, which the teller reads as "I have a gub." (cf. Woody Allen)
NLP to the rescue...
gub is not a word
gun, gum, Gus, and gull are words, but gun has a higher probability in the context of a bank

For Spell Checkers

Collect a list of commonly substituted words
piece/peace, whether/weather, their/there ...
Whenever you encounter one of these words in a sentence, construct the alternative sentence as well
Assess the goodness of each and choose the word that yields the more likely sentence
E.g.
On Tuesday, the whether
On Tuesday, the weather

The Noisy Channel Model

A probabilistic model developed by Claude Shannon to model communication (e.g., over a phone line)
Noisy channel: an input I passes through the channel and comes out as an output O

I* = argmax_I Pr(I|O) = argmax_I Pr(I) Pr(O|I)

I*: the most likely input
Pr(I): the prior probability of the input
Pr(I|O): the probability of I given O
Pr(O|I): the probability that O is the output if I is the input

Review: Basic Probability

Prior Probability (or unconditional probability)
P(A), where A is some event
Possible events: it raining, the next person you see being Scandinavian, a child getting the measles, the word warlord occurring in the newspaper
Conditional Probability
P(A | B)
the probability of A, given that we know B
E.g., it raining, given that we know it's October; the next person you see being Scandinavian, given that you're in Sweden; the word warlord occurring in a story about Afghanistan
Example

F F F F F F I I I I
P(Finn) = .6
P(skier) = .5
P(skier|Finn) = .67
P(Finn|skier) = .8
Relating Conditionals and Priors

P(A | B) = P(A ^ B) / P(B)
Or, P(A ^ B) = P(A | B) P(B)
Bayes' Theorem lets us calculate P(B|A) in terms of P(A|B), e.g. P(to|want) via P(want|to)
P(B|A) = P(B ^ A)/P(A) = P(A|B)P(B)/P(A)
I.e., we can calculate the probability of the next word in a sequence from unigram and bigram probabilities we've seen
P(to|want) = P(to ^ want)/P(want) = P(want|to)P(to)/P(want)

Example

F F F F F F I I I I
P(Finn) = .6
P(skier) = .5
P(skier ^ Finn) = .4
P(skier|Finn) = .67
P(Finn|skier) = .8
P(skier|Finn) = P(skier ^ Finn)/P(Finn) = .4/.6 = .67
P(Finn|skier) = P(Finn ^ skier)/P(skier) = .4/.5 = .8
P(Finn|skier) = P(skier|Finn) P(Finn)/P(skier) = (.67 * .6)/.5 = .8
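These numbers can be checked by direct counting. A population of ten people with 6 Finns, 5 skiers, and 4 who are both reproduces P(Finn) = .6, P(skier) = .5, and P(skier ^ Finn) = .4:

```python
# Checking the Finn/skier numbers by direct counting: 10 people,
# 6 Finns, 5 skiers, 4 of whom are both.
n_total, n_finn, n_skier, n_both = 10, 6, 5, 4

p_finn = n_finn / n_total              # .6
p_skier = n_skier / n_total            # .5
p_both = n_both / n_total              # .4

p_skier_given_finn = p_both / p_finn   # .4/.6 = .67
p_finn_given_skier = p_both / p_skier  # .4/.5 = .8

# Bayes' Theorem recovers P(Finn|skier) from P(skier|Finn):
bayes = p_skier_given_finn * p_finn / p_skier
print(round(p_skier_given_finn, 2), round(p_finn_given_skier, 2), round(bayes, 2))
# -> 0.67 0.8 0.8
```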
Chain Rule

The Chain Rule generalizes
P(A ^ B) = P(A) P(B|A)
to multiple events: P(A1 ^ ... ^ An) = P(A1) P(A2|A1) P(A3|A1 ^ A2) ... P(An|A1 ^ ... ^ An-1)
Or...
P(the ^ dog) = P(the) P(dog | the)
P(the ^ dog ^ bites) = P(the) P(dog | the) P(bites | the ^ dog)
So, the probability of a word sequence is the probability of a conjunctive event.
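The chain rule is easy to render as code. The conditional probabilities below are illustrative toy values, not corpus estimates:

```python
# The chain rule as code: P(the ^ dog ^ bites) is the product of each
# word's probability given the words before it. The conditional
# probabilities are illustrative toy values, not corpus estimates.
p_the = 0.06                  # P(the)
p_dog_given_the = 0.01        # P(dog | the)
p_bites_given_the_dog = 0.05  # P(bites | the ^ dog)

p_sequence = p_the * p_dog_given_the * p_bites_given_the_dog
print(p_sequence)  # 3e-05 (up to floating-point rounding)
```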
Relative word frequencies are better than equal probabilities for all words
In a corpus with 10K word types, each word would have P(w) = 1/10K
Does not match our intuitions that different words are more likely to occur (e.g. the)
Conditional probability more useful than individual relative word frequencies
Dog may be relatively rare in a corpus
But if we see barking, P(dog|barking) may be very large

For a Word String

In general, the probability of a complete string of words w1...wn is
P(w1..n) = P(w1) P(w2|w1) P(w3|w1..2) ... P(wn|w1..n-1)
But this approach to determining the probability of a word sequence is not very helpful in general...
Markov Assumption

P(wn) can be approximated using only the N-1 previous words of context
This lets us collect statistics in practice
A bigram model: P(the barking dog) = P(the) P(barking|the) P(dog|barking)
Markov models are the class of probabilistic models that assume that we can predict the probability of some future unit without looking too far into the past
Order of a Markov model: the length of the prior context

Counting Words in Corpora

Probabilities are based on counting things, so...
What should we count?
Words, word classes, word senses, speech acts ...?
What is a word?
e.g., are cat and cats the same word?
September and Sept?
zero and oh?
Is seventy-two one word or two? AT&T?
Where do we find the things to count?
Corpora

Corpora are (generally online) collections of text and speech
e.g.
Brown Corpus (1M words)
Wall Street Journal and AP News corpora
ATIS, Broadcast News (speech)
TDT (text and speech)
Switchboard, Call Home (speech)
TRAINS, FM Radio (speech)
Training and Testing

Probabilities come from a training corpus, which is used to design the model.
overly narrow corpus: probabilities don't generalize
overly general corpus: probabilities don't reflect task or domain
A separate test corpus is used to evaluate the model, typically using standard metrics
held out test set
cross validation
evaluation differences should be statistically significant

Terminology

Sentence: unit of written language
Utterance: unit of spoken language
Word Form: the inflected form that appears in the corpus
Lemma: lexical forms having the same stem, part of speech, and word sense
Types: number of distinct words in a corpus (vocabulary size)
Tokens: total number of words

Simple N-Grams

An N-gram model uses the previous N-1 words to predict the next one:
P(wn | wn-1)
We'll pretty much always be dealing with conditional probabilities P( | ):
unigrams: P(dog)
bigrams: P(dog | big)
trigrams: P(dog | the big)
quadrigrams: P(dog | chasing the big)
Using N-Grams

Recall that
P(wn | w1..n-1) ≈ P(wn | wn-N+1..n-1)
For a bigram grammar
P(sentence) can be approximated by multiplying all the bigram probabilities in the sequence
E.g.
P(I want to eat Chinese food) = P(I | <start>) P(want | I) P(to | want) P(eat | to) P(Chinese | eat) P(food | Chinese)
A Bigram Grammar Fragment from BERP
P(I want to eat British food) = P(I) P(want|I) P(to|want) P(eat|to) P(British|eat) P(food|British) = .25*.32*.65*.26*.001*.60 ≈ .0000081
vs. I want to eat Chinese food = .00015
Probabilities seem to capture ``syntactic'' facts, ``world knowledge''
eat is often followed by an NP
British food is not too popular
N-gram models can be trained by counting and normalization
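The two sentence probabilities can be reproduced by multiplying the bigram probabilities. P(Chinese|eat) = .02 and P(food|Chinese) = .56 come from the BERP tables; P(I) here abbreviates P(I | <start>) = .25:

```python
# Reproducing the sentence probabilities by multiplying BERP bigram
# probabilities from the fragment tables.
from functools import reduce

probs = {
    ("<start>", "I"): .25, ("I", "want"): .32, ("want", "to"): .65,
    ("to", "eat"): .26, ("eat", "British"): .001, ("British", "food"): .60,
    ("eat", "Chinese"): .02, ("Chinese", "food"): .56,
}

def sentence_prob(bigrams):
    """Multiply the probabilities of a sentence's bigrams."""
    return reduce(lambda acc, bg: acc * probs[bg], bigrams, 1.0)

shared = [("<start>", "I"), ("I", "want"), ("want", "to"), ("to", "eat")]
british = sentence_prob(shared + [("eat", "British"), ("British", "food")])
chinese = sentence_prob(shared + [("eat", "Chinese"), ("Chinese", "food")])

print(f"{british:.7f}")  # 0.0000081
print(f"{chinese:.5f}")  # 0.00015
```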
BERP Bigram Counts
BERP Bigram Probabilities

Normalization: divide each row's counts by the appropriate unigram counts
Computing the probability of "I I":
C(I I)/C(all I)
p = 8 / 3437 = .0023
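The same normalization in code, using counts from the BERP tables:

```python
# Row normalization in code: P(w2 | w1) = C(w1 w2) / C(w1),
# using counts from the BERP tables.
bigram_count = {("I", "I"): 8, ("I", "want"): 1087, ("want", "to"): 786}
unigram_count = {"I": 3437, "want": 1215}

def bigram_prob(w1, w2):
    """Divide the bigram count by the unigram count of the first word."""
    return bigram_count[(w1, w2)] / unigram_count[w1]

print(round(bigram_prob("I", "I"), 4))      # 0.0023
print(round(bigram_prob("I", "want"), 2))   # 0.32
print(round(bigram_prob("want", "to"), 2))  # 0.65
```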
A bigram grammar is an NxN matrix of probabilities, where N is the vocabulary size

What do we learn about the language?

What's being captured with...
P(want | I) = .32
P(to | want) = .65
P(eat | to) = .26
P(food | Chinese) = .56
P(lunch | eat) = .055
What about...
P(I | I) = .0023
P(I | want) = .0025
P(I | food) = .013
P(I | I) = .0023  ("I I I I want")
P(I | want) = .0025  ("I want I want")
P(I | food) = .013  ("the kind of food I want is ...")

Approximating Shakespeare

As we increase the value of N, the accuracy of the n-gram model increases
Generating sentences with random unigrams...
Every enter now severally so, let
Hill he late speaks; or! a more to leg less first you enter
With bigrams...
What means, sir. I confess she? then all sorts, he is trim, captain.
Why dost stand forth thy canopy, forsooth; he is this palpable hit the King Henry.
Trigrams
Sweet prince, Falstaff shall die.
This shall forbid it should be branded, if renown made it empty.
Quadrigrams
What! I will go seek the traitor Gloucester.
Will you not tell me who I am?
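Generation of this kind is a random walk over the n-gram table. A toy version with bigram successor counts, on a tiny stand-in corpus (the real experiment used the million-word Shakespeare corpus; this stand-in has only one possible path):

```python
# A toy version of the generation experiment: random walks over bigram
# successor counts. The "corpus" is a tiny stand-in, so there is only
# one possible path from any start word.
import random
from collections import defaultdict

corpus = "sweet prince falstaff shall die . will you not tell me who i am ?".split()

successors = defaultdict(list)  # word -> list of observed next words
for w1, w2 in zip(corpus, corpus[1:]):
    successors[w1].append(w2)

def generate(start, length=8):
    """Random walk through the bigram table, starting from `start`."""
    words = [start]
    for _ in range(length):
        nexts = successors.get(words[-1])
        if not nexts:
            break
        words.append(random.choice(nexts))
    return " ".join(words)

print(generate("will"))  # -> will you not tell me who i am ?
```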
There are 884,647 tokens, with 29,066 word form types, in about a one million word Shakespeare corpus
Shakespeare produced 300,000 bigram types out of 844 million possible bigrams: so, 99.96% of the possible bigrams were never seen (have zero entries in the table).
Quadrigrams worse: What's coming out looks like Shakespeare because it is Shakespeare.
All those zeroes are causing problems.

N-Gram Training Sensitivity

If we repeated the Shakespeare experiment but trained on a Wall Street Journal corpus, there would be little overlap in the output
This has major implications for corpus selection or design

Some Useful Empirical Observations

A small number of events occur with high frequency
A large number of events occur with low frequency
You can quickly collect statistics on the high frequency events
You might have to wait an arbitrarily long time to get valid statistics on low frequency events
Some of the zeroes in the table are really zeroes. But others are simply low-frequency events you haven't seen yet. How to fix this?

Smoothing Techniques

Every n-gram training matrix is sparse, even for very large corpora (Zipf's law)
Solution: estimate the likelihood of unseen n-grams
Add-one smoothing:
Add 1 to every bigram count
Normalize by N/(N+V)
Smoothed count is c* = (c+1) N/(N+V)
Discount: the ratio of new counts to old (e.g., add-one smoothing changes the BERP bigram count C(to|want) from 786 to 331, and p from .65 to .28)
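Add-one smoothing for the (to | want) bigram, as a sketch. The vocabulary size V is an assumption here (the slides don't state it); with V = 1616 the smoothed probability comes out at .28, and the smoothed count lands near, though not exactly at, 331:

```python
# Add-one smoothing for the BERP bigram (to | want).
# V, the vocabulary size, is an assumed value.
c = 786   # C(want to)
N = 1215  # C(want), the number of bigram tokens starting with "want"
V = 1616  # assumed vocabulary size

p_mle = c / N                 # .65
p_addone = (c + 1) / (N + V)  # ~ .28
c_star = p_addone * N         # smoothed count, ~ 338

print(round(p_mle, 2), round(p_addone, 2))  # 0.65 0.28
```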
We'd like to find methods that don't change the original probabilities so drastically
Witten-Bell Discounting

A zero-count n-gram is just an n-gram you haven't seen yet...
Model unseen bigrams by the n-grams you've only seen once (i.e., the total number of word types in the corpus)
Total probability of unseen bigrams estimated as T/(N+T)
View the training corpus as a series of events, one for each token (N) and one for each new type (T)
We can divide the probability mass equally among unseen bigrams..., or we can condition the probability of an unseen bigram on the first word of the bigram
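A minimal sketch of the Witten-Bell estimate, with toy token/type counts (these numbers are illustrative, not the BERP values):

```python
# Witten-Bell in miniature: reserve probability mass for unseen events
# in proportion to how often new types appeared in training.
# All numbers here are toy values.
N = 100_000    # tokens seen in training (one event per token)
T = 10_000     # distinct types seen (one "new type" event each)
Z = 1_000_000  # number of unseen bigrams (V*V minus those seen), assumed

unseen_mass = T / (N + T)        # total probability reserved for unseen events
p_each_unseen = unseen_mass / Z  # mass divided equally among unseen bigrams

print(round(unseen_mass, 4))  # 0.0909
```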
Discount values for Witten-Bell are much more reasonable than for Add-One
Good-Turing Discounting

Re-estimate the amount of probability mass for zero-count (or low-count) n-grams by looking at n-grams with higher counts
Estimate: c* = (c+1) N(c+1)/N(c), where N(c) is the number of n-grams seen exactly c times
Assumes:
word bigrams follow a binomial distribution
we know the number of unseen bigrams (V×V minus the number seen)
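Good-Turing re-estimation on an example count-of-counts table (5000 bigrams unseen, 250 seen exactly once, 75 seen exactly twice):

```python
# Good-Turing re-estimation from a count-of-counts table:
# N_c[c] is the number of bigrams observed exactly c times.
N_c = {0: 5000, 1: 250, 2: 75}

def good_turing(c):
    """c* = (c + 1) * N(c+1) / N(c)."""
    return (c + 1) * N_c[c + 1] / N_c[c]

print(good_turing(0))  # 0.05  (revised count for unseen bigrams)
print(good_turing(1))  # 0.6   (revised count for once-seen bigrams)
```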
Backoff methods (e.g., Katz)

Build, e.g., a trigram model
Where the trigram is unavailable, back off to the bigram if available; otherwise use the unigram probability
E.g., "An omnivorous unicorn"
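A sketch of the backoff idea (Katz backoff additionally discounts and weights these estimates; that detail is omitted here). All tables and values are illustrative:

```python
# Backoff sketch: use the trigram estimate if we have one,
# otherwise the bigram, otherwise the unigram.
# All tables and values are illustrative.
trigram_p = {("omnivorous", "unicorn", "ate"): 0.0}  # trigram never observed
bigram_p = {("unicorn", "ate"): 0.1}
unigram_p = {"ate": 0.001}

def backoff_prob(w1, w2, w3):
    """P(w3 | w1 w2), backing off to shorter contexts when needed."""
    if trigram_p.get((w1, w2, w3), 0) > 0:
        return trigram_p[(w1, w2, w3)]
    if bigram_p.get((w2, w3), 0) > 0:
        return bigram_p[(w2, w3)]
    return unigram_p.get(w3, 0)

print(backoff_prob("omnivorous", "unicorn", "ate"))  # 0.1 (backed off to bigram)
```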
Next Class

Midterm
Next class:
Hindle & Rooth 1993
Begin studying semantics, Ch. 14
CS 4705
A Bigram Grammar Fragment from BERP (eat bigrams):

Eat on      .16    Eat Thai      .03
Eat some    .06    Eat breakfast .03
Eat lunch   .06    Eat in        .02
Eat dinner  .05    Eat Chinese   .02
Eat at      .04    Eat Mexican   .02
Eat a       .04    Eat tomorrow  .01
Eat Indian  .04    Eat dessert   .007
Eat today   .03    Eat British   .001
A Bigram Grammar Fragment from BERP (continued):

<start> I     .25    I want    .32    Want to    .65    To eat    .26    British food        .60
<start> I'd   .06    I would   .29    Want a     .05    To have   .14    British restaurant  .15
<start> Tell  .04    I don't   .08    Want some  .04    To spend  .09    British cuisine     .01
<start> I'm   .02    I have    .04    Want Thai  .01    To be     .02    British lunch       .01
BERP Bigram Counts (rows: first word; columns: second word):

          I     Want   To    Eat   Chinese  Food  Lunch
I         8     1087   0     13    0        0     0
Want      3     0      786   0     6        8     6
To        3     0      10    860   3        0     12
Eat       0     0      2     0     19       2     52
Chinese   2     0      0     0     0        120   1
Food      19    0      17    0     0        0     0
Lunch     4     0      0     0     0        1     0
BERP unigram counts:

I 3437    Want 1215    To 3256    Eat 938    Chinese 213    Food 1506    Lunch 459
Good-Turing example (count of counts):

C (MLE)   Nc     C* (GT)
0         5000   .05
1         250    .6
2         75     ...