Lecture 2 | Word Vector Representations: word2vec

[MUSIC] Stanford University. >> Okay, so let's get going. Welcome back to the second
class of CS224N/Ling 284, Natural Language Processing
with Deep Learning. So this class is gonna be almost
the complete opposite of the last class. So the last class was a very high-level picture,
sort of coming from the very top down, to sort of say a little bit about what
natural language processing is, what deep learning is, why it's exciting,
why both of them are exciting, and how I'd like to put them together. So for today's class we're gonna go
completely to the opposite extreme. We're gonna go right down
to the bottom of words, and we're gonna have vectors,
and we're gonna do baby math. Now for some of you this will seem
like tedious repetitive baby math. But I think that there are probably
quite a few of you for which having some math review
is just going to be useful. And this is really the sort of foundation
on which everything else builds.

And so if you don't sort of get the fundamentals
straight right at the beginning, of how you can use neural networks on the
sort of very simplest kind of structures, it's sort of really all over from there. So what I'd like to do today is
sort of really go slowly and carefully through the foundations of how
you can start to do things with neural networks in this very simple case of
learning representations for words.

And I hope that's kind of a good foundation
that we can build forward on. And indeed that's what we're gonna
keep on doing, building forward. So next week Richard is gonna keep on
doing a lot of math from the ground up to try and really help get straight some
of the foundations of Deep Learning. Okay, so this is basically the plan. So tiny bit word meaning and no, [LAUGH]. >> [LAUGH]
>> Tiny bit on word meaning then start to introduce this model of learning
word vectors called Word2vec. And this was a model that was
introduced by Tomas Mikolov and colleagues at Google in 2013. And so there are many other
ways that you could think about having representations of words. And next week, Richard's gonna talk
about some of those other mechanisms. But today, I wanna sort of avoid
having a lot of background and comparative commentary. So I'm just gonna present
this one way of doing it. And you'd also pretty study
the good way of doing it, so it's not a bad one to know. Okay, so then after that,
we're gonna have the first or was it gonna be one of
the features of this class.

We decided that all the evidence says that
students can't concentrate for 75 minutes. So we decided we'd sort of mix
it up a little, and hopefully, also give people an opportunity to sort
of get more of a sense of what some of the exciting new work that's coming
out every month in Deep Learning is. And so what we're gonna do is have one TA
each time, do a little research highlight, which will just be sort of like
a verbal blog post telling you a little bit about some recent paper
and why it's interesting and exciting. We're gonna start that today with Danqi. Then after that, I wanna go sort of carefully through the
word2vec objective function gradients.

A little refresher on optimization,
mention the assignment, tell you all about Word2vec.
That's basically the plan, okay? So we kinda wanna sort
of have word vectors, as I mentioned last time, as
a model of word meaning. That's a pretty
controversial idea actually. And I just wanna give kind of a few words
of context before we dive into that and do it anyway. Okay, so
if you look up meaning in a dictionary cuz a dictionary is a storehouse
of word meanings after all. What the Webster's dictionary says is
meaning is the idea that is represented by a word, phrase, etc. The idea that a person wants to express
by using words, signs, etc, etc. In some sense, this is fairly close to what is the commonest linguistic
way of thinking of meaning. So standardly in linguistics,
you have a linguistic sign like a word, and then it has things that
it signifies in the world.

So if I have a word like glasses then
it's got a signification which includes these and there are lots of other pairs of
glasses I can see in front of me, right? And those things that it signifies are the denotation of the term glasses. That hasn't proven to be a notion of
meaning that's been very easy for people to make much use of in computational
systems for dealing with language. So in practice, if you look at what
computational systems have done for meanings of words over
the last several decades. By far the most common thing that's
happened is, people have tried to deal with the meaning of words by
making use of taxonomic resources.

And so for English, the most
famous taxonomic resource is WordNet. And it's famous,
maybe not like Webster's is famous. But it's famous among
computational linguists, because it's free to download a copy and that's much more useful than having
a copy of Webster's on your shelf. And it provides a lot of
taxonomy information about words. So this little bit of Python code,
this is showing you getting a hold of
WordNet using NLTK, which is one of the main Python packages for NLP. And so then I'm asking it for
the word panda, not the Python package pandas,
the word panda. Then I'm saying, well, tell me about the hypernyms, the kinds
of things that it's a kind of. And so for panda it's sort of heading
up through carnivores, placentals, mammals, up into sort of
abstract types like objects. Or on the right hand side,
I'm sort of asking for the word good,
well, tell me about synonyms of good.

And part of what you're finding there is,
well, WordNet is saying the word good has different senses. So for each sense, let me tell
you some synonyms for each sense. So one sense, the second one, is sort
of the kind of good person sense. And they're suggesting synonyms
like honorable and respectable. But there are other ones here
where this pear is good to eat, and that sort of meaning is ripe. Okay, so
you get this sort of sense of meaning. That's been,
that's been a great resource, but it's also been a resource where
people have found in practice that
it's hard to get nearly as much value out
of it as you'd like to get out of it. And why is that? I mean there are a whole bunch of reasons. I mean one reason is that at this level
of these sort of taxonomic relationships, you lose an enormous amount of nuance.

So one of those synonym sets for
good was adept, expert, good, practiced, proficient, skillful. But I mean, it seems like those mean
really different things, right? It seems like saying I'm
an expert at deep learning
means something slightly different
from I'm good at deep learning. So there's a lot of nuance there. There's a lot of incompleteness in WordNet
for a lot of the ways that people use words more flexibly. So if I say I'm a deep-learning ninja, or something like that,
that's not in WordNet at all.

What kind of things you put into these
synonym sets ends up very subjective, right? Which sense distinctions you make and
which things you do and don't say are the same,
it's all very unclear. It requires,
even to the extent that it's made, it's required many person
years of human labor. And at the end of the day,
it's sort of, it's kind of hard to get anything accurate out of it
in the way of sort of word similarities.

Like I kind of feel that proficient is
more similar to expert than good, maybe. But you can't get any of this
kind of stuff out of WordNet. Okay, so therefore,
that's sort of something of a problem. And it's part of this
general problem of discrete, or categorical, representations
that I started on last time. So, the fundamental thing
to note is that for sorta just about all NLP,
apart from both modern deep learning and a little bit of neural network
NLP that got done in the 1980s, it's all used atomic symbols
like hotel, conference, walk.

And if we think of that from our kind
of jaundiced neural net direction, using atomic symbols is kind of like using big vectors that are zero everywhere
apart from a one in one position. So what we have is, we have a lot of words
in the language that are equivalent to our symbols and we're putting a one
in the position, in the vector, that represents the particular symbol,
perhaps hotel. And these vectors are going to be really,
really long. I mean,
how long depends on how you look at it. So sometimes a speech recognizer
might have a 20,000 word vocabulary. So it'd be that long. But, if we're kinda building
a machine translation system, we might use a 500,000 word vocabulary,
so that's very long. And Google released sort of
a 1-terabyte corpus of web crawl.

That's a resource that's been
widely used for a lot of NLP. And, well, the size of the vocabulary
in that is 13 million words, so that's really, really long. So, it's a very, very big vector. And so, why are these vectors problematic? I'm sorry, I'm not remembering my slides,
so I should say my slides first. Okay, so this is referred to in
neural net land as one-hot encoding, because there's just this
one 1 among the zeros in the vector. And so, that's an example of
a localist representation. So why is this problematic? And the reason why it's
problematic is it doesn't give any inherent notion of
relationships between words. So, very commonly what we want
to know is when meanings of words and
phrases are similar to each other. So, for example, in a web search
application, if the user searches for Dell notebook battery size, we'd like to match a document that
says Dell laptop battery capacity.

So we sort of want to know that
notebooks and laptops are similar, and size and capacity are similar,
so this will be equivalent. We want to know that hotels and
motels are similar in meaning. And the problem is that if we're
using one-hot vector encodings, they have no natural notion of similarity. So if we take these two vectors and say, what is the dot product between
those vectors, it's zero. They have no inherent
notion of similarity. And, something I just wanna stress
a little, since this is important, is to note that this problem of symbolic
encoding applies not only to traditional rule-based logical approaches to
natural language processing, but it also applies to basically all of
the work that was done in probabilistic, statistical, conventional machine learning
based natural language processing. Although those latter models normally had
real numbers, they had probabilities of something occurring in the context of
something else, nevertheless, they were built over
symbolic representations. So you weren't capturing any kind of
relationships between words in the models;
each word was a nation to itself.
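
To make the one-hot point concrete, here is a minimal sketch in plain Python. The three-word vocabulary and the helper names `one_hot` and `dot` are made up for illustration and are not from the slides; a real vocabulary would be tens of thousands of words or more.

```python
def one_hot(index, size):
    """Return a list that is 0 everywhere except for a 1 at `index`."""
    vec = [0] * size
    vec[index] = 1
    return vec


def dot(a, b):
    """Dot product: multiply corresponding entries and sum them up."""
    return sum(x * y for x, y in zip(a, b))


# Hypothetical toy vocabulary, chosen only for this example.
vocab = ["hotel", "motel", "walk"]
hotel = one_hot(vocab.index("hotel"), len(vocab))
motel = one_hot(vocab.index("motel"), len(vocab))

print(dot(hotel, motel))  # two different words never share a nonzero position
print(dot(hotel, hotel))  # a word only ever matches itself
```

Whatever two distinct words you pick, the dot product comes out the same, so the encoding by itself carries no information about which words are similar.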

Okay, so that's bad, and
we have to do something about it. Now, as I've said, there's more than
one thing that you could do about it. And so, one answer is to say, okay gee, we need to have a similarity
relationship between words. Let's go over here and start building completely separately
a similarity relationship between words. And, of course, you could do that. But I'm not gonna talk about that here. What instead I'm going to talk about and
suggest is that what we could do is we could
explore this direct approach, where the representation of
a word encodes its meaning in such a way that you can
just directly read off from these representations,
the similarity between words. So what we're gonna do is
have these vectors and do something like a dot product. And that will be giving us a sense
of the similarity between words. Okay, so how do we go about doing that? And so the way we're gonna go about
doing that is by making use of this very simple, but
extremely profound and widely used, NLP idea called distributional similarity. So this has been a really powerful notion.

So the notion of distributional similarity
is that you can get a lot of value for representing the meaning of a word
by looking at the context in which it appears and
doing something with those contexts. So, if I want to know what
the word banking means, what I'm gonna do is find thousands of
instances of the word banking in text and I'm gonna look at the environment
in which each one appeared. And I'm gonna see debt problems,
governments, regulation, Europe, saying unified and I'm gonna start counting up all of these
things that appear and by some means, I'll use those words in the context
to represent the meaning of banking. The most famous slogan that you
will read everywhere if you look into distributional similarity is this one
by J. R. Firth, who was a British linguist, who said, you shall know a word
by the company it keeps. But this is also really exactly the same
notion that Wittgenstein proposed in his later writings where he
suggested a use theory of meaning. Where, somewhat controversially,
this is not the mainstream in semantics, he suggested that the right way to
think about the meaning of words is understanding their uses in text.
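
The "count up the things that appear around banking" idea can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the example sentence, the function name `context_counts`, and the window of two words on each side are all made up, and a real system would run over thousands of occurrences rather than two.

```python
from collections import Counter


def context_counts(tokens, target, window=2):
    """Count the words that occur within `window` positions of each `target` occurrence."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # skip the target word itself
                counts[tokens[j]] += 1
    return counts


# Made-up sentence, used only to exercise the function.
text = "europe needs unified banking regulation after the banking crisis".split()
counts = context_counts(text, "banking", window=2)
print(counts)
```

The resulting counts are one very crude way of using the context words to represent what banking means.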

So, essentially,
if you could predict which textual context the word would appear in, then you
understand the meaning of the word. Okay, so that's what we're going to do. So what we want to do is say, for
each word, we're going to come up with a dense vector for it, and that dense vector
is gonna be chosen so that it'll be good at predicting other words
that appear in the context of this word. Well how do we do that? Well, each of those other words will also
have a word vector attached to them, and then we'll be looking at sort
of similarity measures like dot product between those two vectors.

And we're gonna change
them as well to make it so that they're good at being able to be predicted. So it all kind of gets a little
bit recursive or circular, but we're gonna come up with this
clever algorithm to do that, so that words will be able to predict
their context words and vice-versa. And so I'm gonna go on and
say a little bit more about that. But let me just underline one bit of terminology that was
appearing before in the slide. So we saw two keywords. One was distributional, which was here. And then we've had
distributed representations where we have these dense vectors to
represent the meaning of the words.

Now people tend to
confuse those two words. And there's sort of two
reasons they confuse them. One is because they both start with
distribute and so they're kind of similar. And the second reason people confuse them
is because they very strongly co-occur. So distributed representations of
meaning have almost always, up until now, been built by
using distributional similarity. But I did just want people to gather
that these are different notions, right? So the idea of distributional similarity
is a theory about the semantics of word meaning: that you can describe the meaning
of words, as in a use theory of meaning, by understanding the contexts
in which they appear. So distributional contrasts with,
way back here when I said but didn't really explain,
denotational, right? The denotational idea of
word meaning is the meaning of glasses is the set of pairs of
glasses that are around the place.

That's different from
distributional meaning. And distributed then contrasts
with our one-hot word vector. So the one-hot word vectors are a localist
representation where you're storing in one place. You're saying here is the symbol glasses. It's stored right here whereas
in distributed representations we're smearing the meaning of
something over a large vector space. Okay, so that's part one. And we're now gonna sorta be heading into
part two, which is what is Word2vec? Okay, and so
I'll go almost straight into this. But this is sort of the recipe in
general for what we're doing for learning neural word embeddings. So we're gonna define a model
that aims to predict between a center word and
words that appear in its context. Kind of like we are here,
the distributional wording. And we'll sort of have some,
perhaps, probability measure that predicts the probability of
the context given the word.

And then once we have that we can
have a loss function as to whether we do that prediction well. So ideally we'd be able to perfectly
predict the words around the word so the minus t means the words that aren't
word index t so the words around t. If we could predict those perfectly
from t we'd have probability one so we'd have no loss but
normally we can't do that. And if we give them probability a quarter
then we'll have sort of three quarters loss or something, right? So we'll have a loss function and we'll sort of do that in many
positions in a large corpus.

And so our goal will be to change
the representations of words so as to minimize our loss. And at this point sort
of a miracle occurs. It's sort of surprising, but
true that you can do no more than set up this kind of
prediction objective. Make it the job of every word's word
vectors to be such that they're good at predicting the words that
appear in their context or vice versa. You just have that very simple goal and
you say nothing else about how this is gonna be achieved, but you just pray and
depend on the magic of deep learning. And this miracle happens and out come these word vectors that
are just amazingly powerful at representing the meaning of words and
are useful for all sorts of things. And so that's where we want to get into
more detail and say how that happens. Okay. Yes, what does this w minus t mean? So that notation was
meant to mean all words apart from wt.
I'm actually not gonna use that
notation again in this lecture.

But in the w minus t, minus is sometimes
used to mean everything except t. So wt is my focus word, and w minus
t is all the words in the context. Okay, so this idea that you can
learn low dimensional vector representations is an idea that
has a history in neural networks. It was certainly present in the 1980s, parallel distributed processing
era including work by Rumelhart on learning representations
by back-propagating errors. It really was demonstrated for
word representations in this pioneering early paper by Yoshua Bengio in 2003, A
Neural Probabilistic Language Model. I mean, at the time, sort of not so many
people actually paid attention to this paper, this was sort of before
the deep learning boom started. But really this was the paper
where they sort of showed how much value you could get from having
distributed representations of words and being able to predict other words in context.

But then as things started to take off
that idea was sort of built on and revived. So in 2008, Collobert and Weston started
in the sort of modern direction by saying, well, if we just want good word
representations, we don't even have to necessarily make a probabilistic
language model that can predict, we just need to have a way of
learning our word representations. And that's something that's then been
continued in the model that I'm gonna look at now, the word2vec model.

The emphasis of the word2vec model
was how can we build a very simple, scalable, fast to train model
that we can run over billions of words of text that will produce
exceedingly good word representations. Okay, word2vec, here we come. The basic thing word2vec is trying
to do is use the use theory of meaning: predict between every word and
its context words. Now word2vec is a piece of software,
I mean, actually inside word2vec it's kind
of a sort of a family of things. So there are two algorithms inside it for
producing word vectors and there are two moderately
efficient training methods. So for this class what I'm
going to do is tell you about one of the algorithms, which
is the skip-gram method, and about neither of the moderately
efficient training algorithms. Instead I'm gonna tell you about
the hopelessly inefficient training algorithm, which is sort of the conceptual
basis of how this is meant to work, and then the moderately efficient ones,
which I'll mention at the end.

And then what you'll have to do to
actually make this a scalable process that you can run fast. And then, today is also the day
when we're handing out assignment one, and a major part of what you
guys get to do in assignment one is to implement one of
the efficient training algorithms, and to work through the math of one of
those efficient training algorithms. So this is the picture
of the skip-gram model. So the idea of the skip-gram model is for each estimation step,
you're taking one word as the center word. So that's here, is my word banking and
then what you're going to do is you're going to try and predict words
in its context out to some window size. And so, the model is going to define
a probability distribution that is the probability of a word appearing
in the context given this center word. And we're going to choose vector
representations of words so we can try and
maximize that probability distribution. And the thing that we'll come back to, but it's important to realize, is that there's
only one probability distribution in this model.

It's not that there's
a probability distribution for the word one to the left and the word
one to the right, and things like that. We just have one probability
distribution of a context word, which we'll refer to as the output,
because it's what the model produces as the output, occurring in
the context close to the center word. Is that clear? Yeah, okay. So that's what we kinda wanna do so
we're gonna have a radius m and then we're going to predict
the surrounding words from sort of positions m before our center
word to m after our center word. And we're gonna do that a whole bunch
of times in a whole bunch of places. And we want to choose word vectors such that we're maximizing
the probability of that prediction. So what our loss function or objective
function is, is really this J prime here. So the J prime is saying we're going to,
so we're going to take a big long amount of text, we take the whole
of Wikipedia or something like that, so we've got a big long sequence of words, so
there are words in context in real running text, and we're going to
go through each position in the text.

And then, for each position in the text,
we're going to have a window of size 2m around it,
m words before and m words after it. And we're going to have a probability
distribution that will give a probability to a word appearing in
the context of the center word. And what we'd like to do is set
the parameters of our model so that these probabilities
of the words that actually do appear in the context of the center
word are as high as possible.
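
Going through each position and looking m words before and m words after can be sketched as follows. The generator name `skipgram_pairs` and the six-word example text are hypothetical; the only things taken from the lecture are the center word, the radius m, and the window of size 2m.

```python
def skipgram_pairs(tokens, m):
    """Yield (center, context) pairs for every position t and offset j with 1 <= |j| <= m."""
    for t, center in enumerate(tokens):
        for j in range(-m, m + 1):
            # Skip the center word itself and offsets that fall off either end of the text.
            if j == 0 or not (0 <= t + j < len(tokens)):
                continue
            yield center, tokens[t + j]


# Made-up snippet of running text, for illustration only.
tokens = "problems turning into banking crises as".split()
pairs = list(skipgram_pairs(tokens, m=2))
banking_pairs = [p for p in pairs if p[0] == "banking"]
print(banking_pairs)
```

Each pair is one prediction the model is asked to make: given the center word, assign high probability to that context word. Positions near the edges of the text simply contribute fewer pairs in this sketch.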

So the parameters in this model are this
theta here that I show here and here. After this slide,
I kinda drop the theta over here. But you can just assume
that there is this theta. What is this theta? It's going to be the vector
representation of the words. The only parameters in this model are
the vector representations of each word. There are no other parameters whatsoever
in this model as you'll see pretty quickly. So conceptually this is
our objective function. We wanna maximize the probability
of these predictions. In practice, we just slightly tweak that. Firstly, almost invariably when
we're working with probabilities and we want to do maximization, we actually
turn things into log probabilities, cuz then all the products turn into sums and our math gets a lot easier to work with,
and so that's what I've done down here. Good point. And the question is, hey, wait a minute,
you're cheating, window size, isn't that a parameter of the model? And you are right,
this is a parameter of the model.

So I guess I was a bit loose there. Actually, it turns out that there are
several hyperparameters of the model, so I did cheat. It turns out that there are a few
hyperparameters of the model. One is window size, and it turns out
that we'll come across a couple of other fudge factors later in the lecture. And all of those things are
hyperparameters that you could adjust. But let's just ignore those for
the moment, let's just assume those are constant. And given those things
aren't being adjusted, the only parameters in the model are
the vector representations of the words. What I'm meaning is that there's
sort of no other probability distribution with its own parameters. That's a good point. I buy that one. So we've gone to the log probability and
the sums now and then, rather than having
the probability of the whole corpus, we can sort of take the average over
each position, so I've got 1 over T here.

And that's just sort of making it per
word as sort of a kinda normalization. So that doesn't affect what's the maximum. And then, finally,
the machine learning people really love to minimize things
rather than maximizing things. And so, you can always swap
between maximizing and minimizing, when you're in plus minus land, by
putting a minus sign in front of things. And so, at this point,
we get the negative log likelihood, the negative log probability
according to our model. And so, that's what we will be formally minimizing as our objective function.
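
Written out, the steps just described look roughly as follows. The symbols follow the spoken description (T positions in the text, window radius m, word w_t at position t, parameters theta), but the typesetting is a reconstruction, since the slide itself is not reproduced in this transcript. First the product of probabilities, then the version with logs, the 1 over T average, and the minus sign:

```latex
J'(\theta) = \prod_{t=1}^{T} \prod_{\substack{-m \le j \le m \\ j \ne 0}} p(w_{t+j} \mid w_t; \theta)

J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \log p(w_{t+j} \mid w_t; \theta)
```

Since the log is monotonic and 1 over T is a positive constant, minimizing J picks out the same theta as maximizing J prime.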

So whether you call it objective function, cost
function, loss function, it's all the same. This negative log likelihood criterion
really means that we're using a cross-entropy loss, which we're
gonna come back to next week, so I won't really go through it now. But the trick is, since we
have a one-hot target, which is just predict the word
that actually occurred, under that criterion the only
thing that's left in the cross-entropy loss is the negative
log probability of the true class. Well, how are we gonna actually do this? How can we make use of
these word vectors to minimize that negative log likelihood? Well, the way we're gonna
do it is we're gonna come up with the probability
distribution of a context word, given the center word, which is
constructed out of our word vectors.
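
As a small check on the cross-entropy remark a moment ago, here is a sketch showing that with a one-hot target the cross-entropy sum collapses to a single term, the negative log probability the model gave to the word that actually occurred. The three probabilities and the function name `cross_entropy` are made up for the example.

```python
import math


def cross_entropy(target, predicted):
    """Cross-entropy: minus the sum over classes of target[i] * log(predicted[i])."""
    # Terms where the target is 0 contribute nothing, so they are skipped.
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)


predicted = [0.1, 0.7, 0.2]  # hypothetical model probabilities over a 3-word vocabulary
target = [0, 1, 0]           # one-hot: the word that actually occurred has index 1

loss = cross_entropy(target, predicted)
print(loss)
print(-math.log(predicted[1]))  # the same number, computed directly
```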

And so, this is what our probability
distribution is gonna look like. So just to make sure we're clear on
the terminology I'm gonna use forward from here. So c and o are indices in the space
of the vocabulary, the word types. So up here, the t and the t plus j
are positions in my text. Those are sort of word
763 and word 766 in my text. But here o and c index word types in my vocabulary, and so I have my p for words 73 and
47 in my vocabulary. And so, each word type is going to
have a vector associated with it, so u_o is the vector associated
with the context word with index o, and v_c is the vector that's
associated with the center word. And so, how we find this probability
distribution is we're going to use this, what's called a Softmax form,
where we're taking dot products between the two word vectors and then we're
putting them into a Softmax form. So just to go through that kind
of maximally slowly, right? So we've got two word vectors and
we're gonna dot product them, which means that we sort of
take the corresponding terms and multiply them together and
sort of sum them all up.
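
For reference, the Softmax form being described can be written like this. The names o, c, u_o and v_c are the ones used in the lecture; V for the vocabulary size and w for the index that runs over the whole vocabulary are notation I am assuming here.

```latex
p(o \mid c) = \frac{\exp\left(u_o^{\top} v_c\right)}{\sum_{w=1}^{V} \exp\left(u_w^{\top} v_c\right)},
\qquad
u_o^{\top} v_c = \sum_{i=1}^{d} (u_o)_i \, (v_c)_i
```

The second expression is just the dot product spelled out: corresponding terms multiplied together and summed over the d dimensions of the vectors.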

So the dot product is sort of like
a loose measure of similarity, so if the contents of the vectors
are more similar to each other, the number will get bigger. So that's kind of a similarity
measure through the dot product. And then once we've worked out
dot products between words we're then putting it
in this Softmax form. So this Softmax form is a standard way to turn numbers into
a probability distribution. So when we calculate dot products,
they're just numbers, real numbers. They could be minus 17 or 32. So we can't directly turn those
into a probability distribution so an easy thing that we can
do is exponentiate them. Because if you exponentiate things
that puts them into positive land so it's all gonna be positive. And that's a good basis for
having a probability distribution. And if you have a bunch of numbers that
come from anywhere that are positive and you want to turn them into a probability
distribution that's proportional to the size of those numbers,
there's a really easy way to do that. Which is you sum all the numbers together
and you divide through by the sum and that then instantly gives you
a probability distribution.
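
The two steps, exponentiate and then divide by the sum, can be sketched in plain Python as below. The scores are made-up numbers in the spirit of the "minus 17 or 32" example. Subtracting the largest score before exponentiating is a common numerical-stability trick that is not mentioned in the lecture; it leaves the resulting probabilities unchanged.

```python
import math


def softmax(scores):
    """Exponentiate each score, then divide by the total so the results sum to 1."""
    # Stability trick (an addition, not from the lecture): shifting every score by the
    # same constant does not change the output, but avoids overflow in exp.
    biggest = max(scores)
    exps = [math.exp(s - biggest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


scores = [-17.0, 32.0, 30.0]  # hypothetical dot products; any real numbers are allowed
probs = softmax(scores)
print(probs)
print(sum(probs))
```

Every output is positive and the outputs add up to one, which is all that is needed for a probability distribution. Note also how much of the mass goes to the largest score.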

So that's then the denominator, which is
normalizing to give a probability and so when you put those together, that then
gives us this form that we're using as our Softmax form which is now
giving us a probability estimate. So that's giving us this
probability estimate here built solely in terms of
the word vector representations. Is that good? Yeah. That is an extremely good question and
I was hoping to delay saying that for just a minute but you've asked and
so I will say it. Yes, you might think that one word should
only have one vector representation. And if you really wanted to you could
do that, but it turns out you can make the math considerably easier by
saying no, actually each word has two vector representations: it has one vector
representation when it's the center word. And it has another vector representation
when it's a context word.

So that's formally what we have here. So the v is the center word vectors,
and the u are the context word vectors. And it turns out not only does
that make the math a lot easier, because the two
representations are separated when you do optimization rather
than tied to each other. It actually, in practice, empirically
works a little better as well, so if your life is easier and
better, who would not choose that? So yes, we have two vectors for each word. Any other questions? Yeah, so the question is,
well wait a minute, you just said this was a way to
make everything positive, but actually you also simultaneously
screwed with the scale of things a lot.

And that's true, right? The reason why this is called a Softmax
function is because it's kind of close to a max function,
because when you exponentiate things, the big things get way bigger and
so they really dominate. And so this really sort of blows out
in the direction of a max function, but not fully. It's still a sort of a soft thing. So you might think that
that's a bad thing to do.

Doing things like this is the most
standard way of doing things, underlying a lot of math,
including all those super common logistic regressions you see in other classes.
So it's a good way to know, but people have certainly worked
on a whole bunch of other ways. And there are reasons that you might
think they're interesting, but I won't do them now. Yes? Yeah, so the question was,
when I'm dealing with the context words, am I paying attention to where they are or
just their identity? Yeah, where they are has nothing
to do with it in this model. It's just, what is the identity of
the word somewhere in the window? So there's just one
probability distribution and one representation of the context word.

Now you know, it's not that
that's necessarily a good idea. There are other models which absolutely
pay attention to position and distance. And for some purposes,
especially more syntactic purposes rather than semantic purposes,
that actually helps a lot. But if you're sort of more interested
in just sort of word meaning, it turns out that not paying attention to position actually tends to
help you rather than hurting you. Yeah. Yeah, so the question is how, wait
a minute, is there a unique solution here? Could there be different rotations
that would be equally good? And the answer is yes, there can be. I think we should put off discussing
this cuz actually there's a lot to say about optimization in neural networks,
and there's a lot of exciting new work.

And the one sentence headline is
it's all good news. People spent years saying that local minima ought to be
a big problem and it turns out it's not. It all works. But I think we'd better put off talking
about that in any more detail. Okay, so yeah, this is my picture of what the skip-gram
model ends up looking like. It's a bit confusing and hard to read, but also I've got it drawn
from left to right. Right, so we have the center
word that's a one hot vector. We then have a matrix of
the representations of center words. So if we kind of do a multiplication
of this matrix by that vector,
we just sort of actually select
out the column of the matrix which is then the representation
of the center word. Then what we do is we have a second matrix which stores the representations
of the context words. And so for each position in the context, I show three here because
that was confusing enough. We're going to multiply
the vector by this matrix which is the context word representations. And so
we will be picking out sort of the dot products of the center word
with each context word. And it's the same matrix for
each position, right? We only have one context word matrix.
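
A minimal NumPy sketch of this picture, under assumed toy sizes: the names W_center, W_context and skipgram_probs are made up for illustration, and the vectors are random rather than trained. It shows the one-hot vector selecting a column of the center-word matrix, the single context-word matrix producing one dot product per vocabulary word, and a softmax turning those dot products into probabilities.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 6, 4                           # toy vocabulary size and vector dimension (assumed)
W_center = rng.normal(size=(d, V))    # one column per center-word vector
W_context = rng.normal(size=(V, d))   # one row per context-word vector

def skipgram_probs(center_index):
    """Return p(context word | center word) over the whole vocabulary."""
    one_hot = np.zeros(V)
    one_hot[center_index] = 1.0
    v_c = W_center @ one_hot          # matrix times one-hot selects a column
    scores = W_context @ v_c          # dot product u_w . v_c for every word w
    scores = scores - scores.max()    # shift for numerical stability; softmax is unchanged
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

probs = skipgram_probs(center_index=2)
```

The same probs vector is used for every position in the window, since there is only one context-word matrix.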

And then these dot products, we're gonna softmax to turn them
into a probability distribution. And so our model, as a generative model,
is predicting the probability of each word appearing in the context given
that a certain word is the center word. And so if we are actually using
it generatively, it would say, well, the word you should
be using is this one here. But if there is sort of actual ground
truth as to what was the context word, we can sort of say, well, the actual
ground truth was this word appeared.

And you gave a probability
estimate of 0.1 to that word. And so that's the basis, so if you
didn't do a great job at prediction, then there's going to be some loss, okay? But that's the picture of our model. Okay, and so what we wanna do is now learn parameters, these word vectors,
in such a way that we do as good a job at prediction
as we possibly can. And so standardly when we do these things,
what we do is we take all the parameters in our model
and put them into a big vector theta. And then we're gonna say we're gonna do
optimization to change those parameters so as to maximize the objective
function of our model. So what our parameters are is that for each word, we're going to have
a little d dimensional vector, when it's a center word and
when it's a context word. And so
we've got a vocabulary of some size. So we're gonna have a vector for
aardvark as a context word, a vector for art as a context word.

We're going to have a vector
for aardvark as a center word, a vector for art as a center word. So our vector in total is
gonna be of length 2dV. It's gonna be a big long vector that
has everything that was shown in those matrices before. And that's what we're then gonna
be optimizing. And so after the break, I'm going to be going through concretely how
we do that optimization. But before the break, we have the intermission with
our special guest, Danqi Chen. >> Hi, everyone. I'm Danqi Chen, and
I'm the head TA of this class. So today I will start our first
research highlight session, and I will introduce you
a paper from Princeton. The title is A Simple but Tough-to-beat
Baseline for Sentence Embeddings. So today we are learning the word
vector representations, so we hope these vectors can
encode the word meanings.

But a central question in natural
language processing, and also in this class, is how we could have vector
representations that encode the meaning of sentences like,
natural language processing is fun. So with these sentence representations,
we can compute the sentence similarity using
the inner product of the two vectors. So, for example, "Mexico wishes to
guarantee citizens' safety" and "Mexico wishes to avoid more violence."
So we can use the vector
representations to predict that these two sentences are pretty similar. We can also use this sentence
representation as features to do some sentence
classification task, for example, sentiment analysis. So given a sentence like
"natural language processing is fun," we can put our classifier on top
of the vector representation and predict if the sentiment is positive. Hopefully this is right. So there is a wide range of
methods that compose word vector representations into sentence
vector representations. So the simplest way is
to use the bag-of-words. So with the bag-of-words, the vector representation of
"natural language processing"
is just the average of the three single
word vector representations: natural, language, and processing.
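
As a rough illustration of the bag-of-words average and the inner-product similarity mentioned above, here is a small sketch. The 3-dimensional word vectors are invented numbers, not learned embeddings, and bag_of_words_embedding is a hypothetical helper name.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors, made up for illustration only.
word_vectors = {
    "natural":    np.array([0.2, 0.1, 0.7]),
    "language":   np.array([0.4, 0.3, 0.5]),
    "processing": np.array([0.0, 0.6, 0.3]),
    "is":         np.array([0.1, 0.1, 0.1]),
    "fun":        np.array([0.9, 0.2, 0.1]),
}

def bag_of_words_embedding(tokens):
    """Average the word vectors of the tokens (unweighted bag-of-words)."""
    return np.mean([word_vectors[t] for t in tokens], axis=0)

phrase = bag_of_words_embedding(["natural", "language", "processing"])
sentence = bag_of_words_embedding(["natural", "language", "processing", "is", "fun"])
similarity = float(phrase @ sentence)   # inner product used as a similarity score
```

Word order plays no role here: any permutation of the same tokens gives the same vector.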

Later in this quarter, we'll learn a bunch
of complex models, such as recurrent neural nets, recursive neural nets,
and convolutional neural nets. But today, for this paper from Princeton,
I want to point out that it introduces a very
simple unsupervised method. That is essentially just
a weighted bag-of-words sentence representation, plus removing
some special direction. I will explain this. So they have two steps.

So the first step is that, just like how
we compute the average of the vector representations, they also do this,
but each word has a separate weight. Now here, a is a constant, and p(w)
is the frequency of this word. So this basically means that the average representation
down-weights the frequent words. That's the very simple Step 1. So for Step 2, after we compute
all of these sentence vector representations, we compute
the first principal component and also subtract the projections onto
this first principal component. You might be familiar with this
if you have ever taken CS 229 and learned PCA. So that's it. That's their approach. So in this paper,
they also give a probabilistic interpretation about why
they want to do this.
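
The two steps can be sketched roughly as follows. This is not the authors' code: it assumes the per-word weight has the form a / (a + p(w)), which matches the description of a constant a and a word frequency p(w), and it takes the first right singular vector of the uncentered sentence matrix as the "first principal component". The function name sif_embeddings, the default value of a, and the toy frequencies are all assumptions.

```python
import numpy as np

def sif_embeddings(sentences, word_vectors, word_freq, a=1e-3):
    """Weighted-average sentence vectors with the first principal direction removed.

    sentences:    list of token lists
    word_vectors: dict token -> 1-D numpy array
    word_freq:    dict token -> estimated unigram probability p(w)
    a:            smoothing constant (the default here is an assumption)
    """
    # Step 1: weighted average; the weight a / (a + p(w)) down-weights frequent words.
    rows = []
    for tokens in sentences:
        weights = np.array([a / (a + word_freq[t]) for t in tokens])
        vectors = np.array([word_vectors[t] for t in tokens])
        rows.append(weights @ vectors / len(tokens))
    X = np.array(rows)

    # Step 2: first principal direction via SVD (assumption: no mean-centering).
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    u = vt[0]                                  # unit-length direction
    return X - np.outer(X @ u, u), u           # subtract each row's projection onto u

# Toy inputs: random vectors and invented frequencies, for illustration only.
rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "dog", "ran"]
word_vectors = {w: rng.normal(size=5) for w in vocab}
word_freq = {"the": 0.05, "cat": 0.001, "sat": 0.0005, "dog": 0.001, "ran": 0.0005}
sentences = [["the", "cat", "sat"], ["the", "dog", "ran"], ["the", "cat", "ran"]]
embeddings, direction = sif_embeddings(sentences, word_vectors, word_freq)
```

After Step 2 every sentence vector is orthogonal to the removed direction, which is what "subtract the projections" amounts to.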

So basically, the idea is that given the
sentence representation, the probability of emitting a single word is
related to the frequency of the word,
and also related to how close the word is
to this sentence representation. And also there's a c0 term that
means the common discourse vector. That's usually related to some syntax. So, finally, the results. So first, they run experiments
on sentence similarity, and they show that this simple approach
is much better than the average of word vectors or the TF-IDF weighting, and also outperforms
other sophisticated models.

And also for some supervised tasks
like sentence classification, they're also doing pretty well,
like the entailment and sentiment tasks. So that's it, thanks. >> Thank you. [LAUGH]
>> [APPLAUSE] >> Okay, so, and we'll go back from there. All right, so now we're wanting to sort
of actually work through our model. So this is what we had, right? We had our objective function where we
wanna minimize negative log likelihood. And this is the form of the probability
distribution up there, where we have these sort of word vectors with both center
word vectors and context word vectors. And the idea is we want to change
our parameters, these vectors, so as to minimize the negative log likelihood,
i.e., maximize the probability we predict. So if that's what we want to do, how can we work out how
to change our parameters? Gradient, yes,
we're gonna use the gradient.
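
The slide being pointed at is not reproduced in the transcript. As a hedged reconstruction, the standard skip-gram objective and softmax have the form below, where T (the number of positions in the corpus) and m (the window size) are symbols assumed here, u_w is the context vector and v_c the center vector:

```latex
J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \log p(w_{t+j} \mid w_t),
\qquad
p(o \mid c) = \frac{\exp(u_o^{\top} v_c)}{\sum_{w=1}^{V} \exp(u_w^{\top} v_c)}
```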

So, what we're gonna have to do
at this point is to start to do some calculus to see how
we can change the numbers. So precisely, what we're going
to want to do is to say, well, we have this term for
working out log probabilities. So, we have the log of the probability
of word t plus j given word t.

Well, what is the form of that? Well, we've got it right here. So, we have the log of, well,
maybe I can save a line. We've got this log of this. And then, what we're gonna want to do is
change this so that we have, I'm sorry,
minimized this objective. So, let's suppose we sort of
look at these center vectors. So, what we're gonna want to do is start
working out the partial derivatives of this with respect to the center
vector, which is then going to tell us
in which way to change this vector to minimize our objective function. Okay, so, we want to deal with this. So, what's the first thing we can
do with that to make it simpler? Subtraction, yeah.

So, this is a log of a division so, we can
turn that into a subtraction of logs, and then, we can do the partial
derivatives separately. So, we have the derivative with respect to v_c of the log of the exp of u_o^T v_c, and then, we've got minus the log of the sum from w = 1 to V of the exp of u_w^T v_c. And at that point,
we can separate it into two pieces, right, cuz when there's addition or
subtraction we can do them separately. So, we can do this piece 1 and
we can work out the partial
derivatives of this piece 2. So, piece 1 looks kind of easy so,
let's start here. So, what's the first thing I
should do to make this simpler? Easy question. Cancel some things out: log and exp are inverses
of each other so, they can just go away. So, for 1,
we can say that this is going to be the partial derivative with
respect to v_c of u_o^T v_c. Okay, that's looking kind of simpler so, what is the partial derivative of this with respect to v_c? u_o, so, this just comes out as u_o.

Okay, and so, I mean, effectively, this is
the kind of level of calculus that you're gonna have to be able to do to be okay on
assignment one that's coming out today. So, it's nothing that life threatening,
hopefully, you've seen this before. But nevertheless, we are here using
calculus with vectors, right? So, vc here is not just a single number,
it's a whole vector. So, that's sort of the Math 51,
CME 100 kind of content. Now, if you want to,
you can pull it all apart. And you can work out
the partial derivative with respect to (v_c)_k, some index k. And then, you could have this as the sum from l = 1 to d of (u_o)_l (v_c)_l. And what will happen then is if you're
working it out with respect to only one index, then, all of these terms will go
away apart from the one where k equals l.
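
Written out with indices, the step just described looks like this (u_o is the context vector of the observed word, v_c the center vector, d the dimension):

```latex
\frac{\partial}{\partial (v_c)_k} \, u_o^{\top} v_c
= \frac{\partial}{\partial (v_c)_k} \sum_{l=1}^{d} (u_o)_l \, (v_c)_l
= (u_o)_k
\quad\Longrightarrow\quad
\frac{\partial}{\partial v_c} \, u_o^{\top} v_c = u_o
```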

And you'll sort of end up with
that being the (u_o)_k term. And I mean, if things get confusing and
complicated, and your brain is small like mine, I think it can
actually be useful to sort of go down to the level of working it out with real
numbers and actually have all the indices there. You can absolutely do that, and
it comes out the same. But a lot of the time it's sort
of convenient if we can just stay at this vector level and
work out vector derivatives, okay. So, now, this was the easy part and we've got it right there and
we'll come back to that, okay. So then, the trickier part is we then,
go on to number 2. So now, if we just ignore the minus sign for a little bit, so, we'll subtract it afterwards, we've then got the partial derivative with respect to v_c of the log of the sum from w = 1 to V of the exp of u_w^T v_c,
okay. Well, how can we make
progress with this half? Yeah, so that's right, but
before you do that? The chain rule, okay, so, our key tool
that we need to know how to use and we'll just use everywhere
is the chain rule, right? So, neural net people talk all
the time about backpropagation, it turns out that backpropagation
is nothing more than the chain rule with some efficient storage
of partial quantities so that you don't keep on calculating
the same quantity over and over again.

So, it's sort of like the chain
rule with memoization, that is the backpropagation algorithm. So, now, the key tool is the chain rule so,
what is the chain rule? So, we're then saying, okay, well, what overall we're going to have
is some function where we're taking f(g(u)) of something. And so, we have this inside part z = g(u), and so, what we're going to be doing is that
we're going to be taking the derivative of the outside part,
evaluated at the value of the inside. And then, we're gonna be multiplying by
the derivative of the inside part. So for this here, the outside part,
here's our f. And then here's our inside part z. So the outside part is f,
which is a log function. And so the derivative of a log
function is the one over x function.

So we're then gonna be having that this is 1 over the sum from w = 1 to V of the exp of u_w^T v_c. And then we're going to be multiplying
it by what we get over there. So we get the partial
derivative with respect to v_c of this inside part. The sum of, and it's a little trickier. We really need to be careful of indices, so we're gonna get in a bad mess if
we have w here, and we reuse w here.

We really need to change
it into something else. So we're gonna have the sum from x = 1 to V, and then we've got the exp of u_x^T v_c. So that's made a little bit of progress. We want to make a bit more progress here. So what's the next thing we're gonna do? Distribute the derivative. This is just adding some stuff. We can do the same trick: we can do
each part of the derivative separately. So it's the sum from x = 1 to big V of
the partial derivative with respect to v_c of the exp of u_x^T v_c. Okay, now we wanna keep
going. What can we do next? The chain rule again. This also has the form of, here's our f and
here's our inner value, which is in
turn sort of a function.

Yeah, so we can apply the chain
rule a second time, and so we need the derivative of exp. What's the derivative of exp? exp, so this part here is gonna be staying. The sum from x = 1 to V
of the partial derivative. Hold on, no. Not that one, moving that inside. So it's still exp at its value of u_x^T v_c. And then we're having the partial derivative with respect to v_c of u_x^T v_c. And then we've got a bit
more progress to make. So we now need to work out what this is. So what's that? Right, so
that's the same as sort of back over here.

At this point this is just going to be,
that's coming out as u_x. And here we still have the sum from x = 1 to V of the exp of u_x^T v_c. So at this point we kind of wanna
put this together with that. Cuz we're still, I stopped writing that, but we have this 1 over the sum from w = 1 to V of the exp of u_w^T v_c. Can we put those things together
in a way that makes it prettier? So I can move this inside this sum,
cuz this is just the sort of number that's
a multiplier that's distributed through. And in particular when I do that,
I can start to sort of notice this interesting
thing that I'm going to be reconstructing a form that
looks very like this form. Sorry, leaving this part aside, it looks very like the softmax
form that I started off with.

And so I can then be saying that this is the sum from x = 1 to V of the exp of u_x^T v_c over the sum from w = 1 to V (so this is where it's important that I
have x and w as different variables) of the exp of u_w^T v_c, times u_x. And so well, at that point,
that's kind of interesting cuz this is kind of exactly the form
that I started off with for my softmax probability distribution. So what we're doing is that that part is then being the sum over x = 1 to V of the probability of [INAUDIBLE], wait, the probability of x given c, times u_x. So that's what we're getting
from the denominator. And then we still had the numerator. The numerator was u_o. So what we have here as our
final form is u_o minus that. And if you look at this a bit,
it's sort of a form that you always get from these
softmax style formulations.
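
Collected in one place, the steps worked through above are (with p(x | c) the softmax probability of word x given center word c):

```latex
\begin{aligned}
\frac{\partial}{\partial v_c} \log p(o \mid c)
&= \frac{\partial}{\partial v_c} \, u_o^{\top} v_c
 - \frac{\partial}{\partial v_c} \log \sum_{w=1}^{V} \exp(u_w^{\top} v_c) \\
&= u_o - \frac{1}{\sum_{w=1}^{V} \exp(u_w^{\top} v_c)} \sum_{x=1}^{V} \exp(u_x^{\top} v_c)\, u_x \\
&= u_o - \sum_{x=1}^{V} \frac{\exp(u_x^{\top} v_c)}{\sum_{w=1}^{V} \exp(u_w^{\top} v_c)}\, u_x \\
&= u_o - \sum_{x=1}^{V} p(x \mid c)\, u_x
\end{aligned}
```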

So this is what we observed: the actual output
context word that appeared. And this has the form of an expectation. So what we're doing right here is
we're calculating an expectation:
we're working out the probability of every possible
word appearing in the context, and based on that probability we're
taking that much of that u_x. So this is, in sum,
the expectation vector. It's the average over all
the possible context vectors, weighted by their
likelihood of occurrence. That's the form of our derivative. What we're going to want to be doing is
changing the parameters in our model in such a way that these become
equal, cuz that's when the derivative is zero and we're then finding the maximum or
minimum, for us the minimum.
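
One way to convince yourself of the "observed minus expected" form is a numerical check. The sketch below uses random toy vectors (the sizes, the index o, and the function names are assumptions) and compares the analytic gradient with central finite differences.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 7, 5                    # toy vocabulary size and dimension (assumed)
U = rng.normal(size=(V, d))    # row w is the context vector u_w
v_c = rng.normal(size=d)       # center vector
o = 3                          # index of the observed context word

def log_prob(v):
    """log p(o | c) under the softmax over dot products."""
    scores = U @ v
    return scores[o] - np.log(np.sum(np.exp(scores)))

def analytic_grad(v):
    """u_o minus the expected context vector under p(x | c)."""
    scores = U @ v
    p = np.exp(scores - scores.max())
    p /= p.sum()
    return U[o] - p @ U

# Central finite differences as an independent check of the formula.
eps = 1e-6
numeric_grad = np.array([
    (log_prob(v_c + eps * e) - log_prob(v_c - eps * e)) / (2 * eps)
    for e in np.eye(d)
])
max_error = np.max(np.abs(numeric_grad - analytic_grad(v_c)))
```

If the derivation is right, max_error should be tiny compared with the size of the gradient entries.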

[INAUDIBLE] Okay, and so that gives
us the derivatives in that model. Does that make sense? Yeah, there's gonna be a question. Anyway, so precisely doing things like this is what
we'll expect you to do for assignment one. And I'll take the question, but
let me just mention one point. So in this case,
I've only done this for the VC, the center vectors. We do this to every
parameter of the model. In this model, our only other
parameters are the context vectors. We're also gonna do it for those. It's very similar cuz if you look
at the form of the equation, there's a certain
symmetry between the two.

But we're gonna do it for that as well but
I'm not gonna do it here. That's left to you guys. Question. Yeah.
>> [INAUDIBLE] >> From here to here. Okay. So. So, right, so this is a sum right? And this is just the number
at the end of the day. So I can divide every term in
this sum through by that number. So that's what I'm doing. So now I've got my sum with every term
in that divided through by this number. And then I say, wait a minute,
the form of this piece here is precisely my softmax
probability distribution, where this is the probability
of x given c. And so then I'm just rewriting
it as probability of x given c. Where that is meaning,
I kind of did double duty here, but
that's sort of meaning that you're
computing this probability of x given c using this probability form. >> [INAUDIBLE] >> Yeah, the probability that x occurs as
a context word of center word c.

>> Well, we've just assumed some
fixed window size M. So maybe our window size is five and so
we're considering sort of ten words, five to the left, five to the right. So that's a hypergrameter,
and that stuff's nowhere. We're not dealing with that, we just
assume that God's fixed that for us.
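
To make the fixed window concrete, here is a small sketch that enumerates (center, context) pairs for a window size m. The helper name context_pairs and the toy sentence are made up, and truncating the window at the ends of the text is an assumption about edge handling.

```python
def context_pairs(tokens, m):
    """Yield (center, context) pairs for every position, ignoring the offset within the window."""
    for t, center in enumerate(tokens):
        lo = max(0, t - m)                     # window is cut off at the start of the text
        hi = min(len(tokens), t + m + 1)       # and at the end of the text
        for j in range(lo, hi):
            if j != t:                         # the center word is not its own context
                yield center, tokens[j]

tokens = "the quick brown fox jumps".split()
pairs = list(context_pairs(tokens, m=2))
```

Each pair records only which word appeared in the window, not where, matching the point that position within the window is not modeled.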

The probability, so
it's done at each position. So for any position, and
all of them are treated equivalently, for any position,
it's the probability that word x is the word that occurs
within this window at any position, given that
the center word was c. Yeah? >> [INAUDIBLE] >> All right, so the question is, why do we choose the dot
product as our basis for coming up with this probability measure? And you know I think the answer
is there's no necessary reason, that there are clearly other things that you could have done and might do. On the other hand,
I kind of think in terms of Vector Algebra it's sort of the most
obvious and simple thing to do. Because it's sort of a measure of
the relatedness and similarity.

I mean I sort of said loosely it was
a measure of similarity between vectors. Someone could have called me on that
and said, well, wait a minute,
if you don't control for
the scale of the vectors, you can make that number as big
as you want, and that is true. So really the common measure of similarity
between vectors is the cosine measure, where what you do is in the numerator you take a dot product, and then you divide
through by the lengths of the vectors. So you've got scale invariance and you can't just cheat by
making the vectors bigger. And so, that's a
better measure of similarity. But to do that you have to
do a whole lot more math and it's not actually necessary here
because you're sort of predicting every word
against every other word. If you sort of made one
vector very big to try and make some probability
of word k being large,
well, the consequence would be it would
make the probability of every other word be large as well. So you kind of can't cheat
by lengthening the vectors. And therefore you can get away with
just using the dot product as a kind of a similarity measure. Does that sort of satisfy? So yes. I mean, it's not necessary, right? And if we were going to argue,
you could sort of argue with me and say no look, this is crazy,
because by construction, this means the most likely word to appear
in the context of a word is itself.

That doesn't seem like a good result, [LAUGH] because presumably
different words occur. And you could then go from there and say
well no let's do something more complex. Why don't we put a matrix to mediate
between the two vectors to express what appears in the context of each other,
it turns out you don't need to. Now one thing of course is since we
have different representations for the context and center word vectors, it's
not necessarily true that the same word would be highest because there're
two different representations. But in practice they often have a lot
of similarity between themselves, so it's not really that that's the reason. It's more that it sort
of works out pretty well. Because although it is true
that you're not likely to get exactly the same word in the context, you're actually very likely to get words
that are pretty similar in meaning. And are strongly associated and when
those words appear as the center word, you're likely to get your
first word as a context word. And so at a sort of a macro level,
you are actually getting this effect that the same
words are appearing on both sides.
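
The contrast drawn in this answer between the raw dot product and the cosine measure can be seen on a tiny example. The two vectors below are arbitrary numbers chosen for illustration.

```python
import numpy as np

def cosine_similarity(a, b):
    """Dot product divided by the product of the two vector lengths."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 1.0, 0.5])

dot_small = float(a @ b)                    # raw dot product
dot_scaled = float((10 * a) @ b)            # grows with the length of a
cos_small = cosine_similarity(a, b)
cos_scaled = cosine_similarity(10 * a, b)   # unchanged when a is rescaled
```

Scaling a by 10 multiplies the dot product by 10 but leaves the cosine untouched, which is the scale invariance being described.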

More questions, yeah,
there are two of them. I don't know. Do I do the behind person first and
then the in front person? [LAUGH] So I haven't yet done gradient descent. And maybe I should do that in a minute and
I will try then. Okay?
>> [INAUDIBLE] >> Yeah. >> So the truth is, well, we've just collected
a huge amount of text. So for our word at any position, we know
what are the five words to the left and the five words to the right and
that's the truth. And so we're actually giving some
probability estimate to every word appearing in that context and we can say, well, actually the word
that appeared there was household. What probability did you give to that and
there's some answer. And so, that's our truth. Time is running out, so maybe I'd
just better say a little bit more before we finish, which is sort of
starting to do this optimization. So this is giving us our derivatives,
we then want to use our derivatives to be able to
work out our word vectors.

And I mean, I'm gonna spend
a super short amount of time on this; the hope is that through 221,
229 or a similar class, you've seen a little bit of optimization
and you've seen some gradient descent. And so, this is just a very quick review. So the idea is that once we have the gradient
at a point x, if what we do is we subtract off a little
fraction of the gradient, that will move us downhill
towards the minimum. And so if we then calculate the gradient
there again and subtract off a little fraction of it, we'll sort of
start walking down towards the minimum.

And so,
that's the algorithm of gradient descent. So once we have an objective function and
we have the derivatives of the objective function with respect to all of
the parameters, our gradient descent algorithm would be to say,
you've got some current parameter values. We've worked out the gradient
at that position. We subtract off a little
fraction of that and that will give us new parameter
values which we expect to give us a lower objective value,
and we'll walk towards the minimum. And in general, that is true and
that will work. So then, to write that up as Python code, it's really sort of super simple:
you just go in this while-true loop (you have to have some stopping condition
actually) where you're evaluating the gradient given your objective function, your
corpus and your current parameters, so you have the theta grad, and then you're
sort of subtracting a little fraction of the theta grad from the current
parameters, and then you just repeat over.
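
The slide's code is not in the transcript, so the following is only a sketch of the loop as described. It simplifies the gradient function to take just theta (the lecture's version also takes the objective and the corpus), uses a bounded loop with a gradient-norm stopping rule as an assumed stopping condition, and runs on a toy quadratic rather than on word vectors.

```python
import numpy as np

def gradient_descent(evaluate_gradient, theta, alpha=0.1, tol=1e-8, max_steps=10_000):
    """Repeat theta <- theta - alpha * gradient until the gradient is tiny."""
    for _ in range(max_steps):                 # bounded loop instead of a bare `while True`
        theta_grad = evaluate_gradient(theta)
        if np.linalg.norm(theta_grad) < tol:   # stopping condition (an assumed choice)
            break
        theta = theta - alpha * theta_grad     # subtract a small fraction of the gradient
    return theta

# Toy objective J(theta) = ||theta - target||^2, whose gradient is 2 * (theta - target).
target = np.array([3.0, -1.0])
theta_min = gradient_descent(lambda th: 2 * (th - target), theta=np.zeros(2))
```

On this toy problem theta_min ends up at target, the unique minimum of the quadratic.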

And so the picture is, so the red lines
are sort of the contour lines of the value of the objective function. And so what you do is when you
calculate the gradient, it's giving you the direction of the steepest descent and
you walk a little bit each time in that direction and you will hopefully
walk smoothly towards the minimum. Now the reason that might not work is
if you actually take a first step and you go from here to over there,
you've greatly overshot the minimum. So, it's important that alpha be small
enough that you're still walking calmly down towards the minimum, and
then it'll all work. And so, gradient descent is the most
basic tool to minimize functions. So it's the conceptually first thing to
know, but then in the sort of last minute, what I wanted to explain is, actually, we might have 40 billion tokens
in our corpus to go through.

And if you have to work out
the gradient of your objective function relative to a 40 billion word corpus,
that's gonna take forever, so you'll wait for an hour before
you make your first gradient update. And so, you're not gonna be able to train
your model in a realistic amount of time. So for basically
all neural nets, naive batch gradient descent is a hopeless algorithm;
you can't use that. It's not practical to use. So instead, what we do is use
stochastic gradient descent. So, stochastic gradient descent, or
SGD, is our key tool. And so what that's meaning is
we just take one position in the text. So we have one center word and
the words around it, and we say, well, let's just at that one
position work out the gradient with respect to all of our parameters. And using that estimate of
the gradient at that position, we'll walk a little bit in that direction. If you think about it for
doing something like word vector learning, this estimate of the gradient is
incredibly, incredibly noisy, because we've done it at one position which just
happens to have a few words around it.

So the vast majority of the parameters
of our model, we didn't see at all. So, it's a kind of incredibly
noisy estimate of the gradient. Walking a little bit in that direction
isn't even guaranteed to make you walk downhill,
because it's such a noisy estimate. But in practice, this works like a gem. And in fact, it works better. Again, it's a win, win. It's not only that doing things
this way is orders of magnitude faster than batch gradient descent, because you can do an update after you
look at every center word position.
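
A rough sketch of a single-position update, under simplifying assumptions: one (center, context) pair stands in for a whole window, the same hypothetical pair is repeated rather than sampled from a corpus, and the full softmax is used. The array names and sgd_step are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 8, 4                                          # toy sizes (assumed)
center_vecs = rng.normal(scale=0.1, size=(V, d))     # row c is the center vector v_c
context_vecs = rng.normal(scale=0.1, size=(V, d))    # row w is the context vector u_w

def sgd_step(c, o, alpha=0.05):
    """One SGD update of -log p(o | c) for a single (center, context) pair."""
    v_c = center_vecs[c].copy()
    scores = context_vecs @ v_c
    p = np.exp(scores - scores.max())
    p /= p.sum()
    loss = -np.log(p[o])
    # Gradients of the negative log probability, computed before any update.
    grad_v_c = p @ context_vecs - context_vecs[o]    # expected minus observed context vector
    grad_U = np.outer(p, v_c)                        # p(w | c) * v_c for every row w ...
    grad_U[o] -= v_c                                 # ... minus v_c for the observed word
    center_vecs[c] -= alpha * grad_v_c               # only one row of center_vecs changes
    context_vecs[:] -= alpha * grad_U
    return loss

before = center_vecs.copy()
losses = [sgd_step(c=2, o=5) for _ in range(200)]
rows_changed = np.flatnonzero(np.any(center_vecs != before, axis=1))
```

Here rows_changed contains only the index of the one center word that was seen, which illustrates how few of the center-word parameters a single position touches.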

It turns out that neural
network algorithms love noise. So the fact that, in this gradient descent,
the estimate of the gradient is noisy
actually helps SGD to work better
as an optimization algorithm for neural network learning. And so, this is what we're
always gonna use in practice. I have to stop there for today even
though the fire alarm didn't go off. Thanks a lot. >> [APPLAUSE].
