[MUSIC] Stanford University. >> Okay, so let's get going. Welcome back to the second

class of CS224N /Ling 284, Natural Language Processing

with Deep Learning. So this class is gonna be almost

the complete opposite of the last class. So in the last class, it was a very high level picture of

sort of trying from the very top down. Sort of say a little bit about what

is natural language processing, what is deep learning, why it's exciting,

why both of them are exciting and how I'd like to put them together. So for today's class we're gonna go

completely to the opposite extreme. We're gonna go right down

to the bottom of words, and we're gonna have vectors,

and we're gonna do baby math. Now for some of you this will seem

like tedious repetitive baby math. But I think that there are probably

quite a few of you for which having some math review

is just going to be useful. And this is really the sort of foundation

on which everything else builds.

And so if you don't have sort of straight

the fundamentals right at the beginning of how you can use neural networks on the

sort of very simplest kind of structures, it's sort of really all over from there. So what I'd like to do today is

sort of really go slowly and carefully through the foundations of how

you can start to do things with neural networks in this very simple case of

learning representations for words.

And hope that's kind of a good foundation

that we can build on forwards. And indeed that's what we're gonna

keep on doing, building forward. So next week Richard is gonna keep on

doing a lot of math from the ground up to try, and really help get straight some

of the foundations of Deep Learning. Okay, so this is basically the plan. So tiny bit word meaning and no, [LAUGH]. >> [LAUGH]

>> Tiny bit on word meaning then start to introduce this model of learning

word vectors called Word2vec. And this was a model that was

introduced by Tomas Mikolov and colleagues at Google in 2013. And so there are many other

ways that you could think about having representations of words. And next week, Richard's gonna talk

about some of those other mechanisms. But today, I wanna sort of avoid

having a lot of background and comparative commentary. So I'm just gonna present

this one way of doing it. And it's also a pretty standard,

good way of doing it, so it's not a bad one to know. Okay, so then after that,

we're gonna have the first of what's gonna be one of

the features of this class.

We decided that all the evidence says that

students can't concentrate for 75 minutes. So we decided we'd sort of mix

it up a little, and hopefully, also give people an opportunity to sort

of get more of a sense of what some of the exciting new work that's coming

out every month in Deep Learning is. And so what we're gonna do is have one TA

each time, do a little research highlight. Which will just be sort of like

a verbal blog post, telling you a little bit about some recent paper

and why it's interesting, exciting. We're gonna start that today with Danqi. Then after that, I wanna go sort of carefully through the

word2vec objective function gradients.

A little refresher on optimization,

mention the assignment, tell you all about Word2vec.

That's basically the plan, okay? So we kinda wanna sort

of have word vectors, as I mentioned last time, as

a model of word meaning. That's a pretty

controversial idea actually. And I just wanna give kind of a few words

of context before we dive into that and do it anyway. Okay, so

if you look up meaning in a dictionary cuz a dictionary is a storehouse

of word meanings after all. What the Webster's dictionary says is

meaning is the idea that is represented by a word, phrase, etc. The idea that a person wants to express

by using words, signs, etc, etc. In some sense, this is fairly close to what is the commonest linguistic

way of thinking of meaning. So standardly in linguistics,

you have a linguistic sign like a word, and then it has things that

it signifies in the world.

So if I have a word like glasses then

it's got a signification which includes these and there are lots of other pairs of

glasses I can see in front of me, right? And those things that it signifies are the denotation of the term glasses. That hasn't proven to be a notion of

meaning that's been very easy for people to make much use of in computational

systems for dealing with language. So in practice, if you look at what

computational systems have done for meanings of words over

the last several decades. By far the most common thing that's

happened is, people have tried to deal with the meaning of words by

making use of taxonomic resources.

And so for English, the most

famous taxonomic resource is WordNet. And it's famous,

maybe not like Webster's is famous. But it's famous among

computational linguists. Because it's free to download a copy and that's much more useful than having

a copy of Webster's on your shelf. And it provides a lot of

taxonomy information about words. So this little bit of Python code. This is showing you getting a hold of

word net using the nltk which is one of the main Python packages for nlp. And so then I'm asking it for

the word panda, not the Python package Panda,

the word panda. Then I'm saying, well, tell me about the hypernyms, the kind

of things that it's a kind of. And so for panda it's sort of heading

up through carnivores, placentals, mammals up into sort of

abstract types like objects. Or on the right hand side,

I'm sort of asking for

the word good: tell me about the synonyms of good.

And part of what you're finding there is,

well WordNet is saying, well the word good has different senses. So for each sense, let me tell

you some synonyms for each sense. So one sense, the second one is sort

of the kind of good person sense. And they're suggesting synonyms

like honorable and respectable. But there are other ones here

where this pair is good to eat and that's sort of meaning is ripe. Okay, so

you get this sort of sense of meaning.

That's been a great resource, but it's also been a resource where

people have found, in practice, it's hard to get nearly as much value out

of it as you'd like to get out of it. And why is that? I mean there are a whole bunch of reasons. I mean one reason is that at this level

of this sort of taxonomic relationships, you lose an enormous amount of nuance.

So one of those synonym sets for

good was adept, expert, good, practiced, proficient, skillful. But I mean, it seems like those mean

really different things, right? It seems like saying I'm

an expert at deep learning. Means something slightly different

to I'm good at deep learning. So there's a lot of nuance there. There's a lot of incompleteness in WordNet

so for a lot of the ways that people use words more flexibly. So if I say I'm a deep-learning ninja, or something like that,

that's not in WordNet at all.

What kind of things you put into these

synonym sets ends up very subjective, right? Which sense distinctions you make and

which things you do and don't say are the same,

it's all very unclear. It requires,

even to the extent that it's been made, it's required many

person-years of human labor. And at the end of the day,

it's sort of, it's kind of hard to get anything accurate out of it

in the way of sort of word similarities.

Like I kind of feel that proficient is

more similar to expert than good, maybe. But you can't get any of this

kind of stuff out of WordNet. Okay, so therefore,

that's sort of something of a problem. And it's part of this

general problem of discrete, or categorical, representations

that I started on last time. So, the fundamental thing

to note is that for sorta just about all NLP,

apart from both modern deep learning and a little bit of neural network

NLP that got done in the 1980s, it's all used atomic symbols

like hotel, conference, walk.

And if we think of that from our kind

of jaundiced neural net direction, using atomic symbols is kind of like using big vectors that are zero everywhere

apart from a one in one position. So what we have is, we have a lot of words

in the language that are equivalent to our symbols and we're putting a one

in the position, in the vector, that represents the particular symbol,

perhaps hotel. And these vectors are going to be really,

really long. I mean,

how long depends on how you look at it. So sometimes a speech recognizer

might have a 20,000 word vocabulary. So it'd be that long. But, if we're kinda building

a machine translation system, we might use a 500,000 word vocabulary,

so that's very long. And Google released sort of

a 1-terabyte corpus of web crawl.

That's a resource that's been

widely used for a lot of NLP. And the size of the vocabulary

in that is 13 million words, so that's really, really long. So, it's a very, very big vector. And so, why are these vectors problematic? I'm sorry, I'm not remembering my slides, so I

should look at my slides first. Okay, so this is referred to in

neural net land as one-hot encoding, because there's just this

single one, with zeros elsewhere, in the vector. And so, that's an example of

a localist representation. So why is this problematic? And the reason why it's

problematic is it doesn't give any inherent notion of

relationships between words. So, very commonly what we want

to know is when meanings and words and

phrases are similar to each other. So, for example, in a web search

application, if the user searches for Dell notebook battery size, we'd like to match a document that

says Dell laptop battery capacity.

So we sort of want to know that

notebooks and laptops are similar, and size and capacity are similar,

so this will be equivalent. We want to know that hotels and

motels are similar in meaning. And the problem is that if we're

using one-hot vector encodings, they have no natural notion of similarity. So if we take these two vectors and say, what is the dot product between

those vectors, it's zero. They have no inherent

notion of similarity. And, something I just wanna stress

a little, since this is important, is note this problem of symbolic

encoding applies not only to traditional rule-based logical approaches to

natural language processing, but it also applies to basically all of

the work that was done in probabilistic, statistical, conventional

machine-learning-based natural language processing. Although those latter models normally had

real numbers, they had probabilities of something occurring in the context of

something else, that nevertheless, they were built over

symbolic representations. So that you weren't having any kind of

capturing of relationships between words in the models;

each word was a nation unto itself.
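To make that concrete, here's a minimal sketch of one-hot vectors and their dot product; the five-word vocabulary is made up purely for illustration.

```python
import numpy as np

# Toy vocabulary, made up for illustration.
vocab = ['hotel', 'motel', 'conference', 'walk', 'battery']
index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Localist representation: all zeros except a one in the word's position."""
    v = np.zeros(len(vocab))
    v[index[word]] = 1.0
    return v

hotel, motel = one_hot('hotel'), one_hot('motel')
# The dot product of any two distinct one-hot vectors is zero:
# the encoding carries no notion that hotel and motel are similar.
print(hotel @ motel)  # 0.0
```

However large you make the vocabulary, every pair of distinct words comes out equally dissimilar.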

Okay, so that's bad, and

we have to do something about it. Now, as I've said, there's more than

one thing that you could do about it. And so, one answer is to say, okay gee, we need to have a similarity

relationship between words. Let's go over here and start building completely separately

a similarity relationship between words. And, of course, you could do that. But I'm not gonna talk about that here. What instead I'm going to talk about and

suggest is that what we could do is we could

explore this direct approach, where the representation of

a word encodes its meaning in such a way that you can

just directly read off from these representations,

the similarity between words. So what we're gonna do is

have these vectors and do something like a dot product. And that will be giving us a sense

of the similarity between words. Okay, so how do we go about doing that? And so the way we gonna go about

doing that is by making use of this very simple, but

extremely profound and widely used, NLP idea called distributional similarity. So this has been a really powerful notion.

So the notion of distributional similarity

is that you can get a lot of value for representing the meaning of a word

by looking at the context in which it appears and

doing something with those contexts. So, if I want to know what

the word banking means, what I'm gonna do is find thousands of

instances of the word banking in text and I'm gonna look at the environment

in which each one appeared. And I'm gonna see debt problems,

governments, regulation, Europe, words like unified, and I'm gonna start counting up all of these

things that appear and by some means, I'll use those words in the context

to represent the meaning of banking. The most famous slogan that you

will read everywhere if you look into distributional similarity is this one

by JR Firth, who was a British linguist, who said, you shall know a word

by the company it keeps. But this is also really exactly the same

notion that Wittgenstein proposed in his later writings where he

suggested a use theory of meaning. Where, somewhat controversially,

this is not the mainstream in semantics, he suggested that the right way to

think about the meaning of words is understanding their uses in text.
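The counting procedure described above for banking can be sketched like this; the tiny corpus and the window size are made up for illustration.

```python
from collections import Counter

# Tiny made-up corpus, purely illustrative.
corpus = ("government debt problems turning into banking crises as has happened "
          "in europe unified banking regulation will replace the hodgepodge "
          "of debt rules").split()

def context_counts(target, tokens, window=3):
    """Count the words appearing within `window` positions of each
    occurrence of `target`, excluding the target itself."""
    counts = Counter()
    for t, word in enumerate(tokens):
        if word == target:
            lo, hi = max(0, t - window), min(len(tokens), t + window + 1)
            counts.update(tokens[lo:t] + tokens[t + 1:hi])
    return counts

# Words like debt, crises, regulation, europe end up representing "banking".
print(context_counts('banking', corpus).most_common(5))
```

Some means of using those counts, or of predicting those context words, then gives the representation of banking.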

So, essentially,

if you could predict which textual context the word would appear in, then you

understand the meaning of the word. Okay, so that's what we're going to do. So what we want to do is say for

each word we're going to come up with a vector for it, and that dense vector

is gonna be chosen so that it'll be good at predicting other words

that appear in the context of this word. Well how do we do that? Well, each of those other words will also

have a word vector attached to them, and then we'll be looking at sort

of similarity measures like dot product between those two vectors.

And we're gonna change

them as well, to make it so that they're good at being able to be predicted. So it all kind of gets a little

bit recursive or circular, but we're gonna come up with this

clever algorithm to do that, so that words will be able to predict

their context words and vice-versa. And so I'm gonna go on and

say a little bit more about that. But let me just underline one bit of terminology that was

appearing before in the slide. So we saw two keywords. One was distributional, which was here. And then we've had

distributed representations where we have these dense vectors to

represent the meaning of the words.

Now people tend to

confuse those two words. And there's sort of two

reasons they confuse them. One is because they both start with

distribute and so they're kind of similar. And the second reason people confuse them

is because they very strongly co-occur. So that distributed representations of

meaning have almost always, up until now, been built by

using distributional similarity. But I did just want people to gather

that these are different notions, right? So the idea of distributional similarity

is a theory about semantics of word meaning that you can describe the meaning

of words, as a use theory of meaning, by understanding the context

in which they appear. So distributional contrasts with,

way back here when I said but didn't really explain,

denotational, right? The denotational idea of

word meaning is the meaning of glasses is the set of pairs of

glasses that are around the place.

That's different from

distributional meaning. And distributed then contrasts

with our one-hot word vectors. So the one-hot word vectors are a localist

representation, where you're storing things in one place. You're saying here is the symbol glasses. It's stored right here, whereas

in distributed representations we're smearing the meaning of

something over a large vector space. Okay, so that's part one. And we're now gonna sorta be heading into

part two, which is what is Word2vec? Okay, and so

I'll go almost straight into this. But this is sort of the recipe in

general for what we're doing for learning neural word embeddings. So we're gonna define a model

that aims to predict between a center word and

words that appear in its context. Kind of like we have here,

in the distributional wording. And we'll sort of have some,

perhaps, probability measure that predicts the probability of

the context given the word.

And then once we have that we can

have a loss function as to whether we do that prediction well. So ideally we'd be able to perfectly

predict the words around the word. So the minus t means the words that aren't

at word index t, so the words around t. If we could predict those perfectly

from t we'd have probability one so we'd have no loss but

normally we can't do that. And if we give them probability a quarter

then we'll have sort of three quarters loss or something, right? So we'll have a loss function and we'll sort of do that in many

positions in a large corpus.
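Concretely, with the negative log probability loss that gets formalized shortly, a perfect prediction incurs zero loss, and lower probabilities incur larger losses; a minimal sketch:

```python
import math

# Loss for predicting a context word that was given probability p:
# the negative log probability (the criterion formalized later in the lecture).
def loss(p):
    return -math.log(p)

print(loss(1.0))    # 0.0: a perfect prediction incurs no loss
print(loss(0.25))   # ~1.39: an imperfect prediction incurs loss
```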

And so our goal will be to change

the representations of words so as to minimize our loss. And at this point sort

of a miracle occurs. It's sort of surprising, but

true that you can do no more than set up this kind of

prediction objective. Make it the job of every word's word

vectors to be such that they're good at predicting the words that

appear in their context, or vice versa. You just have that very simple goal and

you say nothing else about how this is gonna be achieved, but you just pray and

depend on the magic of deep learning. And this miracle happens, and out come these word vectors that

are just amazingly powerful at representing the meaning of words and

are useful for all sorts of things. And so that's where we want to get into

more detail and say how that happens. Okay. So that notation was

meant to mean all the words apart from wt. Yes,

what does this w minus t mean? I'm actually not gonna use that

notation again in this lecture.

But the w minus t, minus is sometimes

used to mean everything except t. So wt is my focus word, and w minus

t is all the words in the context. Okay, so this idea that you can

learn low dimensional vector representations is an idea that

has a history in neural networks. It was certainly present in the 1980s, parallel distributed processing

era including work by Rumelhart on learning representations

by back-propagating errors. It really was demonstrated for

word representations in this pioneering early paper by Yoshua Bengio in 2003,

A Neural Probabilistic Language Model. I mean, at the time, sort of not so many

people actually paid attention to this paper, this was sort of before

the deep learning boom started. But really this was the paper

where they sort of showed how much value you could get from having

distributed representations of words and be able to predict other words in context.

But then as things started to take off

that idea was sort of built on and revived. So in 2008, Collobert and Weston started

in the sort of modern direction by saying, well, if we just want good word

representations, we don't even have to necessarily make a probabilistic

language model that can predict, we just need to have a way of

learning our word representations. And that's something that's then being

continued in the model that I'm gonna look at now, the word2vec model.

That the emphasis of the word2vec model

was how can we build a very simple, scalable, fast to train model

that we can run over billions of words of text that will produce

exceedingly good word representations. Okay, word2vec, here we come. The basic thing word2vec is trying

to do is use this theory of meaning to predict between every word and

its context words. Now word2vec is a piece of software,

I mean, actually inside word2vec it's kind

of a sort of a family of things. So there are two algorithms inside it for

producing word vectors and there are two moderately

efficient training methods. So for this class what I'm

going to do is tell you about one of the algorithms which

is a skip-gram method and about neither of the moderately

efficient training algorithms. Instead I'm gonna tell you about

the hopelessly inefficient training algorithm, which is sort of the conceptual

basis of how this is meant to work, and then the moderately efficient ones,

which I'll mention at the end.

And then what you'll have to do to

actually make this a scalable process that you can run fast. And then, today is also the day

when we're handing out assignment one, and a major part of what you

guys get to do in assignment one is to implement one of

the efficient training algorithms, and to work through the math of one of

those efficient training algorithms. So this is the picture

of the skip-gram model. So the idea of the skip-gram model is for each estimation step,

you're taking one word as the center word. So here, that's my word banking, and

then what you're going to do is you're going to try and predict words

in its context out to some window size. And so, the model is going to define

a probability distribution that is the probability of a word appearing

in the context given this center word. And we're going to choose vector

representations of words so we can try and

maximize that probability distribution. And there's a thing that we'll come back to, but it's important to realize: there's

only one probability distribution in this model.

It's not that there's

a probability distribution for the word one to the left and the word

one to the right, and things like that. We just have one probability

distribution of a context word, which we'll refer to as the output,

because it's what the model produces as the output, occurring in

the context close to the center word. Is that clear? Yeah, okay. So that's what we kinda wanna do so

we're gonna have a radius m and then we're going to predict

the surrounding words from sort of positions m before our center

word to m after our center word. And we're gonna do that a whole bunch

of times in a whole bunch of places. And we want to choose word vectors such that we're maximizing

the probability of that prediction. So what our loss function or objective

function is, is really this J prime here. So the J prime is saying we're going to,

so we're going to take a big long amount of text, we take the whole

of Wikipedia or something like that, so we've got a big long sequence of words, so

these are words in context in real running text, and we're going to

go through each position in the text.

And then, for each position in the text,

we're going to have a window of size 2m around it,

m words before and m words after it. And we're going to have a probability

distribution that will give a probability to a word appearing in

the context of the center word. And what we'd like to do is set

the parameters of our model so that these probabilities

of the words that actually do appear in the context of the center

word are as high as possible.
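That loop, every position t with a window of m words on each side, can be sketched as follows; the toy corpus and m = 2 are made up for illustration.

```python
# Sketch of iterating over every position in a corpus and extracting
# the 2m surrounding context words (toy corpus and m made up).
corpus = "the quick brown fox jumps over the lazy dog".split()
m = 2  # window radius

for t, center in enumerate(corpus):
    # m words before and m words after position t, excluding the center itself.
    context = corpus[max(0, t - m):t] + corpus[t + 1:t + m + 1]
    print(center, '->', context)
# e.g. position 2: brown -> ['the', 'quick', 'fox', 'jumps']
```

Each (center, context-word) pair from this loop is one prediction the model is scored on.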

So the parameters in this model are these

thetas here that I show here and here. After this slide,

I kinda drop the theta over here. But you can just assume

that there is this theta. And what is this theta? It's going to be the vector

representations of the words. The only parameters in this model are

the vector representations of each word. There are no other parameters whatsoever

in this model, as you'll see pretty quickly. So conceptually this is

our objective function. We wanna maximize the probability

of these predictions. In practice, we just slightly tweak that. Firstly, almost invariably, when

we're working with probabilities and

we want to do maximization, we actually turn things into log probabilities, cuz then all the products turn into sums and our math gets a lot easier to work with

and so that's what I've done down here. Good point. And the question is, hey, wait a minute,

you're cheating, window size, isn't that a parameter of the model? And you are right,

that is a parameter of the model.

So I guess I was a bit loose there. Actually, it turns out that there are

several hyperparameters of the model, so I did cheat. It turns out that there are a few

hyperparameters of the model. One is window size, and it turns out

that we'll come across a couple of other fudge factors later in the lecture. And all of those things are

hyperparameters that you could adjust. But let's just ignore those for

the moment, let's just assume those are constant. And given those things

aren't being adjusted, the only parameters in the model are

the vector representations of the words. What I'm meaning is that there's

sort of no other probability distribution with its own parameters. That's a good point. I buy that one. So we've gone to the log probabilities and

the sums now. And then, rather than having

the probability of the whole corpus, we can sort of take the average over

each position, so I've got 1 over T here.

And that's just sort of making it

per-word, as a kind of normalization. So that doesn't affect where the maximum is. And then, finally,

the machine learning people really love to minimize things

rather than maximizing things. And so, you can always swap

between maximizing and minimizing, when you're in plus minus land, by

putting a minus sign in front of things. And so, at this point,

we get the negative log likelihood, the negative log probability

according to our model. And so, that's what we will be formally minimizing as our objective function.
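Written out, the two objectives just described, the likelihood J prime and the averaged negative log likelihood J that we actually minimize, are, in the standard word2vec notation with window radius m, T positions, and parameters theta:

```latex
% Likelihood over the corpus: each position t predicts its window.
J'(\theta) = \prod_{t=1}^{T} \; \prod_{\substack{-m \le j \le m \\ j \neq 0}} p(w_{t+j} \mid w_t; \theta)

% Negative log likelihood, averaged per word: what we formally minimize.
J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \; \sum_{\substack{-m \le j \le m \\ j \neq 0}} \log p(w_{t+j} \mid w_t; \theta)
```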

So whether we say objective function, cost

function, loss function, they're all the same. This negative log likelihood criterion

really means that we're using a cross-entropy loss, which we're

gonna come back to next week, so I won't really go through it now. But the trick is, since we

have a one hot target, which is just predict the word

that actually occurred. Under that criterion, the only

thing that's left in the cross-entropy loss is the negative

log probability of the true class. Well, how are we gonna actually do this? How can we make use of

these word vectors to minimize that negative log likelihood? Well, the way we're gonna

do it is we're gonna come up with the probability

distribution of a context word, given the center word, which is

constructed out of our word vectors.

And so, this is what our probability

distribution is gonna look like. So just to make sure we're clear on

the terminology I'm gonna use forward from here. So c and o are indices in the space

of the vocabulary, the word types. So up here, the t and the t plus j

are positions in my text. Those are sort of word

763 and word 766 in my text. But here, o and c index into my vocabulary

of word types, and so I have, say, words 73 and

47 in my vocabulary. And so, each word type is going to

have a vector associated with it, so u o is the vector associated

with the context word at index o, and v c is the vector that's

associated with the center word. And so, how we find this probability

distribution is we're going to use this, what's called a Softmax form,

where we're taking dot products between the two word vectors, and then we're

putting them into a Softmax form. So just to go through that kind

of maximally slowly, right? So we've got two word vectors and

we're gonna dot product them, which means that we

take the corresponding terms and multiply them together and

sort of sum them all up.
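That is, with two toy vectors made up for illustration:

```python
import numpy as np

# Dot product: multiply corresponding terms and sum them up.
u = np.array([1.0, 0.5, -2.0])
v = np.array([0.5, 1.0, -1.5])
print(float(u @ v))  # 0.5 + 0.5 + 3.0 = 4.0
```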

So a dot product is sort of like

a loose measure of similarity: as the contents of the vectors

are more similar to each other, the number will get bigger. So that's kind of a similarity

measure through the dot product. And then once we've worked out

dot products between words we're then putting it

in this Softmax form. So this Softmax form is a standard way to turn numbers into

a probability distribution. So when we calculate dot products,

they're just numbers, real numbers. They could be minus 17 or 32. So we can't directly turn those

into a probability distribution so an easy thing that we can

do is exponentiate them. Because if you exponentiate things

that puts them into positive land so it's all gonna be positive. And that's a good basis for

having a probability distribution. And if you have a bunch of numbers that

come from anywhere that are positive and you want to turn them into a probability

distribution that's proportional to the size of those numbers,

there's a really easy way to do that. Which is you sum all the numbers together

and you divide through by the sum and that then instantly gives you

a probability distribution.
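Those two steps, exponentiate and then divide by the sum, are all there is to it; the toy scores here are made up for illustration.

```python
import numpy as np

def softmax(scores):
    """Turn arbitrary real scores into a probability distribution:
    exponentiate (everything becomes positive), then normalize by the sum."""
    exp = np.exp(scores - np.max(scores))  # subtract the max for numerical stability
    return exp / exp.sum()

# Toy dot-product scores: any real numbers, e.g. could be -17 or 32.
scores = np.array([2.0, 1.0, -3.0])
p = softmax(scores)
print(p)          # larger scores dominate after exponentiating: the "soft" max
print(p.sum())    # 1.0, so it's a valid probability distribution
```

Subtracting the maximum before exponentiating doesn't change the result but keeps the exponentials from overflowing.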

So the denominator is then

normalizing to give a probability, and so when you put those together, that then

gives us this form that we're using as our Softmax form which is now

giving us a probability estimate. So that's giving us this

probability estimate here built solely in terms of

the word vector representations. Is that good? Yeah. That is an extremely good question and

I was hoping to delay saying that for just a minute but you've asked and

so I will say it. Yes, you might think that one word should

only have one vector representation. And if you really wanted to you could

do that, but it turns out you can make the math considerably easier by

saying, now, actually each word has two vector representations: it has one vector

representation when it's the center word, and it has another vector representation

when it's a context word.

So that's formally what we have here. So the v is the center word vectors,

and the u are the context word vectors. And it turns out not only does

that make the math a lot easier, because the two

representations are separated when you do optimization rather

than tied to each other. It's actually in practice empirically

works a little better as well, so if your life is easier and

better, who would not choose that? So yes, we have two vectors for each word. Any other questions? Yeah, so the question is,

well wait a minute, you just said this was a way to

make everything positive, but actually you also simultaneously

screwed with the scale of things a lot.

And that's true, right? The reason why this is called a Softmax

function is because it's kind of close to a max function,

because when you exponentiate things, the big things get way bigger and

so they really dominate. And so this really sort of blows out

in the direction of a max function, but not fully. It's still a sort of a soft thing. So you might think that

that's a bad thing to do.

Doing things like this is the most

standard thing, underlying a lot of math, including the super

common logistic regression, which you'll see in other classes

as a way of doing things. So it's a good one to know, but people have certainly worked

on a whole bunch of other ways. And there are reasons that you might

think they're interesting, but I won't do them now. Yes? Yeah, so the question was,

when I'm dealing with the context words, am I paying attention to where they are or

just their identity? Yeah, where they are has nothing

to do with it in this model. It's just, what is the identity of

the word somewhere in the window? So there's just one

probability distribution and one representation of the context word.

Now you know, it's not that

that's necessarily a good idea. There are other models which absolutely

pay attention to position and distance. And for some purposes,

especially more syntactic purposes rather than semantic purposes,

that actually helps a lot. But if you're sort of more interested

in just sort of word meaning, it turns out that not paying attention to position actually tends to

help you rather than hurting you. Yeah. Yeah, so the question is how, wait

a minute, is there a unique solution here? Could there be different rotations

that would be equally good? And the answer is yes, there can be. I think we should put off discussing

this cuz actually there's a lot to say about optimization in neural networks,

and there's a lot of exciting new work.

And the one-sentence headline is

it's all good news: people spent years saying that local minima ought to be

a big problem, and it turns out they're not. It all works. But I think we're better off not talking

about that in any more detail. Okay, so yeah this is my picture of what the skip

gram model ends up looking like. It's a bit confusing and hard to read, but also I've got it drawn

from left to right. Right, so we have the center

word that's a one hot vector. We then have a matrix of

the representations of center words. So if we kind of do a multiplication

of this matrix by that vector, we just sort of actually select out the column of the matrix, which is then the representation

of the center word. Then what we do is we have a second matrix which stores the representations

of the context words. And so for each position in the context, I show three here because

that was confusing enough. We're going to multiply

the vector by this matrix which is the context word representations. And so

we will be picking out sort of the dot products of the center word

with each context word. And it's the same matrix for

each position, right? We only have one context word matrix.
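The matrix-times-one-hot selection and the dot products just described can be written out as a tiny forward pass for one window position. This is a sketch with toy sizes and random vectors (all the variable names here are mine, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 6, 4                       # toy vocabulary size and vector dimension
U = rng.normal(size=(V, d))       # context-word vectors, one row per word
W = rng.normal(size=(V, d))       # center-word vectors, one row per word

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

c, o = 2, 5                       # center word index, true context word index
v_c = W[c]                        # multiplying by a one-hot vector = selecting a row
scores = U @ v_c                  # dot product of the center word with every context word
p = softmax(scores)               # one probability distribution over the vocabulary
loss = -np.log(p[o])              # high loss if the true context word got low probability
```

The same `U` matrix is reused for every position in the window, matching the "one context word matrix" point above.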

And then these dot products, we're gonna soft max then turn

into a probability distribution. And so our model, as a generative model,

is predicting the probability of each word appearing in the context given

that a certain word is the center word. And so if we are actually using

it generatively, it would say, well, the word you should

be using is this one here. But if there is sort of actual ground

truth as to what was the context word, we can sort of say, well, the actual

ground truth was this word appeared.

And you gave a probability

estimate of 0.1 to that word. And so that's the basis of our loss: if you didn't do a great job at prediction, then there's going to be some loss, okay? But that's the picture of our model. Okay, and so what we wanna do is now learn parameters, these word vectors,

in such a way that we do as good a job at prediction

as we possibly can. And so standardly when we do these things,

what we do is we take all the parameters in our model

and put them into a big vector theta. And then we're gonna say we're gonna do

optimization to change those parameters so as to maximize the objective

function of our model. So what our parameters are is that for each word, we're going to have

a little d dimensional vector, when it's a center word and

when it's a context word. And so

we've got a vocabulary of some size. So we're gonna have a vector for

aardvark as a context word, a vector for art as a context word.

We're going to have a vector

of aardvark as a center word, a vector of art as a center word. So our vector in total is

gonna be of length 2dV. There's gonna be a big long vector that

has everything that was in what was shown in those matrices before. And that's what we're then gonna be optimizing. And so after the break, I'm going to go through concretely how

we do that optimization. But before the break, we have the intermission with

our special guest, Danqi Chen. >> Hi, everyone. I'm Danqi Chen, and

I'm the head TA of this class. So today I will start our first

research highlight session, and I will introduce you

a paper from Princeton. The title is A Simple but Tough-to-beat

Baseline for Sentence Embeddings. So today we are learning the word

vector representations, so we hope these vectors can

encode the word meanings.

But a central question in natural language processing, and also in this class, is how we could have vector

representations that encode the meaning of sentences like,

natural language processing is fun. So with these sentence representations,

we can compute the sentence similarity using

the inner product of the two vectors. So, for example, Mexico wishes to

guarantee citizens' safety, and, Mexico wishes to avoid more violence. So we can use the vector

representation to predict these two sentences are pretty similar. We can also use this sentence

representation as features to do some sentence

classification task. For example, sentiment analysis. So given a sentence like,

natural language processing is fun, we can put our classifier on top

of the vector representations and predict if sentiment is positive. Hopefully this is right, so. So there are a wide range of

methods that compose word vector representations into sentence

vector representations. So the simplest way is to use the bag-of-words. So the bag-of-words vector representation of "natural language processing" is just the average of the three single word vector representations, for natural, language, and processing.

Later in this quarter, we'll learn a bunch

of complex models, such as recurrent neural nets, recursive neural nets,

and the convolutional neural nets. But this paper from Princeton that I want to introduce today is a very simple unsupervised method. It is essentially just a weighted bag-of-words sentence representation, plus removing a special direction. I will explain this. So they have two steps.

So the first step is that just like how

we compute the average of the vector representations, they also do this,

but each word has a separate weight. Now here, a is a constant, and p(w) is the frequency of this word. So this basically means that the average representation down-weights the frequent words. That's the very simple Step 1. So for Step 2, after we compute

all of these sentence vector representations, we compute

the first principal component and also subtract the projections onto this first principal component. You might be familiar with this

if you have ever taken CS 229 and also learned PCA. So that's it. That's their approach. So in this paper,

they also give a probabilistic interpretation about why

they want to do this.
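The two steps just described might be sketched roughly like this. This is my own rough reading of the recipe, not the authors' code; the function and variable names are mine, and I assume you already have word vectors and unigram frequencies:

```python
import numpy as np

def sif_embeddings(sentences, vecs, freq, a=1e-3):
    """Step 1: weighted average, down-weighting frequent words.
    Step 2: subtract the projection onto the first principal component."""
    X = np.stack([
        np.mean([a / (a + freq[w]) * vecs[w] for w in sent], axis=0)
        for sent in sentences
    ])
    u = np.linalg.svd(X, full_matrices=False)[2][0]  # first principal direction
    return X - np.outer(X @ u, u)                    # remove that component
```

Each row of the result is one sentence embedding.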

So basically, the idea is that given the sentence representation, the probability of emitting a single word is related to the frequency of the word, and also to how close the word is to this sentence representation. And also there's a c0 term that means the common discourse vector, which is usually related to some syntax. So, finally, the results. So first, they ran experiments

on the sentence similarity and they show that this simple approach

is much better than the plain average of word vectors or the TF-IDF weighting, and also beats the performance of other sophisticated models.

And also for some supervised tasks

like sentence classification, they're also doing pretty well,

like the entailment and sentiment task. So that's it, thanks. >> Thank you. [LAUGH]

>> [APPLAUSE] >> Okay, okay, so we'll go on from there. All right, so now we're wanting to sort

of actually work through our model. So this is what we had, right? We had our objective function where we

wanna minimize negative log likelihood. And this is the form of the probability

distribution up there, where we have these sort of word vectors with both center

word vectors and context word vectors. And the idea is we want to change

our parameters, these vectors, so as to minimize the negative log likelihood

i.e., maximize the probability of what we predict. So if that's what we want to do, how can we work out how

to change our parameters? Gradient, yes,

we're gonna use the gradient.

So, what we're gonna have to do

at this point is to start to do some calculus to see how

we can change the numbers. So precisely, what we'll going

to want to do is to say, well, we have this term for

working out log probabilities. So, we have the log of the probability

of the word t plus j word t.

Well, what is the form of that? Well, we've got it right here. So, we have the log of v

maybe I can save a line. We've got this log of this. And then, what we're gonna want to do is

that we're going to want to change this so that we have minimized this objective. So, let's suppose we sort of

look at these center vectors. So, what we're gonna want to do is start

working out the partial derivatives of this with respect to the center

vector which is then, going to give us, how we can go about working out,

in which way to change this vector to minimize our objective function. Okay, so, we want to deal with this. So, what's the first thing we can

do with that to make it simpler? Subtraction, yeah.

So, this is a log of a division, so we can turn that into a subtraction of logs, and then we can do the partial derivatives separately. So, we have the derivative with respect to v_c of log exp(u_o^T v_c), and then we've got minus the derivative with respect to v_c of the log of the sum over w = 1 to V of exp(u_w^T v_c). And at that point,

we can separate it into two pieces, right, cuz when there's addition or

subtraction we can do them separately. So, we can do this piece 1 and

we can do the, work out the partial

derivatives of this piece 2. So, piece 1 looks kind of easy so,

let's start here. So, what's the first thing I

should do to make this simpler? Easy question. Cancel some things out; log and exp are inverses of each other, so they can just go away. So, for piece 1, we can say that this is going to be the partial derivative with respect to v_c of u_o^T v_c. Okay, that's looking kind of simpler, so what is the partial derivative of this with respect to v_c? u_o, so this just comes out as u_o.

Okay, and so, I mean, effectively, this is

the kind of level of calculus that you're gonna have to be able to do to be okay on

assignment one that's coming out today. So, it's nothing that life threatening,

hopefully, you've seen this before. But nevertheless, we are here using

calculus with vectors, right? So, vc here is not just a single number,

it's a whole vector. So, that's sort of the Math 51,

CME 100 kind of content. Now, if you want to,

you can pull it all apart. And you can work out

the partial derivative with respect to Vc, some index, k. And then, you could have this as the sum of l = 1 to d of (u0)l (Vc)l. And what will happen then is if you're

working out of with respect to only one index, then, all of these terms will go

away apart from the one where k equals l.

And you'll sort of end up with

that being the (u_o)_k term. And I mean, if things get confusing and

complicated, and if your brain is small like mine, it can

actually be useful to sort of go down to the level of working it out with real

numbers and actually have all the indices there and you can absolutely do that and

it comes out the same. But a lot of the time it's sort

of convenient if we can just stay at this vector level and

work out vector derivatives, okay. So, now, this was the easy part and we've got it right there and

we'll come back to that, okay. So then, the trickier part is we then,

go on to number 2. So now, if we just ignore the minus sign for a little bit, so, we'll subtract it afterwards, we've then got the partial derivatives with respect to vc of the log of the sum from w = 1 to v of the exp of uw^T vc,

okay. Well, how can we make

progress with this half? Yeah, so that's right. But what are you going to do that with? The chain rule, okay, so, our key tool

that we need to know how to use and we'll just use everywhere

is the chain rule, right? So, neural net people talk all

the time about backpropagation, it turns out that backpropagation

is nothing more than the chain rule with some efficient storage

of partial quantities so that you don't keep on calculating

the same quantity over and over again.
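As a tiny numeric illustration of that idea (my own toy example, not a real autodiff library): for f(z) = log sum exp(z), the forward pass can cache the softmax, which is exactly the quantity the backward pass needs, so nothing is recomputed.

```python
import numpy as np

def forward(z):
    """log-sum-exp; cache the softmax for reuse in the backward pass."""
    e = np.exp(z - z.max())
    p = e / e.sum()                  # the partial quantity we store
    return np.log(e.sum()) + z.max(), p

def backward(p):
    # chain rule: d/dz log(sum(exp(z))) = exp(z)/sum(exp(z)) = softmax(z)
    return p

z = np.array([0.5, -1.0, 2.0])
y, cache = forward(z)
grad = backward(cache)               # no recomputation of exp or the sum

# finite-difference estimate for comparison
eps = 1e-6
num = np.array([(forward(z + eps * np.eye(3)[i])[0] - y) / eps for i in range(3)])
```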

So, it's sort of like chain

rule with memoization, that is the backpropagation algorithm. So, now, our key tool is the chain rule, so

what is the chain rule? So, within saying, okay, well, what overall are we going to have

is some function where we're taking f(g(u)) of something. And so, we have this inside part z and so, what we're going to be doing is that

we're going to be taking the derivative of the outside part then,

with the value of the inside. And then, we're gonna be taking

the derivative of the inside part So for this here, so the outside part,

here's our F. And then here's our inside part Z. So the outside part is F,

which is a log function. And so the derivative of a log

function is the 1/x function.

So we're then gonna be having that this is 1 over the sum of w = 1 to V of exp(u_w^T v_c). And then we're going to be multiplying

it by, what do we get over there? So we get the partial derivative with respect to v_c of this inside part, the sum. And it's a little trickier; we really need to be careful with indices, so we're gonna get into a bad mess if we have w here, and we reuse w here.

We really need to change it into something else. So we're gonna have the sum of x = 1 to V, and then we've got exp(u_x^T v_c). So that's made a little bit of progress. We want to make a bit more progress here. So what's the next thing we're gonna do? Distribute the derivative. This is just adding some stuff, so we can do the same trick of doing each part of the derivative separately. So we get the sum of x = 1 to V of the partial derivative with respect to v_c of exp(u_x^T v_c). Okay, now we wanna keep

going. What can we do next? The chain rule again. This is also of the form where here's our f, and here's our inner value z, which is in turn sort of a function.

Yeah, so we can apply the chain rule a second time, and so we need the derivative of exp. What's the derivative of exp? Exp. So this part here is gonna be staying: the sum of x = 1 to V of the partial derivative. Hold on, no, not that one; moving that inside, it's still exp at its value, exp(u_x^T v_c), and then we're having the partial derivative with respect to v_c of u_x^T v_c. And then we've got a bit more progress to make. So we now need to work out what this is. So what's that? Right, so that's the same as sort of back over here.

At this point this is just going to be, that's coming out as u_x. And here we still have the sum of x = 1 to V of exp(u_x^T v_c). So at this point we kind of wanna

put this together with that. Cuz we're still, I stopped writing that, but we have this 1 over the sum of w = 1 to V of exp(u_w^T v_c). Can we put those things together

in a way that makes it prettier. So I can move this inside this sum. Cuz this is just the sort of number that's

a multiplier that's distributed through. And in particular when I do that,

I can start to sort of notice this interesting

thing that I'm going to be reconstructing a form that

looks very like this form. Sorry, leaving this part up aside. It looks very like the Softmax

form that I started off with.

And so I can then be saying that this is the sum from x = 1 to V of exp(u_x^T v_c), over the sum of w = 1 to V of exp(u_w^T v_c), and this is where it's important that I have x and w as different variables, times u_x. And so, well, at that point,

that's kind of interesting, cuz this is kind of exactly the form that I started off with for my softmax probability distribution. So what we're doing is, that part is then being the sum over x = 1 to V of the probability of x given c, times u_x. So that's what we're getting from the denominator. And then we still had the numerator. The numerator was u_o. What we have here is our

final form is u_o minus that. And if you look at this a bit

it's sort of a form that you always get from these

softmax style formulations.

So this first term is what we observed: the actual output context word that appeared. And this second term has the form of an expectation. So what we're doing right here is calculating an expectation: we're working out the probability of every possible word appearing in the context, and based on that probability we're taking that much of that u_x. So in some sense,

this is the expectation vector. It's the average over all

the possible context vectors, weighted by their

likelihood of occurrence. That's the form of our derivative. What we're going to want to be doing is changing the parameters in our model in such a way that these two become equal, cuz that's where the derivative is zero, and that's where we find the minimum of the objective we're minimizing.
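So the derivative we ended up with is u_o minus the expectation, the sum over x of p(x|c) times u_x. Here's a quick numeric sanity check of that formula against a finite-difference estimate, with made-up toy vectors (everything below is my own illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 5, 3
U = rng.normal(size=(V, d))           # context-word vectors u_x
v_c = rng.normal(size=d)              # center-word vector
o = 1                                 # observed context word

def neg_log_prob(v):
    s = U @ v
    s = s - s.max()                   # shift for numerical stability
    return -(s[o] - np.log(np.exp(s).sum()))

# analytic gradient: -(u_o - sum_x p(x|c) u_x)
s = U @ v_c
p = np.exp(s - s.max()); p /= p.sum()
grad = -(U[o] - p @ U)

# finite-difference estimate
eps = 1e-6
num = np.array([(neg_log_prob(v_c + eps * np.eye(d)[i]) - neg_log_prob(v_c)) / eps
                for i in range(d)])
```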

[INAUDIBLE] Okay and so that gives

us the derivatives in that model. Does that make sense? Yeah, there's gonna be a question. Anyway, so precisely doing things like this is what we'll expect you to do for assignment one. And I'll take the question, but

let me just mention one point. So in this case,

I've only done this for the v_c, the center vectors. We need to do this for every

parameter of the model. In this model, our only other

parameters are the context vectors. We're also gonna do it for those. It's very similar cuz if you look

at the form of the equation, there's a certain

symmetry between the two.

But we're gonna do it for that as well but

I'm not gonna do it here. That's left to you guys. Question. Yeah.

>> [INAUDIBLE] >> From here to here. Okay. So. So, right, so this is a sum right? And this is just the number

at the end of the day. So I can divide every term in

this sum through by that number. So that's what I'm doing. So now I've got my sum with every term

in that divided through by this number. And then I say, wait a minute,

the form of this piece here is precisely my softmax

probability distribution, where this is the probability

of x given C. And so then I'm just rewriting

it as probability of x given c. Where that is meaning,

I kind of did double duty here. But that's sort of meaning that you're

using this probability of x given c using this probability form. >> [INAUDIBLE] >> Yeah, the probability that x occurs as

a context word of center word c.

>> [INAUDIBLE]

>> Well, we've just assumed some

fixed window size M. So maybe our window size is five and so

we're considering sort of ten words, five to the left, five to the right. So that's a hyperparameter, and that's not something we learn. We're not dealing with that; we just assume that God's fixed that for us.
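Concretely, generating the training pairs for a fixed window size m might look like this sketch (the function and names are mine). Note that only the identity of each context word matters, not its position:

```python
def window_pairs(tokens, m=2):
    """Yield (center, context) pairs: every word within m positions of
    the center is a context word, regardless of exactly where it sits."""
    for i, center in enumerate(tokens):
        for j in range(max(0, i - m), min(len(tokens), i + m + 1)):
            if j != i:
                yield center, tokens[j]

pairs = list(window_pairs(["the", "quick", "brown", "fox"], m=1))
# [('the', 'quick'), ('quick', 'the'), ('quick', 'brown'),
#  ('brown', 'quick'), ('brown', 'fox'), ('fox', 'brown')]
```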

The point is, it's done at each position. So for any position, and all of them are treated equivalently, it's the probability that word x is the word that occurs within this window at that position, given that the center word was c. Yeah? >> [INAUDIBLE] >> All right, so the question is, why do we choose the dot

product as our basis for coming up with this probability measure? And you know I think the answer

is there's no necessary reason, that there are clearly other things that you could have done and might do. On the other hand,

I kind of think in terms of Vector Algebra it's sort of the most

obvious and simple thing to do. Because it's sort of a measure of

the relatedness and similarity.

I mean I sort of said loosely it was

a measure of similarity between vectors. Someone could have called me on that

because if you say, well, wait a minute. If you don't control for

the scale of the vectors, you can make that number as big

as you want, and that is true. So really the common measure of similarity

between vectors is the cosine measure, where you take a dot product in the numerator and then you divide through by the lengths of the vectors. So you've got scale invariance, and you can't just cheat by

making the vectors bigger. And so, that's a better measure of similarity. But to do that you have to

do a whole lot more math and it's not actually necessary here

because since you're sort of predicting every word

against every other word. If you sort of made one

vector very big to try and make some probability

of word k being large.

Well the consequence would be it would

make the probability of every other word be large as well. So you kind of can't cheat

by lengthening the vectors. And therefore you can get away with

just using the dot product as a kind of a similarity measure. Does that sort of satisfy? So yes. I mean, it's not necessary, right? And if we were going to argue,

you could sort of argue with me and say no look, this is crazy,

because by construction, this means the most likely word to appear

in the context of a word is itself.

That doesn't seem like a good result, [LAUGH] because presumably

different words occur. And you could then go from there and say

well no let's do something more complex. Why don't we put a matrix to mediate

between the two vectors to express what appears in the context of each other,

it turns out you don't need to. Now one thing of course is since we

have different representations for the context and center word vectors, it's

not necessarily true that the same word would be highest because there're

two different representations. But in practice they often have a lot

of similarity between themselves not really that that's the reason. It's more that it's sort

of works out pretty well. Because although it is true

that you're not likely to get exactly the same word in the context, you're actually very likely to get words

that are pretty similar in meaning. And are strongly associated and when

those words appear as the center word, you're likely to get your

first word as a context word. And so at a sort of a macro level,

you are actually getting this effect that the same

words are appearing on both sides.
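To illustrate the scale point numerically (my own toy example): a raw dot product can be inflated just by lengthening a vector, while cosine divides out the lengths.

```python
import numpy as np

def cosine(u, v):
    """Dot product divided by the vector lengths: scale-invariant similarity."""
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

u = np.array([1.0, 2.0])
v = np.array([2.0, 4.0])

dot_small = u @ v             # 10.0
dot_big = u @ (10 * v)        # 100.0: inflated just by lengthening v
cos_small = cosine(u, v)      # 1.0: u and v point the same way
cos_big = cosine(u, 10 * v)   # still 1.0: rescaling doesn't help
```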

More questions, yeah,

there are two of them. I don't know. Do I do the behind person first and

then the in front person? [LAUGH] So I haven't yet done gradient descent. And maybe I should do that in a minute and

I will see try then. Okay?

>> [INAUDIBLE] >> Yeah. >> So the truth is, well, we've just collected a huge amount of text. So for our word at any position, we know what the five words to the left are and the five words to the right, and

probability estimate to every word appearing in that context and we can say, well, actually the word

that appeared there was household. What probability did you give to that and

there's some answer. And so, that's our truth. Time is running out, so maybe I'd sort of just better say a little bit more before we finish, which is sort of starting in on this optimization. So this is giving us our derivatives,

we then want to use our derivatives to be able to

work out our word vectors.

And I mean, I'm gonna spend a super short amount of time on this; the hope is that through CS221, CS229, or a similar class, you've seen a little bit of optimization

and you've seen some gradient descent. And so, this is just a very quick review. So the idea is, once we have the gradient at a point x, if what we do is subtract off a little fraction of the gradient, that will move us downhill towards the minimum. And so if we then calculate the gradient

there again and subtract off a little fraction of it, we'll sort of

start walking down towards the minimum.

And so,

that's the algorithm of gradient descent. So once we have an objective function and

we have the derivatives of the objective function with respect to all of

the parameters, our gradient descent algorithm would be to say,

you've got some current parameter values. We've worked out the gradient

at that position. We subtract off a little

fraction of that, and that will give us new parameter values which we expect to give us a lower objective value, and we'll walk towards the minimum. And in general, that is true and

that will work. So then, to write that up as Python code, it's really sort of super simple: you just go in this while-true loop (you should have some stopping condition, actually), where you're evaluating the gradient given your objective function, your corpus, and your current parameters, so you have theta_grad; and then you're sort of subtracting a little fraction of theta_grad from the current parameters, and then you just repeat over.
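That loop, in the spirit of the slide's pseudocode, might look like this sketch (the function names and the fixed step count in place of a real stopping condition are my own simplifications):

```python
import numpy as np

def gradient_descent(evaluate_gradient, theta, corpus, alpha=0.01, n_steps=1000):
    """Batch gradient descent: repeatedly subtract a small fraction alpha
    of the gradient of the full objective from the parameters."""
    for _ in range(n_steps):      # a real version would test for convergence
        theta_grad = evaluate_gradient(theta, corpus)
        theta = theta - alpha * theta_grad
    return theta
```

For instance, plugging in the gradient of a simple quadratic objective walks theta to the minimum.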

And so the picture is, so the red lines

that are sort of the contour lines of the value of the objective function. And so what you do is when you

calculate the gradient, it's giving you the direction of the steepest descent and

you walk a little bit each time in that direction and you will hopefully

walk smoothly towards the minimum. Now the reason that might not work is

if you actually take a first step and you go from here to over there,

you've greatly overshot the minimum. So, it's important that alpha be small enough that you're still walking calmly down towards the minimum, and then it'll all work. And so, gradient descent is the most

basic tool to minimize functions. So it's conceptually the first thing to know. But then, in the sort of last minute, what I wanted to explain is actually, we might have 40 billion tokens

in our corpus to go through.

And if you have to work out

the gradient of your objective function relative to a 40 billion word corpus,

that's gonna take forever, so you'll wait for an hour before

you make your first gradient update. And so, you're not gonna be able to train your model in a realistic amount of time. So for basically all neural nets, naive batch gradient descent is a hopeless algorithm; you can't use that. It's not practical to use. So instead, what we do is use

stochastic gradient descent. So, the stochastic gradient descent or

SGD is our key tool. And so what that's meaning is, so

we just take one position in the text. So we have one center word and

the words around it, and we say, well, let's just, at that one position, work out the gradient with respect to all of our parameters.

the gradient in that position, we'll work a little bit in that direction. If you think about it for

doing something like word vector learning, this estimate of the gradient is

incredibly, incredibly noisy, because we've done it at one position which just

happens to have a few words around it.
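The update just described, one noisy step per position, might be sketched like this (my own toy framing; in real word2vec each "sample" would be one center word together with its window):

```python
import numpy as np

def sgd_epoch(evaluate_gradient_at, theta, samples, alpha=0.05, seed=0):
    """One pass of stochastic gradient descent: update after each sample,
    using the noisy gradient estimated from that single position."""
    rng = np.random.default_rng(seed)
    for i in rng.permutation(len(samples)):
        theta = theta - alpha * evaluate_gradient_at(theta, samples[i])
    return theta
```

Even though each per-sample gradient is noisy, repeated passes still drive the parameters toward a good solution.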

So the vast majority of the parameters

of our model, we didn't see at all. So, it's a kind of incredibly

noisy estimate of the gradient. Walking a little bit in that direction isn't even guaranteed to make you walk downhill,

because it's such a noisy estimate. But in practice, this works like a gem. And in fact, it works better. Again, it's a win, win. It's not only that doing things

this way is orders of magnitude faster than batch gradient descent, because you can do an update after you

look at every center word position.

It turns out that neural

network algorithms love noise. So the fact that this gradient descent,

the estimate of the gradient is noisy, actually helps SGD to work better

as an optimization algorithm and neural network learning. And so, this is what we're

always gonna use in practice. I have to stop there for today even

though the fire alarm didn't go off. Thanks a lot. >> [APPLAUSE].