Stanford CS224N: NLP with Deep Learning | Winter 2019 | Lecture 3 – Neural Networks


Okay. Hi everyone. Okay. Let's get started. Um- great to see you all here. Welcome back for um- week two of CS224N. Um- so- so this is a little preview of what's coming up in the class for this week and next week. Um- you know, this week is perhaps the worst week of this class. [LAUGHTER]. Um- so in week two of the class our hope is to actually kind of go through some of the nitty gritty of neural networks and how they're trained, and how we can learn good neural networks by backpropagation, which means in particular we're gonna be sort of talking about the training algorithms and doing calculus to work out gradients for improving them. Um, so we are looking a little bit at, um, word window classification and named entity recognition.

So there's a teeny bit of natural language processing in there, but basically, sort of week two is sort of, um- math of deep learning and neural network models and sort of really neural network fundamentals. Um, but the hope is that that will give you kind of a good understanding of how these things really work, and will give you all the information you need to do, um- the upcoming homework, and so then in week three we kind of flip.

So, then week three is going to be mainly about natural language processing, so we're then gonna talk about how to put syntactic structures over sentences, um- for building dependency parses of sentences, which is then actually what's used in homework three. So we're chugging along rapidly. And then we'll talk about this idea of the probability of a sentence, which leads into neural language models. Um- so on the homeworks. Homework one was due approximately two minutes ago, um- so I hope everyone has submitted their homework one. I mean as, um- one just sort of admonition, um- in general, so you know, homework one we hope you found was a good warm up and not too too hard, and so it'd really be best to get homework one in quickly rather than to burn lots of your late days doing homework one.

Um, and right now out on the website, um, there's homework two. Um so, we are chugging along. So homework two kind of corresponds to this week's lectures. So in the first part of that we are expecting you to grind through some math problems of working out gradient derivations. Um- and then the second part of that is then implementing your own version of word2vec making use of NumPy. And so this time you're sort of writing a Python program. It's no longer an IPython notebook. Um, I encourage you to get started early, um- and look at the materials, um- on the web. I mean, in particular corresponding to today's lecture there's, um- some quite good tutorial materials that are available on the website, and so I'd also encourage you to look at those.

[NOISE]. Um- more generally, just to make a couple more comments on things. I mean, I guess this is true of a lot of classes at Stanford but, you know, when we get the course reviews for this class we always get the full spectrum, from people who say the class is terrible and it's way too much work, um- to the people who say it's a really great class, one of their favorite classes at Stanford, obviously the instructors care, et cetera. And I mean, partly this reflects that we get this very, um- wide range of people coming to take this class. On the one hand, on the right hand margin perhaps we have the physics PhDs, and on the left hand margin we have some freshmen who think this will be fun to do anyway. Um, we welcome e- we welcome everybody, um- but in principle this is a, uh, graduate level class. You know, that doesn't mean we want to fail people out, we'd like everyone to succeed, but also, like a graduate level class, um- we'd like you to- you know, take some initiative in your success.

Meaning, if there are things that you need to know to do the assignments and you don't know them, um- then you should be taking some initiative to find some tutorials, come to office hours and talk to people and get any help you need and learn to sort of fill in any holes in your knowledge. Okay. So here's the plan for today. Um- so that was the course information update. So you know, is- this is sort of, in some sense, you know, a machine learning and neural nets intro, just to try and make sure everyone is up to speed on all of this stuff. So I'll talk a little bit about classification, um, introduce neural networks, um, a little detour into named entity recognition, then sort of show a model of doing, um, word window classification, and then in the end part we sort of then dive deeper into what kind of tools we need to learn neural networks. And so today, um, we're gonna go through, um, something between a review and a primer of matrix calculus, and then that will lead into next time's lecture, where it's talking more about backpropagation and computation graphs.

So, yeah. So this material, especially the part at the end, you know, for some people it'll seem really babyish if- it's the kind of stuff you do every week, um, for other people it um- might seem impossibly difficult, but hopefully for a large percentage of you in the middle this will be kind of a useful review of doing this kind of matrix calculus and the kind of things that we hope that you can do on homework two. Um, okay. So um, yeah. So sorry if I'm boring some people. If you sat through 229 last quarter you saw, um, what a classifier was like, and hopefully this will seem familiar, but I'm just sort of hoping to try and have everyone in week two sort of up to speed and on roughly the same page.

So here's our classification setup. So we assume we have a- training data set where we have these, um, vectors x_i, um, our x points, and then for each one of them we have a class. So the inputs might be words or sentences or documents or something; they're d-dimensional vectors. Um, the y_i are the labels or classes that we want to classify to, and we've got a set of C classes that we're trying to predict. And so those might be something like the topic of the document, the sentiment, positive or negative, um, of a document, or later we'll look a bit more at named entities.

Okay. So if we have that, um- the sort of intuition is we've got this vector space, which we again have a 2D picture of, and we have points in that vector space which correspond to our x items. And what we'd want to do is we'll look at the ones in our training sample and see which ones are green and red for our two classes here, and then we want to sort of learn a line that could divide between the green and the red ones as best as possible, and that learned line is our classifier.

So in traditional machine learning or statistics we have the sort of x_i vectors that are data items that are purely fixed, but we're going to then multiply those x_i by some estimated weight vector, and that estimated weight vector will then go into a classification decision. And the classifier that I'm showing here is a softmax classifier, which is almost identical, but not quite, to a logistic regression classifier, which you should've seen in CS 109 or a stats class or something like that, which is giving a probability of different classes.

Okay. And in particular if you've got a softmax classifier or a logistic- logistic regression classifier, these are what are called linear classifiers. So the decision boundary between two classes here is a line in some suitably high-dimensional space. So it's a plane or a hyperplane once you've got a bigger x vector. Okay. So here's our softmax classifier. Um, and there are sort of two parts to that. So in the- in the weight matrix W we have a row corresponding to each class, and then for that row we're sort of dot-producting it with our data point vector x_i, and that's giving us a kind of a score for how likely it is that the example belongs to that class. And then we're running that through a softmax function, and just as we saw in week one, the softmax takes a bunch of numbers and turns them into a probability distribution.
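A minimal sketch of the two parts just described, in plain Python: one dot-product score per row of W, then a softmax over the scores. The toy values for W and x and the helper names `class_scores` and `softmax` are illustrative assumptions, not something from the lecture slides.

```python
import math

def softmax(scores):
    """Turn a list of real-valued scores into a probability distribution."""
    m = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def class_scores(W, x):
    """One score per class: the dot product of each row of W with x."""
    return [sum(w_j * x_j for w_j, x_j in zip(row, x)) for row in W]

# Hypothetical toy values: C = 3 classes (rows), d = 2 input dimensions.
W = [[1.0, -1.0],
     [0.5, 0.5],
     [-1.0, 2.0]]
x = [2.0, 1.0]

scores = class_scores(W, x)
probs = softmax(scores)
# The predicted class is the one with the highest probability.
predicted = max(range(len(probs)), key=probs.__getitem__)
```

Because softmax is monotonic in the scores, the predicted class is also simply the row of W with the largest dot product, which is why the decision boundaries are linear.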

Does that make sense to people? People remember that from last week? Good so far? Okay. Um, I'm not gonna go through this in detail, but I mean, ah- essentially this is what logistic regression does as well. Um, the difference is that here in this setup we have a weight vector, um, for each class, whereas what the statisticians doing logistic regression do is they say, wait, that gives us one more weight vector than we really need.

We can get away- for C classes, we can get away with C minus one weight vectors. So in particular if you're doing binary logistic regression you only need one weight vector, whereas in this softmax regression formulation you've actually got two weight vectors, one for each class. Um, so there's that sort of a little difference there which we could get into, but it's basically the same. Let's just say we're either doing softmax or logistic regression, it doesn't matter. Um, so when we're training, what we want to do is we want to be able to predict, um, the correct class. And so the way we're gonna do that is we're gonna wanna train our model so it gives as high a probability as possible to the correct class, and therefore it'll give as low a probability as possible, um, to, um, the wrong classes. And so our criterion for doing that is we're going to create this negative log probability, um, of our assignments, and then we're gonna want to minimize the negative log probability, which corresponds to maximizing the log probability, which corresponds to maximizing, um, the probability.
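The remark that two classes need only one weight vector can be checked numerically. The sketch below, with made-up weights `w0`, `w1` and input `x`, compares a two-class softmax against a single logistic unit whose weight vector is the difference `w1 - w0`; it is an illustration of the equivalence, not a derivation from the lecture.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weight vectors for class 0 and class 1, and one input.
w0 = [0.2, -0.4]
w1 = [1.0, 0.3]
x = [0.5, 2.0]

# Two-class softmax: probability of class 1 using both weight vectors.
s0, s1 = dot(w0, x), dot(w1, x)
p1_softmax = math.exp(s1) / (math.exp(s0) + math.exp(s1))

# Binary logistic regression: a single weight vector, w1 - w0.
w_diff = [b - a for a, b in zip(w0, w1)]
p1_logistic = sigmoid(dot(w_diff, x))

# The training criterion from the text: negative log probability
# of the correct class (here assumed to be class 1).
nll = -math.log(p1_softmax)
```

Only the difference between the two score vectors matters, which is the sense in which the softmax formulation carries one more weight vector than it strictly needs.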

Um. And, but, um, sort of, pretty soon now, we're gonna start doing more stuff with deep learning frameworks, in particular PyTorch and you can discover in that, that there's actually a thing called NLL loss which stands for negative log-likelihood loss. Basically, no one uses it because the more convenient thing to use is what's called the cross entropy loss and so you'll hear everywhere that we're training with cross entropy loss.

So, I just wanted to briefly mention that and explain what's going on there. Um, so the concept of cross entropy comes from baby information theory, which is about the amount of information theory I know. Um, so, we're assuming that there's some true probability distribution p, and in our model we've built some probability distribution q. That's what we've built with our softmax regression, and we want to have a measure of whether our estimated probability distribution is a good one. And the way we do it in cross entropy is, we go through the classes and we say, "what's the probability of the class according to the true model?" Using that weighting, we then work out the log of, um, the probability according to our estimated model, and we sum those up and negate it, and that is our cross entropy measure. Okay. Um, but- so this in general gives you a measure of sort of information, um, between distributions.

But in our particular case, remember that for each example, we're sort of assuming that this is a piece of labeled training data, so we are saying for that example, the right answer is class seven. So therefore, our true distribution, our p, is- for this example, it's class seven with probability one and it's class, um, anything else with probability zero. So if you think about then what happens with this formula, you've got this summation over all the classes. The p of c is gonna be either one or zero, and it's gonna be one only for the true class here, and so what you're left with is, this is going to equal minus the log of q of c, um, for the true class, which is sort of what we were then computing in the previous slide. Okay. So that's- um, yeah. So that's basically where you'd get with cross entropy loss.
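A small sketch of that collapse, assuming a made-up model distribution `q` over three classes. The general cross entropy sums over all classes, but with a one-hot true distribution only the true class's term survives.

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum_c p(c) * log q(c); terms with p(c) == 0 contribute nothing."""
    return -sum(p_c * math.log(q_c) for p_c, q_c in zip(p, q) if p_c > 0)

# Hypothetical model output over 3 classes, and the labeled true class.
q = [0.1, 0.7, 0.2]
true_class = 1

# One-hot "true" distribution: probability one on the labeled class.
p_one_hot = [1.0 if c == true_class else 0.0 for c in range(len(q))]

loss_full = cross_entropy(p_one_hot, q)   # the full summation over classes
loss_short = -math.log(q[true_class])     # just minus log q of the true class
```

The same `cross_entropy` function also accepts a non-one-hot `p`, which is the more general case where the true distribution spreads probability over several classes.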

Um, but one other concept to mention. So when you have a full data-set of a whole bunch of examples, the cross entropy loss is then taking the per-example average. So, I guess it's what information theory people sometimes call the cross entropy rate. So additionally factored in there, if you are training on N examples, is that one-over-N factor that's coming in there.
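As a sketch of the per-example average, assuming three made-up predicted distributions and gold labels: the data-set loss is the sum of the per-example negative log probabilities times one over N.

```python
import math

def mean_cross_entropy(Q, labels):
    """Average of -log q_i(y_i) over the N examples in the data set."""
    N = len(labels)
    return sum(-math.log(q[y]) for q, y in zip(Q, labels)) / N

# Hypothetical predicted distributions for N = 3 examples, 3 classes each.
Q = [[0.7, 0.2, 0.1],
     [0.1, 0.8, 0.1],
     [0.3, 0.3, 0.4]]
labels = [0, 1, 2]  # gold class index for each example

dataset_loss = mean_cross_entropy(Q, labels)
```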

Okay. Um, okay. Um, so that's cross entropy loss. Is that okay? Yeah. [NOISE] There's some- there's some mixture of the actual labels in the ground? Sure. Good question. Right. So, the simplest case is that your gold data, someone has hand labeled it and, um, they've labeled one and the rest is zero. Um, they are- you can think of cases where that isn't the case. I mean, one case is you could believe that human beings sometimes don't know the right answer so if human beings said, "I'm not sure whether this should be class three or four," you could imagine that we can make training data where we put probability half on both of them, um, and that wouldn't be a crazy thing to do, and so then you'd have a true cross entropy loss using more of a distribution. Um, the case where it's much more commonly used in actual practice is, there are many circumstances in which people wanna do semi-supervised learning.

So, I guess this is a topic that both my group and Chris Re's group have worked on quite a lot, where we don't actually have fully labeled data, but we've got some means of guessing what the labels of the data are, and if we try and guess labels of data, well then quite often we'll say, "Here's this data item. It's two-thirds chance it's this label, but it could be these other four labels," and we'd use a probability distribution, and yeah, then it's the more general cross entropy loss. Okay? Um, right. So, um, that's cross entropy loss, pretty good with that. Um, this bottom bit is a little bit different, um, which is to say, "Well, now this is the sort of the full data-set." The other thing to notice, um, when we have a full data- we can have a full data-set of x's, um, and then we have a full set of weights. Um, where here we're working with a row, a row vector for the weights for one class, but we're gonna work it out for all classes.

So, we can sort of simplify what we're writing here and we can start using matrix notation and just work directly in terms of the matrix W. Okay. So for traditional ML optimization, our parameters are these sets of weights, um, for the different classes. So for each of the classes, we have a d-dimensional, um, row vector of weights because we're gonna sort of dot-product it with our d-dimensional input vector. So we have C times d items in our W matrix and those are the parameters of our model.

So if we want to learn that model using the ideas of gradient descent, stochastic gradient descent, we're gonna do sort of what we started to talk about last time. We have these set of parameters. We work out, um, the gradient, the partial derivatives of all of these, um, of the loss with respect to all of these parameters and we use that to get a gradient update on our loss function, and we move around the w's, and moving around the w's corresponds to sort of moving this line that separates between the classes and we fiddle that around so as to minimize our loss which corresponds to choosing a line that best separates between the items of the classes in some sense.
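One gradient step on a single example can be sketched as follows. The gradient formula used, (q_c minus 1 for the true class, q_c otherwise) times x for row c of W, is the standard result for softmax regression and is stated here without derivation; the learning rate, the zero-initialized W, and the toy example are assumptions for illustration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def loss(W, x, y):
    """Negative log probability of the correct class y."""
    q = softmax([dot(row, x) for row in W])
    return -math.log(q[y])

def sgd_step(W, x, y, lr):
    """Return W after one gradient step on the example (x, y).

    For class c: dLoss/dW[c] = (q_c - 1[c == y]) * x.
    """
    q = softmax([dot(row, x) for row in W])
    return [
        [w - lr * (q[c] - (1.0 if c == y else 0.0)) * x_j
         for w, x_j in zip(row, x)]
        for c, row in enumerate(W)
    ]

# Hypothetical setup: 2 classes, 2 input dimensions, weights start at zero.
W = [[0.0, 0.0], [0.0, 0.0]]
x, y = [1.0, 2.0], 0

loss_before = loss(W, x, y)       # log 2: both classes equally likely
W = sgd_step(W, x, y, lr=0.1)     # move the w's a little
loss_after = loss(W, x, y)        # smaller: the boundary moved the right way
```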

Okay. So, that's a basic classifier. So the first question is, well, how are things gonna be different with a neural network classifier? Um, so the central observation is that sort of most of the classic classifiers that people used a lot of the time, so that includes things like Naive Bayes models, um, basic support vector machines, softmax or logistic regressions, they're sort of fairly simple classifiers. In particular those are all linear classifiers, which are going to classify by drawing a line, or in the higher dimensional space by drawing some kind of plane that separates examples.

Having a simple classifier like that can be useful in certain circumstances. I mean, that gives you what in machine learning is called a high bias classifier, which there's lots of talk of in CS229, but if you have a data-set, um, that's like this, you can't do a very good job at classifying all the points correctly if you have a high bias classifier, because you're gonna only draw a line. So you'd like to have a more powerful classifier. Essentially, what's been powering a lot of the use of deep learning is that in a lot of cases when you have natural signals, so those are things like, um, speech, language, images, and things like that, you have a ton of data so you could learn a quite sophisticated classifier.

Um, but representing the classes in terms of the input data is sort of very complex. You could never do it by just drawing a line between the two classes. So, you'd like to use some more complicated kind of classifier. So neural networks, the multi-layer neural networks that we're gonna be starting to get into now, precisely what they do is provide you a way to learn very complex, you know, almost limitlessly complex classifiers. So that if you look at the decisions that they're making in terms of the original space, they can be learning cases like this. Um, I put this- I put the, um, pointer on a couple of the slides here. Um, this- this is a visualization that was done by Andrej Karpathy. He was a PhD student here until a couple of years ago.

So this is a little JavaScript, um, app that you can find off his website and it's actually a lot of fun to play with to see what kind of, um, decision boundaries you can get a neural net to come up with. Okay. Um, so for getting- for getting more advanced classification out of, um, a neural net used for natural language, there are sort of two things going- that you can do, that I want to talk about, which are in some sense the same thing when it comes down to it. But I'll sort of mention them separately at the beginning: one of them is that we have these word vectors, and then the second one is that we're gonna build deeper multi-layer networks. Okay. So, the first crucial difference, that, um, we already started to see, um, with what we were doing last week, is rather than sort of having a word being, this is the word house, we instead say house is a vector of real numbers, and what we can do is change the vector that corresponds to house in such a way as we can build better classifiers, which means that we are gonna be sort of moving house's representation around the space to capture things that we're interested in, like word similarity, analogies, and things like that.

So this is actually, you know, kind of a weird idea compared to conventional stats or ML. So rather than saying we just have the parameters W, we also say that all of these word representations are also parameters of our model. So, we're actually going to change the representations of words to allow our classifiers to do better. So, we're simultaneously changing the weights and we're changing the representation of words, and we're optimizing both of them at once to try and make our model as, um, good as possible. So, this is the sense in which people often talk about the deep learning models, that we're doing representation learning. I sort of said there are two ways, I was going to mention two things. One is this sort of, um, word vector representation learning, and then the second one is that we're going to start looking at deeper multi-layer neural networks. Um, sort of hidden over here on the slide is the observation that really you can think of word vector embedding as just having a model with one more neural network layer.

So, if you imagine that each word was a one-hot vector, um, for the different word types in your model. So, you had a, uh, you know, 150,000 dimensional vector with the one-hot encoding of different words. Um, then you could say you have a ma-, um, matrix L which is sort of your lexicon matrix, and you will pass your one-hot vector for a word through a layer of neural net which multiplies the one-hot vector by L, so L times the one-hot vector. And since this was a one-hot vector, what that will have the effect of doing is taking out a column of L. So, really, we've got an extra layer of matrix, um, in our neural net, and we're learning the parameters of that matrix in the same way as we're learning, um, a deep neural network for other purposes.
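A tiny sketch of that observation, with a made-up lexicon matrix `L` of 3-dimensional vectors for a 4-word vocabulary (one word per column): multiplying L by a one-hot vector gives exactly the column you would get by indexing directly.

```python
def matvec(M, v):
    """Multiply matrix M (list of rows) by column vector v."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

# Hypothetical lexicon matrix: 3-dimensional vectors, one column per word.
L = [[0.1, 0.2, 0.3, 0.4],
     [1.0, 2.0, 3.0, 4.0],
     [-1.0, -2.0, -3.0, -4.0]]
vocab_size = 4
word_id = 2

# The "extra layer" view: build a one-hot vector and multiply.
one_hot = [1.0 if i == word_id else 0.0 for i in range(vocab_size)]
via_matmul = matvec(L, one_hot)

# The lookup view: just read off column word_id.
via_lookup = [row[word_id] for row in L]
```

The multiply does vocabulary-size work per output dimension only to select one entry, while the lookup touches just that entry.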

So, mathematically that completely makes sense and that's sort of a sensible way to think about, um, what you're doing, um, with word embeddings in neural networks. Um, implementation wise, this makes no sense at all and no one does this, because it just doesn't make sense to do a matrix multiply when the result of the matrix multiply will be, okay, this is word ID 17. Um, sort of, then constructing a one-hot vector of length 150,000 with a one in position 17 and then doing a matrix multiply makes no sense. You just take out, um, column, or, or, the row, as we've discussed, 17 of your matrix, and that's what everyone actually does. Okay. Here's my one obligatory picture of neurons, um, for the class. So, don't miss it, I'm not going to show it again, all class. Okay.

So, the origins [LAUGHTER] of neural networks, um, were in some sense to try and construct an artificial neuron that seemed to in some sense kind of capture the kind of computations, um, that go on in human brains. It's a very loose analogy for what was produced but, you know, our model here is, these are our, this is a teeny part of our human brains.

So, here are neurons, this is a neuron cell here and so, what does a neuron consist of. Um, so, up the back, it's got these dendrites, lots of dendrites. Then it's got a cell body and if there's stuff coming in on the dendrites, um, the cell body will become active and then it all starts spiking down this long thing which is called the Axon. So, then these axons lead to the dendrites of a different cell or lots of different cells, right. This one, um, I'm not sure it's shown but some of these are kind of going to different cells. Um, and so, you then have these sort of, um, terminal buttons on the Axon which are kind of close to the dendrites but have a little gap in them and some min-, miracles of biochemistry happen there.

So, that's the synapse, of course, which you'll then have sort of activation flowing which goes into the next neuron. So, that was the starting off, um, model that people wanted to try and simulate in computation. So, people came up with this model of an artificial neuron. So, that we have things coming in from other neurons at some level of activations. So, that's a number X0, X1, X2. Um, then synapses vary depending on how excitable they are as to how easily they'll let signal cross across the synapse. So, that's being modeled by multiplying them by a weight W0, W1, W2.

Then the cell body, sort of correctly, is sort of summing this amount of excitation it's getting from the different dendrites, um, and then it can have its own biases to how likely it is to fire, that's the B. Um, so, we get that and then it has some overall kind of threshold or propensity for firing. So, we sort of stick it through an activation function, um, which will sort of, will determine a firing rate and that will be, um, the signal that's going out on the output axon. So, that was sort of the starting point of that but, you know, really, um, for what we've ended up computing.

We just have a little bit of baby math here which actually, um, looks very familiar to the kind of baby math you see in linear algebra and statistics, and so it's really no different. So, in particular, um, a neuron can very easily be a binary logistic regression unit. Um, so that, this is sort of, for logistic regression you're taking your input x, you multiply it by a weight vector. You're adding, um, your, um, bias term, and then you're putting it through, um, a non-linearity, like the logistic function. Um, and then, so you're calculating a logistic regression, um, inside this sort of neuron model. Um, and so this is the, this is the difference between the softmax and logistic regression that I was mentioning, that the softmax for two classes has two sets of parameters.

This sort of just has one set of parameters Z, and you're modeling the two classes by giving the probability of one class from zero to one, depending on whether the input to logistic regression is highly negative or highly positive. Okay. So, really, we can just say these artificial neurons are sort of like binary logistic regression units, or we can make variants of binary logistic regression units by using some different f function. And we'll come back to that again pretty soon. Okay. Um, well, so that gives us one neuron. So, one neuron is a logistic regression unit for current purposes. So, crucially what we're wanting to do with neural networks is say, well, why only run one logistic regression, why don't we, um, run a whole bunch of logistic regressions at the same time? So, you know, here are our inputs and here's our little logistic regression unit, um, but we could run three logistic regressions at the same time, or we can run any number of them.
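A minimal sketch of one neuron as a binary logistic regression unit, and of running three of them on the same inputs. All weights, biases, and inputs are made-up values; the `f` argument stands for the activation function, with the logistic function as the default.

```python
import math

def sigmoid(z):
    """The logistic function: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(w, b, x, f=sigmoid):
    """Weighted sum of the inputs plus a bias, passed through activation f."""
    return f(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)

# Hypothetical inputs x0, x1, x2 coming in from other neurons.
x = [1.0, -2.0, 0.5]

# One logistic regression unit: 0.5 - 0.5 - 0.5 + 0.5 = 0, so it outputs 0.5.
single = neuron([0.5, 0.25, -1.0], 0.5, x)

# Three logistic regressions run at the same time on the same inputs.
units = [
    ([0.5, 0.25, -1.0], 0.5),
    ([2.0, 0.0, 1.0], -1.0),
    ([-1.0, 1.0, 0.0], 0.0),
]
outputs = [neuron(w, b, x) for w, b in units]
```

Swapping in a different `f` gives the "variants of binary logistic regression units" mentioned above.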

Um, well, that's good, but sort of for conventional training of a statistical model, we'd sort of have to determine for those orange outputs of the logistic regressions, you know, what we're training each of them to try and capture. We'd have to have data to predict what they're going to try and capture. And so, the secret of sort of then building bigger neural networks is to say, we don't actually want to decide ahead of time what those little orange logistic regressions are trying to capture. We want the neural network to self-organize, so that those orange logistic regression, um, units learn something useful.

And well, what is something useful? Well, our idea is to say, we do actually have some tasks that we want to do. So, we- we have some tasks that we want to do. So maybe, we want to sort of decide whether a movie review is positive or negative, something like sentiment analysis or something like that. There is something we want to do at the end of the day. Um, and we're gonna have, uh, a logistic regression classifier there telling us positive or negative. Um, but the inputs to that aren't going to directly be something like words in the document. They're going to be this intermediate layer of logistic regression units, and we're gonna train this whole thing to minimize our cross entropy loss. Essentially, what we're going to want to have happen, and what the backpropagation algorithm will do for us, is to say, you things in the middle, it's your job to find some useful way to calculate values from the underlying data such that it'll help our final classifier make a good decision.

I mean in particular, you know, back to this picture, you know. The final classifier, it's just a linear classifier, a softmax or logistic regression. It's gonna have a line like this. But the intermediate classifiers, they are like a word embedding, they can kind of sort of re-represent the space and shift things around. So, they can learn to shift things around in such a way as you're learning a highly non-linear function of the original input space.

Okay. Um, and so at that point, it's simply a matter of saying, well, why stop there? Maybe it gets even better if we put in more layers. And this sort of gets us into the area of deep learning, and sort of precisely, um, this is, um, that sort of there've- sort of been three comings of neural networks. So the first was work in the 50s, which is essentially when people had a model of a single neuron like this and then only gradually worked out how it related to more conventional statistics. Then there was, um, the second version of neural networks, which we saw in the 80s and early 90s, um, where people, um, built neural networks like this that had this one hidden layer where a representation could be learned in the middle.

But at that time it really wasn't effective at all. People weren't able to build deeper networks and get them to do anything useful. So you sort of had these neural networks with one hidden layer, and so precisely with the research that started into deep learning, precisely the motivating question is, um, we believe we'll be able to do even more sophisticated, um, classification for more complex tasks, things like speech recognition and image recognition, if we could have a deeper network which will be able to more effectively learn more sophisticated functions of the input, which will allow us to do things like recognize sounds of a language. How could we possibly train such a, um, network so it'll work effectively? And that's the kind of thing, um, we'll go on to, um, more so starting this lecture, more so in the next lecture.

But before we get to there, um, just to underline it again. So once we have something like this as our, um, layer of a neural network, we have a vector of inputs, we have a vector of outputs, and everything is connected so that we've got this sort of weights along every one of these black lines. And so we can say A1 is, you're taking weights times each component of x and adding a bias term, um, which is sort of this part, and then running it through our non-linearity, and that will give us an output. And we're gonna do that for each of A1, A2, and A3. Um, so again, we can kind of regard A as a vector and we can kind of collapse it into this matrix notation for working out the effects of layers. The fully connected layers are effectively matrices of weights, um, and we commonly write them like this, where we have a bias term as a vector of bias terms. There's sort of a choice there. You can either have an always-on input, and then the bias terms become part of the weights of a slightly bigger matrix with one extra, uh, one extra either column or row.

One extra, a- row, right? Or you can just sort of have them separately within those b's. Okay. Um, and then the final note here- right? So once we've calculated this part, we always put things through a non-linearity, which is referred to as the activation function, and so something like the logistic transform I showed earlier is an activation function. And this is written as sort of vector input, um, activation function giving a vector output, and what this always means is that we apply this function element-wise.
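A sketch of a fully connected layer in this matrix form, z = Wx + b followed by an element-wise activation, together with the always-on-input alternative for the bias. The toy W, b, and x are assumptions; with the W-times-column-vector convention used in this sketch, folding in the bias adds one extra column to W (under the transposed convention it would be an extra row).

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def layer(W, b, x, f=sigmoid):
    """Compute a = f(Wx + b), applying f to each element of z separately."""
    z = [dot(row, x) + b_i for row, b_i in zip(W, b)]
    return [f(z_i) for z_i in z]

# Hypothetical layer: 2 inputs, 3 output units.
W = [[0.5, -1.0],
     [1.5, 0.25],
     [-0.75, 2.0]]
b = [0.1, -0.2, 0.3]
x = [2.0, 1.0]

a = layer(W, b, x)  # separate bias vector

# Alternative: an always-on input of 1.0, with the biases folded into W.
W_aug = [row + [b_i] for row, b_i in zip(W, b)]
x_aug = x + [1.0]
a_aug = layer(W_aug, [0.0, 0.0, 0.0], x_aug)
```

Both forms give the same activations; which one to use is a bookkeeping choice.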

So we're applying the logistic function which is sort of a naturally a one input one output function like the little graph I showed before. So when we apply that to a vector, we apply it to each element of the vector element-wise. Okay. We will come back very soon to sort of saying more about non-linearities and what non-linearities people actually use. Um, but, you know, something you might be wondering is well, why does he always have these non-linearities and say there has to be an f function there? Why don't we just, um, calculate Z equals WX plus B in one layer and then go on to another layer that also does Z2 equals W2, Z1 plus B and keep on going with layers like that? And there's a very precise reason for that which is if you want to have a neural network learn anything interesting, you have to stick in some function F which is a non-linear function such as the logistic curve I showed before.

And the reason for that is that if you're sort of doing linear transforms like WX plus B and then W2 Z1 plus B, W3 Z2 plus B, and you're doing a sequence of linear transforms, well, multiple linear transforms just compose to become a linear transform, right? So one linear transform is rotating and stretching the space somehow, and you can rotate and stretch the space again, but the result of that is just one bigger rotate and stretch of the space. So you don't get any extra power for a classifier by simply having multiple linear transforms. But as soon as you stick in almost any kind of non-linearity, then you get additional power. And so you know in general, what we're doing when we're doing deep networks, um, in the middle of them we're not thinking, "Ah, it's really important to have non-linearity thinking about probabilities or something like that." Our general picture is, well, we want to be able to do effective function approximation or curve fitting.
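The collapse of stacked linear layers can be checked directly. In this sketch, with made-up matrices and biases, two layers without a non-linearity, W2(W1 x + b1) + b2, give the same result as a single layer with W = W2 W1 and b = W2 b1 + b2.

```python
def matvec(M, v):
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def matmul(A, B):
    """Matrix product of A and B (both lists of rows)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

# Hypothetical two-layer network with NO activation function in between.
W1, b1 = [[1.0, 2.0], [0.0, 1.0]], [1.0, -1.0]
W2, b2 = [[2.0, 0.0], [1.0, 1.0]], [0.0, 3.0]
x = [3.0, 4.0]

# Layer by layer.
z1 = add(matvec(W1, x), b1)
z2 = add(matvec(W2, z1), b2)

# The equivalent single linear layer.
W = matmul(W2, W1)
b = add(matvec(W2, b1), b2)
z_single = add(matvec(W, x), b)
```

Inserting a non-linear f between the two layers breaks this identity, which is where the additional power comes from.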

We'd like to learn a space like this and we can only do that if we're sort of putting in some non-linearities which allow us to learn these kind of curvy decision, um, patterns. And so- so F is used effectively for doing accurate [NOISE] fu- function approximation or sort of pattern matching as you go along. Okay. You are behind already. Um, okay. So that was the intro to baby neural networks. All good? Any questions? Yes? Yeah, like er, feature one and feature four if- if you multiply it together it's highly indicative of like the label Y, can you get to that product relationship to just say [NOISE] couple of layers that are linear? Um, yes.

Good question. So, in conventional stats, you have your basic input features and when people are building something like a logistic regression model by hand, people often say well, something that's really important for classification is looking at the pair of feature four and feature seven. Um, that you know, if both of those are true at the same time something i-important happens and so that's referred to normally in stats as an interaction term, and you can by hand a-add interaction terms to your model. So, essentially a large part of the secret here is having these intermediate layers. They can learn to build interaction terms by themselves. Yeah, so it's sort of, um, automating the search for higher-order terms that you wanna put into your model. Okay. I'll go on, other questions? Okay. Um, so um, yeah. So here's a brief little interlude on a teeny bit more of NLP which is sort of a kind of problem we're gonna look at for a moment.

So this is the task of named entity recognition that I very briefly mentioned last time. So, um, if we have some text, wait, it isn't appearing here. Okay. Uh, okay. If we have some text, something that in all sorts of places people want to do is I'd like to find the names of things that are mentioned. Um and then normally, as well as finding the names of things, you'd actually like to classify them, so you'd like to say some of them are organizations, some of them are people, um, some of them are places. And so you know this has lots of uses, you know, people like to track mentions of companies and people in newspapers and things like that. Um, when people do question-answering, a lot of the time the answers to questions are what we call named entities: the names of people, locations, organizations, pop songs, movie names, all of those kind of things are named entities. Um, and if you want to sort of start building up a knowledge base automatically from a lot of text, well, what you normally wanna do is get out the named entities and get out relations between them.

So this is a common task. So, how can we go about doing that? And a common way of doing that is to say well, we're going to go through the words one at a time and they're gonna be words that are in a context just like they were for word2vec, and what we're gonna do is run a classifier and we're going to assign them a class. So we're gonna say first word is organization, second word is organization, third word isn't a named entity, fourth word is a person, fifth word is a person and continue down. So we're running a classification of a word within a position in the text so it's got surrounding words around it. Um and so to say what the entities are, many entities are multi-word terms and so the simplest thing you can imagine doing is just say we'll take the sequence that are all classified the same and call that the e-entity Shen Guofang or something like that.
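
The "simplest thing" described above, taking each run of consecutive words with the same class as one entity, can be sketched as follows. The sentence, the tag names (`PER`, `LOC`, `O` for "not an entity"), and the function name `group_entities` are all hypothetical and chosen only for illustration.

```python
from itertools import groupby

# Hypothetical per-word classifier output for a made-up sentence.
words = ["Franz", "Kafka", "visited", "Prague", "Castle", "yesterday"]
tags = ["PER", "PER", "O", "LOC", "LOC", "O"]

def group_entities(words, tags, outside="O"):
    """Merge each run of consecutive words sharing a non-outside tag into one entity."""
    entities = []
    for tag, run in groupby(zip(words, tags), key=lambda pair: pair[1]):
        if tag != outside:
            entities.append((" ".join(word for word, _ in run), tag))
    return entities

entities = group_entities(words, tags)
```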

There's a reason why that's slightly defective and so what people often use is that BIO encoding, um, that I show on the right but I'm just gonna run ahead and not do that now. Um so, it might seem at first that named entity recognition is trivial because you know, you have company names, Google and Facebook are company names. And whenever you see Google or Facebook you just say company and how could you be wrong? But in practice, there's a lot of subtlety and it's easy to be wrong in named entity recognition. So this is sort of just some of the hard cases. So it's often hard to work out the boundaries of an entity. So in this sentence, First National Bank don-donates two vans to Future School of Fort Smith. So, there's presumably the name of a bank there but is it National Bank and the first is just the first word of a sentence which is cap-capitalized like first she ordered some food or something.

So kind of unclear what it is. Sometimes it's hard to know whether something's an entity at all. So at the end of this sentence is Future School the name of some exciting kind of 21st-century school or is it just meaning it's a future school that's gonna be built in this town, right? Is it an entity or not at all? Working out the class of an entity is often difficult, so in "to find out more about Zig Ziglar and read features by", what class is Zig Ziglar? Kinda hard to tell if you don't know. Um, it's actually a person's name, um, and there are various entities that are ambiguous, right? So Charles Schwab in text is 90% of the time an organization name because there's Charles Schwab Brokerage.

Um, but in this particular sentence here, in Woodside where Larry Ellison and Charles Schwab can live discreetly among wooded estates, that is then a reference to Charles Schwab the person. So there's sort of a fair bit of understanding variously that's needed to get it right. Okay. Um, so what are we gonna do with that? And so this suggests, um, what we wanna do is build classifiers for language that work inside a context. Um, so you know, in general, it's not very interesting classifying a word outside a context we don't actually do that much in NLP.

Um, but once you're in a context, um, then it's interesting to do and named entity recognition is one case, there are lots of other places that comes up. I mean, here's a slightly cool one, that there are some words that can mean themselves and their opposite at the same time, right? So to sanction something can either mean to allow something or it can mean to punish people who do things, or to seed something can either mean to plant seeds in things, like you're seeding the soil, or it can mean to take seeds out of something like a watermelon, right? You just need to know the context as to which it is. Okay. So, that suggests the task that we can classify a word in its context of neighboring words, and NER is an example of that. And the question is how might we do that? And a very simple way to do it might be to say, well, we have a bunch of words in a row which each have a word vector from something like word2vec.

Um, maybe we could just average those word vectors and then classify the resulting vector. The problem is that doesn't work very well because you lose position information. You don't actually know anymore which of those word vectors is the one that you're meant to be classifying. So, a simple way to do better than that is to say, "Well, why don't we make a big vector of a word window?" So, here are words and they each have a word vector, and so to classify the middle word in the context of here plus or minus two words, we're simply going to concatenate these five vectors together and say now we have a bigger vector and let's build a classifier over that vector.
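
The window concatenation just described can be sketched as below. The two-dimensional word vectors are made-up numbers standing in for real word2vec vectors, and `window_vector` is a hypothetical helper name; padding for words near the start or end of a sentence is left out of this sketch.

```python
# Made-up 2-dimensional word vectors; real ones would come from something like word2vec.
word_vectors = {
    "museums": [0.1, 0.2],
    "in": [0.3, -0.1],
    "Paris": [0.9, 0.8],
    "are": [-0.2, 0.0],
    "amazing": [0.4, 0.5],
}

def window_vector(tokens, center, radius, vectors):
    """Concatenate the vectors of the center word and `radius` words on each side."""
    window = tokens[center - radius : center + radius + 1]
    concatenated = []
    for token in window:
        concatenated.extend(vectors[token])
    return concatenated

tokens = ["museums", "in", "Paris", "are", "amazing"]
# Classify the middle word "Paris" using a window of plus or minus two words.
x_window = window_vector(tokens, center=2, radius=2, vectors=word_vectors)
```

Because the vectors are concatenated rather than averaged, the center word always occupies the same slice of `x_window`, so position information is preserved.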

So, we're classifying this x window which is then a vector of dimension, ah, 5d if we're using d-dimensional word vectors. We can do that um in the kind of way that we did previously which is, um, that we could say, "Okay, for that big vector we're going to learn w weights and we're put- gonna put it through a softmax classifier, and then we're going to do the decisions." Um, that's a perfectly good way to do things and, um, for the purpose of it.

What I want to get to in the last part of this is to start looking at my, um, matrix calculus. And you know we could use this model and do a classifier and learn the weights of it and indeed, um, for the handout on the website that we suggest you look at it does do it with a softmax classifier of precisely this kind. Um, but for the example I do in class I try to make it a bit simpler. Um, and I've wanted to do this I think very quickly because I'm fast running out of time. So, one of the famous early papers of neural NLP, um, was this paper by Collobert and Weston which was first an ICML paper in 2008 which actually just a couple of weeks ago, um, won the ICML 2018 test of time award. Um, and then there's a more recent journal version of it from 2011. And um, they use this idea of window classification to assign classes like named entities, ti- to words in context, um, but they did it in a slightly different way. So, what they said is, well, we've got these windows and this is one with the, um, location named entity in the middle and this is one without a location entity in the middle.

So, what we want to do is have a system that returns a score, and it should return a high score just as a real number in this case and it should return a low score if it- if there isn't, ah, a location name in the middle of the window in this case. So, explicitly the model just returns the score. So, if you had the top level of your neural network a, and you just then dot producted it with a vector u, you then kind of with that final dot product, you just return a real number. They use that as the basis of their classifier. So in full glory, what you had is you had this window of words, you looked up a word vector for each word, you then, um, multiplied that the, the- well you concatenated the word vectors for the window.

You multiplied them by a matrix and added a bias to get a second hidden layer which is a and then you multiply that by a final vector and that gave you a score for the window and you wanted the score to be large if it was the location and small if it wasn't a location. So, in this sort of pretend example where we have four dimensional word vectors, um, that's meaning you know for the window, this is a 20 x 1 vector. Um, for calculating the next hidden layer we've got an 8 by 20 matrix plus the bias vector. Then, we've got this sort of 8-dimensional second hidden layer and then we are computing a final real number.
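
The shapes in this pretend example (a 20-dimensional window vector, an 8 by 20 matrix, an 8-dimensional hidden layer, one real-valued score) can be traced in a short forward pass. This is only a sketch: the weights are random placeholders rather than learned values, and the logistic function is assumed as the non-linearity.

```python
import math
import random

random.seed(0)
d, window_size, hidden = 4, 5, 8   # 4-dim word vectors, 5-word window, 8 hidden units

def logistic(v):
    return 1.0 / (1.0 + math.exp(-v))

# Random placeholders standing in for the concatenated word vectors and learned parameters.
x = [random.uniform(-1, 1) for _ in range(d * window_size)]                            # 20 inputs
W = [[random.uniform(-1, 1) for _ in range(d * window_size)] for _ in range(hidden)]  # 8 x 20
b = [random.uniform(-1, 1) for _ in range(hidden)]                                     # 8
u = [random.uniform(-1, 1) for _ in range(hidden)]                                     # 8

z = [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]  # z = W x + b
a = [logistic(zi) for zi in z]                                            # a = f(z), element-wise
s = sum(ui * ai for ui, ai in zip(u, a))                                  # s = u^T a, one real number
```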

Okay. Um, and so crucially this is an example of what the question was about. Um, we've put in this extra layer here, right? We could have just said here's a word vector, a big word vector of, of context. Let's just stick a softmax or logistic classification on top to say yes or no for location. But by putting in that extra hidden layer, precisely this extra hidden layer can calculate non-linear interactions between the input word vectors. So, it can calculate things like if the first word is a word like museum and the second was a word like the preposition in or around then that's a very good signal that this should be, ah, a location in the middle position of the window. So, extra layers of a neural network let us calculate these kind of interaction terms between our basic features.

Okay. Um, so there's a few more slides here that sort of go through the details of their model, but I'm gonna just skip those for now because I'm a little bit behind. And at the end of it we've just got this score. So this is our model which is the one that I just outlined where we're calculating the score and we're wanting a big score, um, for location. And so, what we're gonna want to do is consider, um, how we can use this model, um, to learn, um, our parameters in a neural network. Um, so in particular, remember it's the same story we've had before. We had a loss function J, and we're wanting to work out, um, the gradient with respect to our current theta parameters of the loss function.

Then, we want to sort of subtract a little multiple of that, um, given by the learning rate from our current parameters to get updated parameters, and if we repeatedly do that, then with stochastic gradient descent we'll have better and better parameters which give higher probability to the things that we're actually observing in our training data. So, the thing we want to know is, well, in general how can we do this um, differentiation and work out the gradient of our loss function? And so, I sort of wanted to, for the remaining time in this lecture, um, go through how we can do that by hand, um, using math and then that'll lead into sort of discussing more generally the backpropagation algorithm, um, for the next one.
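
The update step described above (new parameters equal old parameters minus the learning rate times the gradient) can be sketched on a toy problem. The quadratic loss here is a made-up stand-in for the real loss J, the names `sgd_step`, `loss`, and `grad` are hypothetical, and the gradient is computed exactly rather than estimated from a sample of training data, so the "stochastic" part is omitted.

```python
def sgd_step(theta, gradient, learning_rate):
    """One update: theta_new = theta_old - learning_rate * gradient."""
    return [t - learning_rate * g for t, g in zip(theta, gradient)]

# Toy loss J(theta) = sum of theta_i squared, whose gradient is 2 * theta.
def loss(theta):
    return sum(t * t for t in theta)

def grad(theta):
    return [2.0 * t for t in theta]

theta = [1.0, -2.0, 0.5]
loss_before = loss(theta)
for _ in range(10):
    theta = sgd_step(theta, grad(theta), learning_rate=0.1)
loss_after = loss(theta)
```

Each step here shrinks every parameter by a factor of 0.8, so the loss falls steadily; with a real network the only change is that `grad` comes from the derivatives worked out in the rest of the lecture.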

Okay. So, if we're doing um, gradients by hand well we're doing multi-variable calculus, multi-variable derivatives. But in particular normally the most useful way to think about this is as doing matrix calculus which means we're directly working with vectors and matrices to work out our gradients, and that that's normally sort of much faster and more convenient for summarizing our neural network layers than trying to do it in a non vectorized way. But that doesn't mean that's the only way to do it. If you're sort of confused about what's going on, sometimes thinking it through in the non vectorized way can be a better way to understand what's going on and, um, make more progress. So, like when, um, last time I did the word2vec um derivatives when I was writing too small on that board, sorry, um, that was doing it in a non vectorized way of working out the weights, talking about them individually.

Um, but here we're going to do it with, um, vectors and matrices. And again, look for the lecture notes to cover this material in more detail. In particular, so that no one misses it. Um, let me just clarify what I mean by lecture notes. So, if you look at the course syllabus on the left-hand column, um, there's the slides that you can download and, straight under the slides, it says lecture notes. That's what I'm meaning by the lecture notes. In the- in the middle column it then has some readings and actually there are some diffe- additional things there that cover similar material. Um, so those might be helpful as well.

But first the thing that's closest to what I'm about to present, it's the lecture notes that appear immediately under the slides link. Okay. Um, so my hope here, um, my hope here is the following: Um, if you can't remember how to do single variable calculus, sorry you're basically sunk and might as well leave now. Um, [LAUGHTER] I'm assuming you know how to do single-variable calculus and I'm assuming you know what a um a vector and a matrix is. Um, but you know, um, I sort of hope that even if you never did multi-variable calculus or you can't remember any of it, it's sort of for what we have to do here, not that hard and you can do it. So, here's what, um, what you do. Um, all right. So, if we have a simple function f of x equals x cubed, right. Its gradient, um, and so the gradient is the slope, right? Saying how steep or shallow is the slope of something, and then also the direction of slope when we go into multiple dimensions.

Um, its gradient is just its derivative. So, its derivative is 3x squared. Um, so if you're at the point x equals 3, then you know, the slope is sort of this 27, um, which is very steep. Okay. So well, what if we have a function with one output but now it has many inputs? Um, so that we're sort of doing that sort of, um, function that was like the dot products where we're doing the sort of the u transpose v or w transpose x, um, to calculate a value. Well, then what we're gonna calculate is a gradient which is a vector of partial derivatives with respect to each input. So, you take, um, the slope of the function as you change x1, the slope of the function as you change x2 through the slope of the, ah, function as you change xn and each of these you can just calculate as if you were doing single variable calculus and you just put them all in a vector and that's then giving you the gradient, and then the gradient in multi-dimensional, um, spaces is then giving you the direction and slope of a sort of a surface that touches your multi-dimensional, um, f function.
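
Written out in standard notation, the two cases just described look like this (the symbols f, x, and n follow the lecture; the layout is only a sketch):

```latex
% One input, one output: the gradient is the ordinary derivative.
f(x) = x^3
\quad\Longrightarrow\quad
\frac{df}{dx} = 3x^2,
\qquad
\left.\frac{df}{dx}\right|_{x=3} = 27

% n inputs, one output: the gradient is a vector of partial derivatives.
f(\mathbf{x}) = f(x_1, x_2, \ldots, x_n)
\quad\Longrightarrow\quad
\frac{\partial f}{\partial \mathbf{x}}
= \left[ \frac{\partial f}{\partial x_1},\;
         \frac{\partial f}{\partial x_2},\;
         \ldots,\;
         \frac{\partial f}{\partial x_n} \right]
```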

Okay. So that's getting a bit scarier, but it gets a little bit scarier than that because if we have a neural network layer, um, we then have a function which will have n inputs, which are the input neurons, and it will have m outputs. So if that's the case, um, you then have a matrix of partial derivatives which is referred to as the Jacobian. So in the Jacobian, um, you're sort of taking these partial derivatives, um, with respect to each, um, output along the rows and with respect to each input down the columns. And so you're getting these m by n partial derivatives, considering every combination of an output and an input.
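
For a function with n inputs and m outputs, the Jacobian described above can be written as follows (a sketch in standard notation, with one row per output and one column per input):

```latex
\mathbf{f}(\mathbf{x}) =
\left[ f_1(x_1, \ldots, x_n),\; \ldots,\; f_m(x_1, \ldots, x_n) \right]

\frac{\partial \mathbf{f}}{\partial \mathbf{x}} =
\begin{bmatrix}
\dfrac{\partial f_1}{\partial x_1} & \cdots & \dfrac{\partial f_1}{\partial x_n} \\
\vdots & \ddots & \vdots \\
\dfrac{\partial f_m}{\partial x_1} & \cdots & \dfrac{\partial f_m}{\partial x_n}
\end{bmatrix}
\in \mathbb{R}^{m \times n},
\qquad
\left( \frac{\partial \mathbf{f}}{\partial \mathbf{x}} \right)_{ij}
= \frac{\partial f_i}{\partial x_j}
```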

Um, but again, you can fill in every cell of this matrix just by doing single-variable calculus provided you don't get yourself confused. Okay. Um, then we already saw when we were doing word2vec, that sort of a central tool that we have to use to work out, um, to work out, um, our derivatives of something like a neural network model is we have a sequence of functions that we run one after another. So, um, in a neural network you're sort of running a sequence of functions one after another. So we have to use, um, the chain rule to work out derivatives when we compose functions. So if we have one variable functions, so we have, um, z equals 3y and y equals x squared. If we want to work out, um, the derivative of z with respect to x, we say, aha, that's a composition of two functions. So I use the chain rule.

And so that means what I do is I multiply, um, the derivative. So I take, um, dz/dy. So that's 2x, um, wait, [NOISE] Sorry, I said that wrong, right? Is my example wrong? Oh yeah, its right, dz/dy. So yeah, dz/dy is just three. That's, right, that's the derivative of the top line, and then dy/dx is 2x.

And I multiply those together and I get the answer, um, that the derivative of z with respect to x is 6x. Okay. Um, this bit then gets a little bit freakier, but it's true. If you have lots of variables at once, you simply multiply the Jacobians and you get the right answer. So if we're now imagining our neural net, well sort of, this is our typical neural net right? So we're doing the neural net layer where we have our weight matrix multiplied by the input vector plus, um, the bias, and then we're putting it through a non-linearity.
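
The one-variable example and its multi-variable counterpart for the layer just described can be written side by side (a sketch; W, x, b, z, h, and f are the lecture's symbols for the weight matrix, input, bias, pre-activation, hidden layer, and non-linearity):

```latex
% One variable: z = 3y, y = x^2
\frac{dz}{dx} = \frac{dz}{dy}\,\frac{dy}{dx} = 3 \cdot 2x = 6x

% Many variables: z = Wx + b, h = f(z). The same rule, with Jacobians multiplied.
\frac{\partial \mathbf{h}}{\partial \mathbf{x}}
= \frac{\partial \mathbf{h}}{\partial \mathbf{z}}\,
  \frac{\partial \mathbf{z}}{\partial \mathbf{x}}
```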

And then if we want to know what's the partials of h with respect to x, we just say, huh, it's a function composition. So this is easy to do. We work out our first Jacobian, which is the partials of h with respect to z, and then we just multiply it by the partials of z with respect to x, and we get the right answer. Um, easy. Um, so here's sort of um an example Jacobian which is a special case that comes up a lot. Um, so it's just good to realize this one which we'll see with our neural net. So well one of the things that we have are these element-wise activation functions. So we have h equals f of z. So, um, what is the, um, partial derivative of h with respect to z. Um, well the thing- remember that we sort of apply this element-wise. So we're actually saying hi equals f of zi. So, you know, formally this function has n inputs and n outputs, so its partial derivatives are going to be an n by n Jacobian.

But if we think about what's happening there, um, what we're actually going to find is, sort of, when we're working out the terms of this so we're working out, how does f of zi change as you change zj? Well, if j is not equal to i, it's gonna make no difference at all, right? So if my f function is something like putting it through the logistic function or anything else, like absolute valuing a number, it's gonna make no difference for the calculation of f of zi if I change zj because it's just not in the equation. And so, therefore, the only terms that are actually going to occur and be non-zero are the terms where i equals j.

So for working out these partial derivatives if i does not equal j, um, it's zero. If i does equal j, then we have to work out a single-variable derivative: what's the derivative, um, of the, um, activation function? Um, and so this is what a, um, Jacobian looks like for an activation function. It's a diagonal matrix. Everything else is zero, and for this activation function, we work out its derivative, and then we calculate that for the different, um, zi values. Okay. Um, so that's the, um, Jacobian for an activation function. What are the other main cases, uh, that we need for a neural network? And these are gone through a little bit more slowly in the same lecture notes.
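
The diagonal structure of the activation-function Jacobian can be confirmed numerically. The sketch below assumes the logistic function as f and estimates each entry by central finite differences; the input values and the helper name `numerical_jacobian` are made up for this illustration.

```python
import math

def logistic(v):
    return 1.0 / (1.0 + math.exp(-v))

def logistic_prime(v):
    """Derivative of the logistic function: f(v) * (1 - f(v))."""
    s = logistic(v)
    return s * (1.0 - s)

def apply_elementwise(z):
    """h = f(z), with f applied to each element separately."""
    return [logistic(v) for v in z]

def numerical_jacobian(func, z, eps=1e-6):
    """J[i][j] approximates d func(z)[i] / d z[j] using central differences."""
    n_out = len(func(z))
    J = [[0.0] * len(z) for _ in range(n_out)]
    for j in range(len(z)):
        z_plus = list(z)
        z_minus = list(z)
        z_plus[j] += eps
        z_minus[j] -= eps
        f_plus, f_minus = func(z_plus), func(z_minus)
        for i in range(n_out):
            J[i][j] = (f_plus[i] - f_minus[i]) / (2 * eps)
    return J

z = [0.5, -1.0, 2.0]
J = numerical_jacobian(apply_elementwise, z)
expected_diagonal = [logistic_prime(v) for v in z]
```

Every off-diagonal entry of `J` comes out as zero, and the diagonal matches `expected_diagonal`, which is the single-variable derivative evaluated at each zi.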

But they're kind of similar to what we saw in the very first class. So if we are wanting to work out the partial derivatives of wx plus b with respect to x, um, what we get is w. Um, and if we want to work out the partial derivative of wx plus b with respect to b, um, that means that we get an identity matrix because b is sort of like a 1b, right? It's this almost always on vector, so you're just getting the ones coming out to preserve the b.

Um, this was the case, um, that we saw, um, when we were doing the word vectors. That if you have a vector dot product of u and h and you say, what's the partial derivatives of that with respect to u, then you get out h transpose. Um, if you haven't seen those before, um, look at the lecture notes handouts, um, and see if you can compute them and they make sense at home, um, but for the moment we're gonna believe those and use those to see how we can then work out derivatives inside the neural network. Okay. So here's the same neural network we saw before. So we have a window of words, we're looking at word vectors, we're putting it through a hidden layer, and then we're just doing a vector, um, vector dot product to get this final score.
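
Collected in one place, the Jacobians asserted above are (a sketch in standard notation, using the lecture's W, x, b, u, and h):

```latex
\frac{\partial}{\partial \mathbf{x}} \left( W\mathbf{x} + \mathbf{b} \right) = W
\qquad
\frac{\partial}{\partial \mathbf{b}} \left( W\mathbf{x} + \mathbf{b} \right) = I
\qquad
\frac{\partial}{\partial \mathbf{u}} \left( \mathbf{u}^{\top} \mathbf{h} \right) = \mathbf{h}^{\top}
```

By the symmetry of the dot product, differentiating the same expression with respect to h instead gives u transpose.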

And so, what we [NOISE] want to do to be able to train our neural network, is we want to find out how- how s changes depending on all the parameters of the model. The x, the w, the b, the u. Um, and so we want to work out partial derivatives of S with respect to each of those because we can then work out okay if you move b up, um, the score gets better, which is good if it's actually a plus in the middle, and therefore we'll want to nudge up, um, elements of b appropriately.

Okay, um, and so I'm just doing the gradient with respect to the score here and I skipped over those couple of slides. Um, so if you're just, sort of, staring at this picture and say, well, how do I work out the partial derivative of s with respect to b? Um, probably it doesn't look obvious. So the first thing here that you want to do is sort of break up the eq- equations into simple pieces that compose together, right? So you have the input x, and then that goes into z equals wx plus b, and then you compose that with the next thing. So h equals f of z, our activation function, and then this h goes into the next thing of s equals uTh.

So we've got this sequence of functions. And pretty much you want to break things up as much as you can. I mean, I could have broken this up even further. I could have said z1 equals wx, z equals z1 plus b. Um, it turns out um, that if you've just got things added and subtracted, you can sort of do that in one step because the pathways sort of separate when doing the derivatives, but sort of anything else that composes together you want to pull it out into pieces. Okay. So now our neural net is doing a sequence of function compositions.

And then we say, okay, we know how to do that, the chain rule. So if you wanna work out the partials of s with respect to b, it's just going to be the product of the derivatives of each step along the way. So it's gonna be um the partial of s with respect to h times h with respect to z times z with respect to b and that will give us the right answer. So then all we have to do is actually compute that. Um, so, I think this just sort of shows okay we're taking the partials of each step of that composition.

Okay. So now we want to compute that. And so this is where I'm going to sort of use the Jacobians that I sort of asserted without much proof on the preceding slide. Okay. So first of all um we have ds/dh. Well, that's just the dot product of two vectors, s equals u transpose h. So the um, the Jacobian for that with respect to h is just u transpose. Okay, that's a start.

Then we have um h equals f of z. Well, that's the activation function. So the um Jacobian of that is this diagonal matrix made of the element wise um derivative of the function f. And then we have the partial of z with respect to b and that's the bit that comes out as the identity matrix. And so that's then giving us our calculation of the partial of s with respect to b. And so we can see that the- the identity matrix sort of goes away so we end up with this composition of u transpose times f prime of z, taken element-wise. Okay, suppose we then want to go on and compute now the partial of s with respect to w? Well, the starting off point is exactly the same chain rule, that we work out each of the stages. So, that first of all you're working out the z from the wx part then putting it through the non linearity, then doing the dot product of the vectors. So that part is the same. And what you should notice is that if you compare the partial of s with respect to w versus s with respect to b, most of them are the same and it's only the part at the end that's different.
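
For s = u^T f(Wx + b), multiplying the three Jacobians gives the partial of s with respect to b as u transpose times f prime of z, element by element. The sketch below checks that formula against finite differences; the logistic function is assumed as f, and all the numbers are made-up toy values.

```python
import math

def logistic(v):
    return 1.0 / (1.0 + math.exp(-v))

def logistic_prime(v):
    s = logistic(v)
    return s * (1.0 - s)

def score(W, b, u, x):
    """s = u^T f(W x + b) with f the logistic function."""
    z = [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]
    return sum(ui * logistic(zi) for ui, zi in zip(u, z))

# Toy parameters: 3 inputs, 2 hidden units.
W = [[0.2, -0.4, 0.1], [0.7, 0.3, -0.5]]
b = [0.1, -0.2]
u = [1.5, -0.8]
x = [1.0, 2.0, -1.0]

# Analytic gradient: ds/db_k = u_k * f'(z_k).
z = [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]
analytic_grad_b = [ui * logistic_prime(zi) for ui, zi in zip(u, z)]

# Numerical gradient by central differences on each element of b.
eps = 1e-6
numeric_grad_b = []
for k in range(len(b)):
    b_plus = list(b)
    b_minus = list(b)
    b_plus[k] += eps
    b_minus[k] -= eps
    numeric_grad_b.append((score(W, b_plus, u, x) - score(W, b_minus, u, x)) / (2 * eps))
```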

And that sort of makes sense in terms of our neural net right? That when we had our neural net that the w and the b were coming in here. And once you've sort of done some stuff with them you're putting things through the same activation function and doing the same dot product to create a score. So, you're sort of doing the same calculations that you're then composing with. So it sort of makes sense that you should be getting the same derivatives that are occur- same partial derivatives that are occurring at that point.

Oops. And so effectively you know these partial dev- derivatives correspond to the computations in the neural network that are above where w and b are. And so those are commonly referred to as delta, note delta which is different from partial derivative d. And so delta is referred to as the error signal in neural network talk. So, it's what you're calculating as the partial derivatives above the parameters that you are working out the partial derivatives with respect to. So, a lot of the secret as we'll see next time, a lot of the secret of what happens with backpropagation is just we want to do efficient computation in the sort of way that computer science people like to do efficient computation. And so precisely what we want to notice is that there is one error signal that comes from above and we want to compute it once. And then reuse that when calculating both partial derivatives with respect to w and with b.
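
The shared part of the two chain-rule products can be marked explicitly. This is a sketch in standard notation, where the symbol for element-wise multiplication is an assumption of this note and s = u^T h, h = f(z), z = Wx + b as in the lecture:

```latex
\frac{\partial s}{\partial \mathbf{b}}
= \underbrace{\frac{\partial s}{\partial \mathbf{h}}\,
              \frac{\partial \mathbf{h}}{\partial \mathbf{z}}}_{\boldsymbol{\delta}}\,
  \frac{\partial \mathbf{z}}{\partial \mathbf{b}},
\qquad
\frac{\partial s}{\partial W}
= \underbrace{\frac{\partial s}{\partial \mathbf{h}}\,
              \frac{\partial \mathbf{h}}{\partial \mathbf{z}}}_{\boldsymbol{\delta}}\,
  \frac{\partial \mathbf{z}}{\partial W},
\qquad
\boldsymbol{\delta} = \mathbf{u}^{\top} \circ f'(\mathbf{z})
```

Computing delta once and reusing it for both gradients is the saving the paragraph above refers to.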

Okay. So there's sort of two things to still do. So one is well, it'd be kind of useful to know what the partial derivative of s with respect to w actually looks like. I mean, is that a number, a vector, a matrix, a three-dimensional tensor? And then we actually want to work out its values and to work out its values we're going to still have to work out the partial derivative of z with respect to w. But if first of all we just try and work out its shape, what kind of shape does it have? And this is actually sort of a bit tricky and is sort of a dirty underbelly of doing this kind of matrix calculus. So, since our weight matrix is an n by m matrix, the end result of the partial of s with respect to w is we have a function with n times m inputs, all of the elements of w, and simply one output which is our score. So, that makes it sound like according to what I said before we should have a one by n times m Jacobian. But it turns out that's not really what we want, right? Because what we wanted to do is use what we calculate inside this stochastic gradient descent update algorithm.

And if we're doing this, we'd sort of like to have the old weight matrix and we'd like to subtract a bit from it to get a new weight matrix. So, it'd be kind of nice if the shape of our Jacobian was the same shape as w. And so, in general, what you always want to do with neural nets is follow what we call the shape convention, which is we're going to sort of represent the Jacobian so it's in the same shape as the inputs.

And this whole thing is kind of the- the bad part of doing matrix calculus. Like there's a lot of inconsistency as to how people represent matrix calculus. That in general if you just go to different fields like economics and physics some people use a numerator convention. Some people use a denominator convention. We're using neither of those. We're going to use this shape convention so we match the shape of the input so it makes it easy to do our weight updates. Okay. So. Right. So that's what we want the answer to look like. So, then the final thing we need to do to work out the partial of s with respect to w is we have the error signal delta that's gonna be part of the answer and then we want to work out the partial of z with respect to w. Well, um what's that going to be.

Well, it turns out and I'm about to be saved by the bell here since I'm down to two minutes left. Um, it turns out that what we end up with for that is we take the product of the partial- the product of delta times x. So effectively we've got the local error signal above w. And then we have the inputs x and we are working out an outer product of them. And the sort of way to think about this is sort of for the w's. You know, we've got the elements of the w matrix, these different connections between our neurons. And so each one of these is connecting one output to one input. And so we're going to be sort of making this n by m matrix of our partial derivatives that are going to be the product of the error signal for the appropriate output multiplied by the input, and those give us the partial derivatives.
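
The outer-product result can also be checked numerically: under the shape convention, entry (i, j) of the gradient with respect to W is the error signal for output i times input j. The sketch below again assumes the logistic function as f and uses made-up toy values.

```python
import math

def logistic(v):
    return 1.0 / (1.0 + math.exp(-v))

def logistic_prime(v):
    s = logistic(v)
    return s * (1.0 - s)

def score(W, b, u, x):
    """s = u^T f(W x + b) with f the logistic function."""
    z = [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]
    return sum(ui * logistic(zi) for ui, zi in zip(u, z))

# Toy parameters: W is 2 x 3 (2 outputs, 3 inputs).
W = [[0.2, -0.4, 0.1], [0.7, 0.3, -0.5]]
b = [0.1, -0.2]
u = [1.5, -0.8]
x = [1.0, 2.0, -1.0]

# Error signal delta_i = u_i * f'(z_i), then the outer product delta_i * x_j.
z = [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]
delta = [ui * logistic_prime(zi) for ui, zi in zip(u, z)]
grad_W = [[d_i * x_j for x_j in x] for d_i in delta]   # same shape as W

# Numerical check: perturb each weight in turn.
eps = 1e-6
numeric_grad_W = [[0.0] * len(x) for _ in W]
for i in range(len(W)):
    for j in range(len(x)):
        W_plus = [row[:] for row in W]
        W_minus = [row[:] for row in W]
        W_plus[i][j] += eps
        W_minus[i][j] -= eps
        numeric_grad_W[i][j] = (score(W_plus, b, u, x) - score(W_minus, b, u, x)) / (2 * eps)
```

Because `grad_W` has the same shape as `W`, it can be subtracted from the weights directly in a gradient descent update.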

I'm skipping ahead quickly in my last one minute. Okay. So uh, right. So this is sort of what I said about using the shape con- convention. I'm going to skip that. Okay. So, um, I- I ran out of time a teeny bit at the end but I mean, I think hopefully that's conveyed most of the idea of how you can sort of use the chain rule and work out the derivatives and work them out in terms of these vector and matrix derivatives. [NOISE] And essentially what we wanna do for backpropagation is to say how can we, ah, get a computer to do this automatically for us and to do it efficiently.

And that's sort of what the deep learning frameworks like TensorFlow and PyTorch do, and how you can do that we'll look at more next time.
