12.1: What is word2vec? – Programming with Text

Hello, welcome to a new session from... I don't know, is it the machine learning course? Is it the Programming with Text course? I don't know. I'm just the person who's here. This session, which will be a whole bunch of videos, is about a topic: word2vec. I'm ringing the bell way too much.

First of all, I want to mention something very important. I've known about word2vec, and I've used it in projects for a little while, but I don't think I ever really understood it. I don't even know that I really understand it now, but I definitely improved my understanding of it vastly after reading this amazing tutorial by Allison Parrish. It's posted as a gist on GitHub, and it's a Python notebook: "Understanding word vectors" by Allison Parrish. Honestly, if I'm being truthful, you should just stop this video right now and read that instead. Some people seem to like to listen to me prattle on, which is fine; you can keep watching if you so choose. At the very least, read it afterwards.

This tutorial is released under the Creative Commons BY 4.0 license, and the code itself is under the Creative Commons 0 license, so you can reuse this material, which is what I'm doing right now. I don't usually do this. I mean, all our stuff is always based on other people's stuff, but in this first video I'm really going to talk through what's in this tutorial in my own words. If you do the same, please reference it with attribution according to the license.

I also want to mention that Allison Parrish has a wonderful talk. It's on YouTube, I will link to it, and it's called "Experimental Creative Writing with the Vectorized Word," from the Strange Loop conference. I encourage you to take a look at that as inspiration and background for what I want to show you.

My end goal with this tutorial is to get to the point where I have a p5 JavaScript sketch in the browser where I can do stuff with word2vec. The point of this video that you're watching right
now (I'm taking a very long time to start) is just to answer the question: what is word2vec? By the end of the series, I want to use word2vec in projects to make weird stuff happen with text on a web page. All right, how are you feeling?

Let me come over here for a second, because I've written "word2vec" up here and that's going to help me. Word2vec is a machine learning process, similar to other things I've looked at, like classification (is this image a cat or a dog?) or regression (can you predict the price of this house based on certain properties of that house?). These are classic machine learning examples. Word2vec is a particular machine learning model that produces something called a word embedding. That's a very fancy term. What it means is that any given word, like "apple," can be associated with numbers: a vector. We can somehow come up with a numeric, mathematical essence of this word as an array of numbers, like 0.7, 1.2, negative 0.345, et cetera, and there's going to be some amount of numbers in there.

This seems like a crazy thing. Why would I ever want to have a word associated with an array of numbers? Well, one of the things you can do with arrays of numbers is math: linear algebra, multiplying, subtracting, averaging, adding. We know we can do that with arrays of numbers, and this is the kind of thing that happens in lots of my other tutorials on programming graphics, pixel processing, and machine learning. But one thing we wouldn't know how to do is say, you know, apple plus orange... I was trying to come up with a good example. This is what happens when you don't plan these tutorials in advance; I'll come up with an example on the fly. Apple plus purple: would this equal plum? Maybe, right? In other words, I'm trying to come up with some
pseudo-math: let's take these two words and add them together. Cat plus cute, maybe that equals kitten. We're not talking about concatenating "apple" and "purple"; we're saying apple plus purple. Could I take the mathematical essence of these words, add them together, and get a new word? The idea here, the argument that I'm making to you, is that word2vec is a mechanism by which you can do stuff like this right there in your code.

If I could quantify the word "apple" as a series of numbers, and I could quantify the word "purple" as a series of numbers, then couldn't I just add all those numbers together? I would get a new series of numbers, and then I could look for the word whose set of numbers is closest to this new set. How would I find the similarity? I could calculate a similarity score between any two sets of numbers, and find the word that is most similar to this plus this. Maybe it would be plum. Why would it be plum? Is that magic, or is it because of what data the word2vec model is trained on? Yes, it's the latter, and I want to get to all of that. This is my zoomed-out view of why we're doing this.

Let's come over and look at what Allison has in her tutorial, which is a really nice example. Imagine a really simple case. I was saying over here that each word gets a list of maybe a hundred numbers, maybe three hundred, maybe a thousand. That's up to us to decide based on what we're trying to do. But what if we simplify that? Here's Allison's example, where each word gets just two numbers, and those numbers are properties of that word: a cuteness score from 0 to 100 and a size from 0 to 100. So you could say kitten is (95, 15) and hamster is (80, 8). There are these numbers, so the label
is tied to a set of data points, data properties. If that's the case, then we could graph all of those and say something like, "Oh, a horse and a dolphin are kind of similar in terms of size and cuteness." But we could also do an actual mathematical analysis: what is the Euclidean distance between them? Euclidean distance means the number of pixels, or units, between these two words on the graph. These are very similar because they're physically close to each other.

This is also a nice demonstration of why we talk about these as vectors. I have a whole set of tutorials about vectors as describing points in space. For example, a velocity vector: if I have a particle in a particle system and I want it to go from here to here, this is its velocity, its change in location. In essence, this is what I'm doing with an operation like apple plus purple. Apple is over here, and then I add purple to it: I move by purple's numbers, and over here I now find plum. When we look at this in two dimensions, our brains can understand it. Two dimensions is like the easiest dimension. I mean, I find two dimensions easier than one dimension; one dimension's weird sometimes.

What Allison is showing here is that by moving from one word to another word physically in space, we can establish this idea of word relationships: chicken is to kitten as tarantula is to hamster. Now, this is all very arbitrary, with hard-coded word vectors, but it's just for demonstration purposes, and in two dimensions so that our brains can process it. Ultimately, if we somehow have a lot more information about all of these words in higher-dimensional space, in vectors that have a hundred dimensions, a hundred
numbers, we can't visualize that so easily. There are interesting techniques called dimensionality reduction that would let us draw word clusters and so on, and maybe I'll get to that later. What I'm trying to say here is that we can establish sophisticated, complex relationships between words in higher-dimensional space, but in order to do that it's useful to look at a simple example that ties words to numbers in a low-dimensional space that we can visualize or at least put into our brains.

So I've described to you what word2vec is, what the model looks like when it's complete. I haven't looked at the training process at all. The animals example is hard-coded. Next, I'm going to do a port of one of Allison's examples of colors associated with numbers: the word "red" is (255, 0, 0). That's a word mapped to a vector, and it's going to come from a data set. The third thing I'm going to do is look at what is traditionally thought of as word2vec: these large, higher-dimensional dictionaries of words and their associated vectors, those word embeddings. That's going to be the journey here. I don't know how many videos it's going to be: 3, 4, 5, 471, something like that. At some point I'll also try to do some projects with that.

In the next video I'm going to do a port of Allison's project, which you can find, with all the code in Python, in the tutorial that's linked in the description, and I'm going to do a JavaScript port of it. I'll see you there, maybe, maybe not. Go read that page, it's excellent. Okay, goodbye.
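
For readers who want to try the two-dimensional animal example described in the video, here is a minimal Python sketch of the three ideas it covers: Euclidean distance between word vectors, finding the closest word to a point, and the "chicken is to kitten as tarantula is to hamster" relationship via vector addition and subtraction. It is not the code from Allison Parrish's notebook. Only the kitten and hamster values are quoted in the video; the other numbers, and the helper names `distance`, `add`, `subtract`, and `closest`, are assumptions made up for illustration.

```python
import math

# Hand-made 2-D "word vectors": (cuteness, size), each on a 0-100 scale.
# Only kitten and hamster use values quoted in the video; the other
# entries are illustrative placeholders, not real data.
ANIMALS = {
    "kitten": (95, 15),
    "hamster": (80, 8),
    "tarantula": (8, 3),
    "chicken": (25, 15),
    "horse": (50, 50),
    "dolphin": (60, 45),
    "elephant": (65, 90),
}


def distance(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def add(a, b):
    """Element-wise sum of two vectors."""
    return tuple(x + y for x, y in zip(a, b))


def subtract(a, b):
    """Element-wise difference a - b."""
    return tuple(x - y for x, y in zip(a, b))


def closest(vec, vectors, exclude=()):
    """Return the word whose vector is nearest to vec, skipping excluded words."""
    candidates = [word for word in vectors if word not in exclude]
    return min(candidates, key=lambda word: distance(vectors[word], vec))


# Which animal is most similar to a horse (other than the horse itself)?
neighbor = closest(ANIMALS["horse"], ANIMALS, exclude={"horse"})

# "chicken is to kitten as tarantula is to ?"
# Take the step from chicken to kitten, then apply it to tarantula.
offset = subtract(ANIMALS["kitten"], ANIMALS["chicken"])
target = add(ANIMALS["tarantula"], offset)
analogy = closest(target, ANIMALS, exclude={"tarantula"})

print("closest to horse:", neighbor)
print("chicken -> kitten, applied to tarantula:", analogy)
```

A real word2vec model works the same way in spirit, except that the vectors have hundreds of dimensions and are learned from text rather than typed in by hand, and similarity is often measured with cosine similarity instead of Euclidean distance.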

As found on YouTube