Word2Vec (tutorial)

Siraj Raval · Beginner ·🛠️ AI Tools & Apps ·9y ago

Key Takeaways

This video tutorial by Siraj Raval covers the basics of Word2Vec, a technique for creating word vectors from a corpus of text, using a Game of Thrones dataset. It demonstrates how to process a dataset from scratch, create word vectors, and visualize them using dimensionality reduction techniques like t-SNE.

Full Transcript

I've been listening to a lot of classical... hey, am I live? I must be live right now because I hit the live button, but I can't see myself, so I always have to wait like 15 seconds or so. There I am. Okay, great. Hello world, it's Siraj. How's it going? So today we are here live, and we are going to make some word vectors out of a series of books called Game of Thrones. If you know Game of Thrones, give it a shout-out in the comments; let's see who knows what this is. It doesn't matter if you don't. The point is we are learning about the concept of word vectors: we want to take some books and make them into vectors, and once we have these vectors we're gonna do a bunch of really cool stuff with them. All right, so who's in the house? Let me name some names: we got Jake, Akash, Party, Tahare, Teddy, Recky, Party, Ricardo, Angel. Got a lot of people in the house. All right, let's go ahead and do a five-minute Q&A, and then we're going to get into the code. It's going to be an IPython notebook, and I'm gonna give you guys the data set as well.

So, five-minute Q&A, let's get started. "GloVe or word2vec?" We're using word2vec, not GloVe, although GloVe is similar. I haven't used GloVe before, but I've heard good things. Another question: "Best character in Game of Thrones?" To be honest, I don't even read Game of Thrones. I just picked it because I think the idea is cool. I don't have time to read anything or watch TV shows anymore; I'm just focused on content. "Any maths we are going to do?" Yes, we're going to do some math: we're going to use cosine similarity as a measure of distance between word vectors. "Is it a corpus?" Yes, it's five different books, but we're going to treat it as one big corpus. "Please say my name": Peush. Okay. "Doing the Deep Learning Foundation and feeling utterly lost. Keep pushing through, or going back to school?" Okay, guys, listen. So clearly there's a lot of stuff in deep learning and there's a lot of math,
and from what I've read, I have taken your feedback into consideration: we need to really dive into the math, and we're going to go so deep into the math in the next video. So the next weekly video, get ready, we are going to really dive into the math. "Please say my name": Colin. "What's a word vector?" I'll explain that in a second. "Clean the camera, there's moisture." Okay. "Video isn't clear, bro." Okay, can't help that right now. "Is OpenAI worth exploring?" Yes. "One-minute rap before this, I love your rap." I'll do that. Let me answer some more questions. "How can I classify images from a live feed and put the output in a file?" From a live feed, oh man. We know how to do it recorded, but in a live feed you'd want a live classifier to look at your screen, or maybe segment the screen, so you'd probably want to use JavaScript for that. JavaScript would be easy; ConvNetJS, Andrej Karpathy's library, would be great for that. "The cam looks dirty." Hey man, I was just at the beach recording some cool stuff. "Predict who dies using word vectors?" That's going to be possible. "Are you going to use TensorFlow for this?" No, we're going to use word2vec, and we're going to use a bunch of other smaller libraries. All right, one more question and then we're gonna get started. Here we go: "If we increase vocab size by adding words, will that help?" Yes, it helps if the words are relevant to the problem we're trying to solve, if they're relevant to the story, which is Game of Thrones in this case. Okay, so that's it for the questions.

Let's get started with this. I'm going to start screen sharing; it's going to be an IPython notebook, and then we're going to... oh, that's right, a rap. Let's do a little rap for a second. Let me do a rap on vectors. So, start the tutorial with the rap. Okay, here we go. I rap
about vectors, I do it, man, because I'm Victor. Victor Vector, what's the vector, Victor? I'm on a plane going crashing down like my mind is a scalar, scalar of clouds going down, scalar of clouds going up. I see one, two, three in the sky and it's my enemy. No it's not, it's just mine, and I'm going to fly back to the screen. So here we go, that was it for the rap, and let's get started, because we are getting serious, guys: this is all about word vectors. Yes, it's in English. Favorite movie? The Matrix. The Matrix, how about that. I'm just gonna add to the list of people who said The Matrix. Let's get started. This is an NLP session; we're gonna do some natural language processing. Ready? Okay, let's start screen sharing. Let me move you guys over here. All right, let's get started with this. Let me make sure that this is showing, hold on.

Okay, let me just first say: the goal is to create word vectors from a Game of Thrones data set, and play with them, and analyze them to see semantic similarity. Now look, guys, these are just the five books; I'll just show you guys. Let's look at what we've got here. I can even show you the data set for a second. So these are just the five books for Game of Thrones. They're literally just the text files for the books, these huge text files; that's all they are. I just took five of them; I downloaded them from Pirate Bay, no regrets, you know what I'm saying. So that's what this is, and that's it: we've just got five books in the series. We're going to take all these books and create word vectors from them, and we're going to treat it as one big corpus. So let's just dive right into this, baby.

All right, so the first thing we want to do is import our dependencies. Now, we've actually got a lot of
dependencies for this, so I'm going to explain every single one. The first thing we want to do is import from __future__. Why do we want to import from __future__? Can anyone tell me why in the comments as I type this? The reason is that it's the missing link between Python 2 and Python 3: it allows us to use the syntax from both. It's kind of like a bridge between the two languages, and we're going to import three functions that we're going to use for this. Okay, so that's the first step.

Once we have that, for word encoding: how are we going to encode our words? Well, we're going to use codecs. Next, we're going to perform some regex, and regex is basically for whenever we want to search some file really fast. That's all regex is about: it's a way of quickly and efficiently searching through a large text or number database for what you need. The next one is for logging, so actually we don't need... we need... we don't need a log. Oh, now, I actually haven't talked about concurrency before, so this is going to be interesting. We're going to import this multiprocessing library to perform concurrency. If you don't know, concurrency is a way of running multiple threads and having each thread run a different process, so it's multi-threading, multi-processing; it's a way of having your program run faster. I haven't talked about this before, but we're gonna do this a lot in the future, especially when we get to distributed machine learning, which is later on in this course. So that's it for multiprocessing.

The next one is dealing with the operating system, like reading a file, and for that we want to use the os module. And then we want to do some pretty printing, make it human readable. How do we do that? We're going to import
pretty print: pprint, for sure. Okay, the next one is more regular expressions, so glob. I'll show you guys the difference here, but this is for more granular regular expressions. By the way, this is like step zero: import dependencies. I've got a few more, and then we're going to get started with the actual logic of the code.

So what else do we got? We got the Natural Language Toolkit, because we're gonna be using NLTK. Let me show you guys NLTK for a second; if you don't know, now you know. NLTK, and I've made videos about this before, is awesome. It is so easy to use. Let me zoom in on this thing. It can literally tokenize sentences in single lines of code. So if you have a sentence like "At eight o'clock on Thursday morning Arthur didn't feel very good", you feed that to NLTK and, boom, it'll give you the tokens for each word. Why is this useful? Well, you can have part-of-speech tagging, POS tagging, which means: is this a noun, is this a verb, is this a CD? How does it know these things? It's actually two things. For some of it, it's got a pre-trained model, and another part of NLTK is using the... which I talked about in the last video... like having that database of pre-recorded sentiments: the lexicon, that's the word I'm looking for.

All right, so then we're gonna use word2vec. Word2vec is what we are going to use; that is the real meat of this code, and I'm gonna really deep dive into what word2vec is. At a high level, right now: word2vec is what Google created. Basically, they trained a neural network on a huge data set of words and it created vectors, and we can use these vectors in other ways, so it's like a generalized
collection of word vectors. Okay, I'm going to talk about all this in a second; let me just keep typing out these dependencies. The next one is dimensionality reduction. Once we have our word vectors, they are going to be multi-dimensional; they're going to be 300-plus-dimension word vectors, because they're so generalized, and we want to plot them on a 2D graph so we can see them. How are we going to do that? Well, we're going to perform a technique called dimensionality reduction. I have a video on this called "visualize data set easily"; check it out. Then our math library, which is numpy, and then we're going to import our plotting library, which is going to be matplotlib. And finally... what's the next one? Parse... well, parse data. We've got one more after this. So many libraries, but yo, it's important, because right now we want to talk about the concepts here, and we're going to dive into these specific processes later on. So import pandas as pd, and lastly, here's the last one, visualization: seaborn. Seaborn is going to help us visualize our data set, as sns. Okay, that's it for our dependencies, boom.

All right, now that we've done that... we are not using pre-trained word vectors; we are using word vectors that we train in real time. Let's see: "No module named pyplot". Oh, pyplot. "No module named pandas". Oh, pandas. See, this is why I love IPython notebooks. Okay, so that was that.

Now our next step is to process our data. Step one is to process our data. What does this look like? Before we do anything, we want to clean our data. So how do we clean our data? Well, NLTK has a really handy function for this. The first one is called punkt and the next one is called stopwords. So what does this do? It downloads punkt... it's a tokenizer, it's a pre-trained tokenizer. It's a
pre-trained tokenizer, and what it's going to let us do is tokenize our text. Remember I talked about tokenization last time: tokenization is where we take a sentence or a word... sorry, we take a piece of text and we split it into tokens, and those tokens can be sentences or words or even characters; it's whatever we specify. In this case we're going to do sentences. So that's what punkt does. And stop words are words like "and", "the", "an", "a", you know, "of": words that don't really matter. They don't really have a lot of semantic meaning, and we want to remove these words. Why do we want to do that? So that the vectors we create are more accurate. So we've done that, and it's gonna download, and it says: okay, you've already got them. If we don't have them, it's going to download them.

The next step is to get the book names, matching the text files. Like I said, we had them right here; let me make this bigger. Here are our books right here. They end in .txt, so we can just say: okay, get the book file names. Book file names, and here we use glob: glob is gonna let us get those books that end in .txt, and it's going to print those out for us, to make sure that we actually printed them. All right, and it ends there and it starts here. Okay, boom, boom. No no no no no, there we go. So that's for our text files. Let me print out the books; let's print them out and make sure that we got them. File names, sorted, glob.glob, let's see what we got here.
txt... um, let's see. Hmm. So we should have them here, and if we don't, then the problem is that we... hold on. Sorted, glob.glob, let's see that. Hold on a second. So, let's see, let's see. Interesting. Okay, so it seems like they might be in the root. Yeah, they might be in the root. Hold on a second. Opening, closing folder. Hold on a second, this is actually not... oh, you're right, right, right. What are people saying? "Dot slash". Okay, hold on a second. Oh my God, people are saying some... hold on. Okay, okay. "Invalid syntax". So let's get that current directory. Swear to God, this is so annoying right now. Hold on. Okay, "syntax, book file names", what are you talking about? I just named them right here. Let's see, book file names. Hold on a second. "Invalid syntax". Hold on. We really don't have time for errors like this. My God.

Okay, guys, so I actually had a comment with a lot of upvotes, and the comment was that we should just not spend time actually writing out the code and just show the code, since I'm already reading off of it anyway. So I already have the code; why don't I just show you guys the code, and I'm gonna explain it as we go? Because seriously, I don't have time to look at why this is not working right now. Print book file names, that's exactly what it was. Okay, so we don't have time for this. Let's just talk about this. Let me move this over here, and I've got my notes here as well. Hold on a second. Okay, so where were we? Let me see if this actually works. LS... why... let me just restart this whole IPython notebook and run this from scratch. Okay, let's run it from scratch. IPython notebook, and let me close out everything, and okay, here we go. Hold on, that was... boom. Okay, so let's see,
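
For reference, the file-listing step being debugged above can be sketched with the standard glob module. This is a minimal sketch, not the notebook's exact code: the function name list_book_filenames and the "data" folder are assumptions made for illustration. Note that glob patterns are shell-style wildcards rather than regular expressions.

```python
import glob
import os


def list_book_filenames(data_dir):
    """Return the sorted paths of all .txt files directly inside data_dir."""
    # "*.txt" is a shell-style wildcard pattern, not a regular expression.
    pattern = os.path.join(data_dir, "*.txt")
    return sorted(glob.glob(pattern))


if __name__ == "__main__":
    # "data" is a hypothetical folder holding the book text files.
    for filename in list_book_filenames("data"):
        print(filename)
```

Sorting the result keeps the books in a stable order, which matters later because the files are concatenated into one corpus.
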
let's have it give it to us, okay, and we'll try out this one. So let's just run what we have here. %pylab inline, "Populating...", log, log, log, boom. Okay. File names. Okay, so clearly we're having a problem right now, and the point is... okay, so this is pre-compiled anyway, and I'll talk about what this is doing, and we're gonna really deep dive into the important concepts here. So we're gonna keep going onward; we don't have time to stop with this right now.

So we got our book file names, right, and the next step is to combine the books into one string. Why do we want to do this? Because we want to have one corpus for all of those books, and that's what this does. We initialize a raw corpus; we say u"" (let me make this bigger). We start with u because it's Unicode, right, it's a Unicode string, and we want to convert it into a format that we can read easily. What is that format? UTF-8, right here. So this is where the codecs library comes into play: we are using the codecs library to read in the book file and convert it into UTF-8 format. Now, remember that corpus_raw variable we just initialized up here? Well, now we want to add all of the books that we see to that corpus, and the way we're going to do that is we're going to add it all to this corpus_raw, and at the end of it, it's going to have all of those books in one variable in memory, corpus_raw, which is going to be a very, very big variable. So that's the first step.

Once we have that, we're going to split the corpus into sentences. Remember when I said we downloaded that punkt model right up here? Let me show you guys: nltk.download, punkt. Well, now we're gonna actually load that into memory. It's a trained model, and it's loaded in as a byte stream, and that's what pickle is.
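
The step of combining the books into one string, described above, might look like the following. It is a minimal sketch under assumptions: the helper name read_corpus is made up here, and it uses io.open (which accepts an encoding on both Python 2 and Python 3) rather than the codecs call used on the stream.

```python
import io


def read_corpus(filenames):
    """Read each file as UTF-8 and concatenate everything into one string."""
    corpus_raw = u""
    for filename in filenames:
        # Decoding happens while reading, so corpus_raw is always Unicode text.
        with io.open(filename, "r", encoding="utf-8") as book_file:
            corpus_raw += book_file.read()
    return corpus_raw
```

Passing the sorted list of book file names keeps the books in reading order inside the combined corpus.
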
It's that file format that we can load as a byte stream. That's what that does, and it's going to load it up into this tokenizer variable. This tokenizer is pre-trained; it turns words into tokens, and the type of tokens we want are sentences in our case. So we'll use the tokenizer to tokenize that corpus, which is every single word we have, right? (Let me open this.) Every single word we have, and this could be anything, guys: any piece of text you want, any book, anything you download, any big piece of text. The same principles apply. And we're going to put those all into this raw_sentences variable.

Once we have that raw_sentences variable, we're going to convert it into a word list. So what do I mean by a word list? Hold on a second. For our word list we want a list of words, but what exactly do I mean by that? Let me see the comments, what are you guys up to? Okay, so let me comment this as well: convert into a list of words, remove unnecessary characters. We want to remove unnecessary characters; that's why we have that a-to-z, A-to-Z. We want to split into words, no hyphens, you know, and it's gonna be a list of words. So that's an array... sorry, it's a list of words. Once we have that list of words for each sentence, we're going to take those words and tokenize it, so it's going to be a sentence where each word is tokenized. That's what this does: for every sentence that we have, we initialize an empty sentence list, we add sentences to it, and then once we have that... all right, I've got some comments that I should slow down a little bit. I can do that. There's so much I'm trying to cover, so I'll slow down. Okay, so we
have a list of words; each word is tokenized. Each list of words is considered a sentence, and we can see that when we print it out. So we have our raw sentence, which is "He was an old man, past fifty, and he had seen the lordlings come and go." This is a sentence taken directly from Game of Thrones, and it's at the fifth index, so because we combine all of our books in order, this is going to be the fifth sentence in the first book, because it's one big corpus. Once we have that, we're going to convert it into a word list, which is tokens. And see these u characters? They basically help us convert them into their Unicode representations. Whenever we're dealing with words, any kind of text, we have to make sure it's in the right format, and Unicode is the format we want for vectors, and UTF-8 is the format for reading from files. So once we have our book corpus, let's print out how many tokens we have. What we're going to do is consider each sentence a token, and we can print out: there's 1,800, you know, 10,000 tokens. Okay. So, no... we're gonna get into vectors; I'm going to explain vectors in a second in detail. We haven't got to that part. So that's what we've just done.

Now we're going to train word2vec. These are our hyperparameters. Let's talk about vectors for a second. I'm sure I can find a great image for this in a second. So word embeddings are here; TensorFlow probably has a great image for this. Okay, here's a great one. Copy image address, let's blow this image up, let's get it really big. So here are vectors, for example. Look at the one on the left: king, man, woman, queen. These are word vectors. So when we have a set of words, let's say, you know, masculine: John, you
know, Sergey, you know, words that are like men, right? So words that all have the same semantic similarity: we can generalize all these words into a vector representation. We also call them word embeddings. There's a lot of terms here, but we want to make sure we're not confused, because it's all really the same thing: word embeddings, word vectors, same thing. So if we have a set of words, we can generalize them into "man". That's what "man" is. So man, woman, king, queen: these are generalized vector representations that we have created after training on a very large corpus of text. So when we have a new word, like, what's something manly? Right now this example is a gender landmine, but let's pick an example: what do men have that women don't? A penis, right? That's the only thing I can think of that's literally true and doesn't cross any kind of, you know, other things. So "penis": if we were to feed it to a trained word2vec model, it would see that "penis" is closest to "man". Why does it know this? Because it's trained on a corpus of text to eventually find the similarity, that generalized representation across all of these words. And how does it do this? Well, it converts words into vectors, and then it creates vectors of vectors, and when we plot these vectors out, we can see how similar different things are. Yeah, "beard" was a great one as well. Okay, so "balls", okay, because it's attached, right.

And so once we have these vectors, we can see what's similar to them. We can do all sorts of things; mainly there are three things we want to do. Let me write this down; this is very important. So once we have vectors, there are three main tasks, tasks that are... some women have
beards... so, three main tasks that vectors help with. Here are the three main tasks: distance, similarity, and ranking. Those are generally what they help us with. What do I mean by that? Similarity: vectors are good if you want to see how similar two words are, which we're going to do later on. Ranking: if we wanted to browse all the scientific papers in the world, we'd create vectors out of all of them, and then we'd rank them in terms of some metric that we decide, like which one has the most information on, say, climate change. We could then rank them semantically, so it would say: here's the number one paper that has the most word usage about the topic of climate change. And it uses vectors for that. These are really useful.

Okay, so let's go ahead and talk about these hyperparameters. num_features is the dimensionality of these word vectors. We say 300, but we could say 400, we could say 500. The more dimensions, the more computationally expensive to train, but also more accurate. So the more dimensions a vector has, the more generalized. We're gonna say 300 for now. Minimum word count threshold is the smallest set of words that we want to recognize when we convert to a vector. The number of threads to run in parallel... so, "what is the actual structure of a vector?" That's a great question. The definition of a vector, and we're going to talk about this in the next video, but the
def... we're going to really deep dive into it later, but the definition of a vector is: it's a set of numbers, in this context, in machine learning. In physics it's got a different context; it talks about a direction. In our case we're just talking about a set of numbers, so we could think of it as a huge list of numbers. That's kind of what it's represented as. And in TensorFlow we use the word tensor, because a tensor is an n-dimensional array of numbers, so a vector is a type of tensor. This doesn't really belong here, I'm just saying this right now: a vector is a type of tensor.

So then, the number of threads to run in parallel: this is where that multiprocessing library that we imported comes into play. We want to say how many workers we have; the more workers we have, the faster we train. Context window length is the size of what we're looking at at a time; if we think of it as looking at blocks of seven words at a time, that's the context window. Downsample setting for frequent words is... hold on, "scalar, vector, matrix, tensor": exactly, we're going to be talking about those in the next video. But the downsample setting for frequent words: once our trained word2vec model is noticing a lot of frequent words, we don't want to have to look at them constantly. Any number between 0 and 1e-5 is good for this; that's generally what we found is a good setting. But basically: how often do we want to look at the same word? The more frequent a word is, the less we want to use it to create vectors, because it's already a part of our trained model. Then the seed is for the random number generator, so that's what that is: random
number generator. And why do we use a random number generator? We use it to pick what part of the text we're going to look at to turn into vectors. And the seed makes sure that it's deterministic; this is good for debugging. Okay, so this is our actual model right here, a word2vec model we imported from the Gensim library. Let me show you guys Gensim for a second. Gensim is super useful; it's for topic modeling. Basically you give it any kind of corpus like this, and it'll create a model and train it; you can save it, you can load it later on, and then given some words like "woman" and "king", something that was in the actual corpus, it'll give you things like how similar they are, what doesn't match, and it'll give you the straight-up vector so you can use it later on. Gensim's a great library. So that's gonna actually train our model, and since our model is relatively small in the context of deep learning, this is only going to take 30 seconds or so to train. All right. And it's actually not the same definition as in physics, and there's a lot of debate about this. Actually, I've been looking at a lot of Stack Overflow answers and a lot of Quora answers, and it's crazy how much people are debating over these words in machine learning. But the point is it's just numbers: it's number representations that we create and then can feed into our model.

So then we're going to build our vocabulary using those sentences. This is how we load the corpus into memory. We haven't actually trained it; we've built our model. This is step three, build our model, which I should have written up here: step three is build model. So once we've built our model, we have loaded the corpus that we cleaned into memory.
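
To make two of the hyperparameters above concrete, here is a small pure-Python sketch of what a minimum word count and a context window mean. It is an illustration only and not Gensim's implementation; the function names build_vocab and context_pairs are hypothetical, and in this sketch the window counts words on each side of the center word.

```python
from collections import Counter


def build_vocab(sentences, min_count):
    """Keep only words that occur at least min_count times across all sentences."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, count in counts.items() if count >= min_count}


def context_pairs(sentence, window):
    """Yield (center, context) pairs for words at most `window` positions apart."""
    for i, center in enumerate(sentence):
        start = max(0, i - window)
        end = min(len(sentence), i + window + 1)
        for j in range(start, end):
            if j != i:
                yield center, sentence[j]


if __name__ == "__main__":
    # Toy tokenized sentences, made up for this example.
    sentences = [["the", "king", "rides"], ["the", "queen", "rides"]]
    print(sorted(build_vocab(sentences, min_count=2)))
    print(list(context_pairs(sentences[0], window=1)))
```

Raising min_count shrinks the vocabulary by dropping rare words, and widening the window produces more (center, context) pairs per sentence, which is part of why both settings affect training time.
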
We printed out the size of it, and now we can start training. It's going to train on all of those sentences we gave it. It's going to take 30 or 40 seconds, and when it's done training we're going to save the file for use later on, and we can easily do that using that os module; it's great for that. We can save it and then load it later on. In fact, we can load it right now. Once we have that, we're going to compress those. So that's going to be 300-dimensional word vectors once it's trained. This thrones2vec model right here that we've trained is going to contain all of those word vectors that we just trained. Everything is in memory right here, but these are 300-dimensional word vectors. We cannot map a 300-dimensional word vector on a plot for us puny humans to see.

How are we going to do that? Well, we're going to use a method called t-SNE, which stands for t... what is it... t-distributed stochastic neighbor embedding. And I have a great video on that. It's an awesome technique. (This is not useful.) Let me just type the video name for that. PCA is another. My video, what's it called... "how to visualize a data set easily". I really dive into this method right there. In a nutshell, t-SNE takes our 300-dimensional vector and squashes it into just two dimensions. Why? So that we can then plot it and view it. How does it do this? It's actually a long explanation, and the video is great for that: five-minute explanation, definitely check it out. So that's what t-SNE does: it's going to squash those vectors. And we're gonna take all those vectors and put them in one gigantic matrix. We've initialized t-SNE here, but we haven't trained t-SNE. t-SNE is a model; it's a machine
learning model, and we have to train it. So we'll train it on that word vector matrix, and this is going to take a minute or two, like it says. It's going to create this word vector... it's a 2D matrix, so this is one gigantic matrix, and it's got the points for the plot in it.

So then we're going to plot what we've got. What do I mean by plot? Well, we want to plot it in 2D space. For every word we have in that vocab, we want to have three columns: the word, the x-coordinate, and the y-coordinate. Now, how does it get these coordinates? Well, that's what t-SNE does: not only is it squashing these vectors into two-dimensional vectors, it's also giving us the x and y coordinates of those vectors in two-dimensional space. So these are all words from that corpus, right; these are all Game of Thrones-y words. Once we've got that, we're going to plot them on a graph, and this is where matplotlib comes into play. We're going to plot these points on a graph, and it's a lot. These are our word vectors; there's a lot of them here, and we've brought it down to scale so we can see a lot of them, but all of our word vectors, or word embeddings, whatever you want to call them, are here in 2D space.

Now what are we going to do with them? Well, we can see what vectors are close to each other. Let's start with that: let's see what vectors are close to each other and what that tells us about the data. The first thing we want to do is zoom in on this, and that's what this function does: it creates a bounding box of x and y coordinates in that graph that we have, and it shows just that bounding box. So then we'll use that; we'll say, okay, in the XY bounds of these coordinates
that we give it, let's see what it gives us. This is what it gives us when we look in this corner. Well, what are all these? These are names: Barristan, Greer, Cleon, Sandor. These are all names, and they look like male names as well, right? They look pretty much male. So, interesting: just by training and creating vectors and plotting them in a two-dimensional graph using t-SNE (I removed stop words and special characters; stop words were included in that list, at least in NLTK), it shows that these words are all close to each other, because it knows that the distance between these words is small, so it graphs them very close to each other. It's pretty cool, right?

So if we look at a different region, we'll see that, hey, this is food, right? This is a totally different region: pepper, pickled, you know, cod, olives, turnips. These are all similar words. And I want to really stress the importance, the brevity of vectors. Okay, not the brevity, that's the wrong word: the enormous awesomeness of vectors. Word vectors form clusters of related words, and there's so much we can do with this in every field. In law, we could train an AI judge using semantic similarity to see the differences between different case data. For doctors, we could see what's similar. We could use this to find new drugs: what is the cosine similarity, or some similarity metric that we define, between a corpus of scientific papers on some problem that we're trying to solve?

So this most_similar function is already created for us, but basically we'll give it "Stark" and then it'll show the similarity as a number, so it'll rank them. All of these names are similar to Stark, right? And how are we doing this? Well, there's a lot of methods for measuring
semantic similarity. There's a lot of methods for measuring the similarity between vectors, and the one we're using here is the cosine similarity. So let me bring that up. The cosine similarity is one possible distance metric we can use, because turning these words into vectors, turning our videos into vectors, turning our images into vectors gives us a way to mathematically reason about these things. We can reason about them just like we reason about numbers, in a mathematical way, right? So this is the formula for the cosine similarity: given two vectors, we can use the dot product and the magnitudes of those vectors to calculate it. So that's one method. There's a lot. There's like the [unclear] similarity, I think it's called, or, sorry, the [unclear]... anyway, there's a lot, and I'll link more of them in a different one.

So anyway, then we'll use that to say, okay, given these three words, Stark, Winterfell, and Riverrun, it'll say: Stark is related to Winterfell as X is related to Riverrun, and what is X? That's what this does. It's going to measure the similarity between the first two parameters we give it, and then it's going to say, for that similarity, be that 0.5 or 0.6 or 0.7, what is something that is similar to that last parameter, Riverrun? And it'll find that from our list of existing vectors, and in our case it'll be Tully, Tyrion, Dany, things like that.

Okay, so there's that. That's the end of this already, and because I didn't type out the code it went a little fast. But yeah, so let's, wait, where is my screen here? Let me stop screen sharing and go back to this. Oh man. Okay, hi guys, let's do our ending five-minute Q&A and then we're going to end this live stream. Okay, I have an awesome video
coming out for you guys, so I've been working really hard too. Thank you, srum, I really appreciate it. I'm really excited about this next video. I'm using a MacBook. Thanks, vocalize. What else? Thanks, party. "Can you please elaborate on how is the projection of each word to the coordinate work?" To the coordinate work, what do you mean? "What are real-life applications of this?" Great questions, thanks Z Roo.

Okay, real-life applications of word vectors: take any piece of text. Take a book. After this live stream, download an ebook, convert it to a text format, and then use the code that I give you, and you can easily feed it to word2vec and create vectors. What do you do with these vectors? Well, besides the similarity and the distance, what's a good application for vectors? Like, download a corpus of what your friends are saying, and you could rank personalities, you know, this guy's chats versus this guy's chats, or what he said in a speech versus what he said... If you want to compare, you know, Hitler to Trump (I just went political, not trying to go political, but I just did), if you want to compare anything, word vectors are good for that. Ranking, too. Words are everywhere, guys. Any kind of similarity or ranking, that's what it's for, and there's a lot of possibilities.

Okay: "How are the words assembled into vectors? Is it just context? Are there any ontological network being built? If so, how?" Right, so when Google released word2vec, they trained a neural network on these vectors, and it was a labeled corpus of words, and actually we could do this unsupervised as well. But basically we convert single words to vectors, and that gives us a number like 0.8 or 0.9, and once we have those vectors, then we could create even more generalized vectors by looking
at the similarity, and the similarity could be like, you know, this is 0.9999 and this is 0.9998, and these are very similar, and that creates even more generalized vectors. Okay, so anyway, guys, I've got to go. I've got some editing to do, some shooting to do, some Udacity talking to do, and I love you guys. The next live stream is going to be much more, you know, dope, not that this one wasn't. They're all dope. I'm dope, you guys are dope, we're all dope. So for now I gotta go take a chill pill. Thanks for watching, love you guys.
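
The cosine similarity and the "Stark is to Winterfell as X is to Riverrun" question from the stream can be sketched in a few lines of plain Python. This is a minimal illustration, not the notebook's code: the three-dimensional `toy_vectors` are made up, the helper names `most_similar` and `analogy` are hypothetical, and the analogy uses simple additive vector arithmetic, which may differ from what the notebook's helper does.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional vectors, invented for illustration only.
# Real word2vec vectors are learned from text and have many more dimensions.
toy_vectors = {
    "stark":      [1.0, 0.0, 1.0],
    "winterfell": [1.0, 1.0, 0.0],
    "tully":      [0.0, 0.0, 1.0],
    "riverrun":   [0.0, 1.0, 0.0],
    "pepper":     [-1.0, 0.2, -0.5],
}

def most_similar(word, vectors, topn=3):
    """Rank every other word by its cosine similarity to `word`."""
    target = vectors[word]
    scored = [
        (other, cosine_similarity(target, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

def analogy(a, b, c, vectors):
    """Answer 'a is to b as X is to c' with the assumption X ~ a - b + c."""
    query = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = {w: v for w, v in vectors.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine_similarity(query, candidates[w]))

ranking = most_similar("stark", toy_vectors)
answer = analogy("stark", "winterfell", "riverrun", toy_vectors)
```

With these toy values, `ranking` lists the words nearest to "stark" by angle, and `answer` is the word whose vector points most nearly the same way as `stark - winterfell + riverrun`. Cosine similarity only compares direction, so two vectors of very different length can still score close to 1.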

Original Description

In this video, we'll use a Game of Thrones dataset to create word vectors. Then we'll map these word vectors out on a graph and use them to tell us related words that we input. We'll learn how to process a dataset from scratch, go over the word vectorization process, and visualization techniques all in one session.

Code for this video: https://github.com/llSourcell/word_vectors_game_of_thrones-LIVE
Join us in our Slack channel: http://wizards.herokuapp.com/

More learning resources:
https://www.tensorflow.org/tutorials/word2vec/
https://radimrehurek.com/gensim/models/word2vec.html
https://www.kaggle.com/c/word2vec-nlp-tutorial/details/part-1-for-beginners-bag-of-words
http://sebastianruder.com/word-embeddings-1/
http://natureofcode.com/book/chapter-1-vectors/

Please subscribe. And like. And Comment. That's what keeps me going. And please support me on Patreon: https://www.patreon.com/user?u=3191693

Follow me:
Twitter: https://twitter.com/sirajraval
Facebook: https://www.facebook.com/sirajology
Instagram: https://www.instagram.com/sirajraval/

Signup for my newsletter for exciting updates in the field of AI: https://goo.gl/FZzJ5w
Hit the Join button above to sign up to become a member of my channel for access to exclusive content!
Join my AI community: http://chatgptschool.io/
Sign up for my AI Sports betting Bot, WagerGPT! (500 spots available): https://www.wagergpt.co

This video tutorial teaches you how to create word vectors from a corpus of text using Word2Vec and visualize them using dimensionality reduction techniques. You'll learn how to process a dataset from scratch, create word vectors, and use them for natural language processing tasks.

Key Takeaways
  1. Download a corpus of text
  2. Preprocess the text data by tokenizing and removing stop words
  3. Create a Word2Vec model and train it on the preprocessed text data
  4. Use the trained model to create word vectors
  5. Visualize the word vectors using dimensionality reduction techniques like t-SNE
💡 Word2Vec is a powerful technique for creating word vectors that capture semantic similarity between words, and can be used for a variety of natural language processing tasks.
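
Steps 1 and 2 above can be sketched with the standard library alone. This is a rough illustration under stated assumptions, not the code from the video: the stop-word set is a tiny hypothetical stand-in for a real list such as NLTK's, the sentence splitter is deliberately naive, and the function names are invented here. The output, a list of token lists (one per sentence), is the general shape that Word2Vec trainers such as gensim's typically take as input.

```python
import re

# Hypothetical, tiny stop-word list for illustration; a real list is much longer.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to"}

def split_sentences(raw_text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", raw_text)
    return [part.strip() for part in parts if part.strip()]

def sentence_to_tokens(sentence):
    """Keep letters only, lowercase everything, and drop stop words."""
    words = re.sub(r"[^a-zA-Z]", " ", sentence).lower().split()
    return [word for word in words if word not in STOP_WORDS]

def preprocess(raw_text):
    """Turn raw text into a list of token lists, skipping empty sentences."""
    token_lists = (sentence_to_tokens(s) for s in split_sentences(raw_text))
    return [tokens for tokens in token_lists if tokens]

corpus = "Winter is coming. The king in the North rides to Winterfell!"
sentences = preprocess(corpus)
```

For a real corpus, a trained sentence tokenizer would handle abbreviations and quotations far better than this regular expression, and whether to remove stop words at all is a judgment call that depends on the task.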
