Create a Large Language Model from Scratch with Python – Tutorial

freeCodeCamp.org · Intermediate · 🧠 Large Language Models · 2y ago

Skills: LLM Foundations 90%, LLM Engineering 80%, Prompt Craft 60%

Key Takeaways

This video tutorial demonstrates how to build a large language model from scratch using Python, covering data handling, math, and transformers behind large language models.

Full Transcript

Learn how to build your own large language model from scratch. This course goes into the data handling, math, and transformers behind large language models. Elliot Arledge created this course. He will help you gain a deep understanding of how LLMs work and how they can be used in various applications. So let's get started.

Welcome to Intro to Language Modeling. In this course you're going to learn a lot of crazy stuff. I'm just going to give you a heads up: it's going to be a lot of crazy stuff we learn here. However, it will not be insanely hard. I don't expect you to have any experience in calculus or linear algebra. A lot of courses out there do assume that, but I will not. We're going to build up from square one. We're going to take baby steps when it comes to new fundamental concepts in math and machine learning, and we're going to take larger steps once things are fairly clear and sort of easy to figure out. That way we don't take forever just taking baby steps through every little concept.

This course is inspired by Andrej Karpathy's building a GPT from scratch lecture, so shout out to him. We don't assume you have any experience, maybe three months of Python experience, just so the syntax is sort of familiar and you're able to follow along. But no matter how smart you are or how quick you learn, the willingness to put in the hours is the most important, because this is material you won't normally come across. So as long as you're able to put in that constant effort and push through these lectures, even if it's hard: take a quick break, grab a snack, whatever you need to do, grab some water (water's very important), and hopefully you can make it to the end of this. You can do it.

Since it's freeCodeCamp, everything will be local computation, nothing in the realm of paid datasets or cloud computing. We'll be scaling the data to about 45 GB for the entire training dataset, so I have 90 GB reserved so we can download the initial 45 and
then convert it to an easier-to-work-with 45. If you don't actually have 90 GB reserved, that's totally fine. You can just download a different dataset and sort of follow the same data pipeline that I do in this video.

Through the course you may see me switch between macOS and Windows. The code still works all the same on both operating systems. I'll be using a tool called SSH. It's a server that I can connect to from my MacBook to my Windows PC that I'm recording on right now, and that will allow me to execute, run, build, do anything coding related or command prompt related on my MacBook. So I'll be able to do everything on there that I can on my Windows computer. It'll just look a little bit different for the recording.

So why am I creating this course? Like I said before, a lot of beginners don't have the fundamental knowledge, like calculus and linear algebra, to help them get started or accelerate their learning in this space. So I intend to build up from baby steps, and then larger steps when things are fairly simple to work with, and I'll use logic, analogies, and step-by-step examples to help conceptualize rather than just throw tons of formulas at you. With that being said, let's go ahead and jump into the good stuff.

In order to develop this project step by step, we're going to use something called Jupyter notebooks, and you can sort of play with these in the Anaconda Prompt, or at least launch them from here. Anaconda Prompt is just great for anything machine learning related, so make sure to have this installed. I will link a video in the description so that you can set this up, with a step-by-step install guide in there. What we can do from this point is set up our project and initialize everything. So what I'm going to do is head over into the directory that I want, which is going to be python testing. We're going to make a directory, freeCodeCamp GPT course, and then from this point we're
going to go and make a virtual environment. Initially, on your desktop, you will have all of your Python libraries, all your dependencies, just floating around. What the virtual environment does is separate that, so you have this isolated environment over here and you can play around with it however you want. It's completely separate, so it won't really cross with all of the global libraries that you have, the ones that affect the system when you're not in a virtual environment, if that makes sense.

We're going to set that up right now by using python -m, then venv for virtual env, and then cuda. The reason why we say cuda here is because later, when we try to accelerate our learning, or the model's learning, we're going to need to use GPUs. GPUs are going to accelerate this a ton, and basically CUDA is just that little feature in the GPU that lets us do that. So we're going to make our environment called cuda. Press enter, and it's going to do that for us. It's going to take a few seconds.

Now that that's done, we can activate this environment so we can start developing in it. We go cuda, backslash, Scripts, and then activate. Now you can see it says (cuda) (base), so we're in cuda and then secondarily base, so it's going to prioritize cuda.

From this point we can start installing some libraries. We can go pip3 install matplotlib, numpy, pylzma, and then ipykernel. This is for the actual Jupyter notebooks and being able to bring the cuda virtual environment into those notebooks, so that's why that's important. And then just the actual Jupyter notebook feature. Press enter and those are going to install. That's going to take a few seconds to do.
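A quick way to confirm the setup described above from inside Python is sketched here. It is only a sanity check, not part of the course code: the helper names `in_virtualenv` and `installed` are made up for illustration, and the import names in `wanted` are assumptions about how the installed packages are imported.

```python
import importlib.util
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


def installed(modules):
    """Map each top-level module name to whether it can be found for import."""
    return {name: importlib.util.find_spec(name) is not None for name in modules}


# Assumed import names for the packages installed with pip3 above.
wanted = ["matplotlib", "numpy", "pylzma", "ipykernel", "notebook"]

print("virtual environment active:", in_virtualenv())
for name, ok in installed(wanted).items():
    print(f"{name:12s} {'ok' if ok else 'missing'}")
```

If a package shows up as missing, rerunning the pip3 command with the environment activated is the first thing to try.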
What might actually happen is you'll get a build error with pylzma, which is a compression algorithm. Don't quote me on this, but I'm pretty sure it's based in C++, so you need some build tools for this, and you can get those with Visual Studio Build Tools. So you might see a little error. Basically, go to that website and you're going to get this right here, so just go ahead and download Build Tools. It's going to download here, you click on that, it's going to set up, and then you click Continue. At this point you can click Modify if you see this here, and then you might get to a little Workloads section. Once you're at Workloads, make sure that you have these two checked off right here. I'm not sure what desktop particularly does. It might help, but it's kind of good to have some of these build tools on your PC anyway, even for future projects. So just get these two for now, and then you can click Modify over here, just like that, and then you should be good to rerun that command.

From this point we're going to install torch, and we're not just going to do it by using pip3 install torch. What we're actually going to do is use a separate command, and this is going to install CUDA with our torch. It's going to install the CUDA extension, which will allow us to utilize the GPU. So it's just this command right here. If you want to find a good command to use, go to the PyTorch docs, go to Get Started, and you'll be able to see this right here. We have Stable, Windows, Pip, Python, then CUDA 11.7 or 11.8. I just clicked on this, and since we aren't going to be using torchvision or
torchaudio, I basically just did pip3 install torch, and then with this index URL for CUDA 11.8. That's pretty much all we're doing there to install CUDA as a part of our torch, so we can go ahead and press enter on this.

Great, we've installed a lot of libraries and a lot of setup has been done already. What I want to check now is that our Python version is what we want. Python version 3.10.9, that's great. If you're between 3.9, 3.10, and 3.11, that's perfect.

At this point we can jump right into our Jupyter notebook. The command for that is just jupyter notebook, spelled like that. Press enter and it's going to send us into here. I've created this little bigram.ipynb here in my VS Code. You need to actually type some stuff in it, and you need to make sure that it has the .ipynb extension or else it won't work. If it's just .ipynb and doesn't have anything in it, it can't really read that file for some reason. So just make sure you type some stuff in it. Open that in VS Code and type, I don't know, a = 3 or str = "banana", I don't care.

Let's go ahead and pop into here. This is what our notebook's going to look like, and we're going to be working with this quite a bit throughout this course. What we need to do next is make sure that our virtual environment is actually inside of our notebook, and that we can interact with it from this kernel rather than just through the command prompt. So we're going to check here. I have a virtual environment here; you may not. We're going to end this, and we're going to do python -m, and then ipykernel install --user (you'll see why we're doing this in a second), --name=cuda. This is from the virtual environment we initialized
before, so that's the name of the virtual environment. And then the display name, how it's actually going to look in the terminal, is going to be --display-name. We'll just call it cuda-gpt, I don't know, that sounds like a cool name. Press enter and it's going to make this environment for us. Great, installed.

So we can run our notebook again and see if this changes. We pop into our bigram again: Kernel, Change kernel, boom, cuda-gpt. Let's click that. Sweet. So now we can start experimenting with how the notebooks work, how we can build up this bigram model, and learning how language models work from scratch. Let's go ahead and do that.

Before we jump into the actual code here, what I want to do is delete all of these. Good. Now what I'm going to do is get a very small dataset for us to work with, so we can try to make a bigram out of something very small. We can go to this website called Project Gutenberg. They basically just have a bunch of free books that are licensed under Creative Commons, so we can use all of these for free. Let's use The Wizard of Oz. Great. What we want to do is click on Plain Text here. Now we can press Ctrl+S to save this, and we can just name it wizard_of_oz. Good. Now we should drag this into our folder here, so I'm just going to pop that into there. Did that work? Sweet.

So now we have our Wizard of Oz text in here. We can open that. "Start of this book", okay. We go down to when it starts. Sweet. Maybe we'll just cut it here, that'd be a good place to start, just like that, then put a few spaces. Good. So now we have this book. Go to the bottom here, just to get rid of some of this
other licensing stuff, which might get in the way of our predictions in the context of the entire book. So let's go down to where that starts, "End of the book". Okay, so we've gotten all that. That is done, the end of the illustration there. Perfect.

So now we have this Wizard of Oz text that we can work with. Let's close that up. 233 kilobytes, awesome. Very small size, we can work with this.

So we have this wizard_of_oz.txt file, and what are we going to do with it? We're going to try to train a transformer, or at least a bigram language model, on this text. In order to do that, we need to learn how to manage this text file, how to open it, etc. So we're going to open wizard_of_oz.txt like that, in read mode, and we're going to use the encoding utf-8. This is the file mode that you're going to open in. There's read mode, there's write mode, there's read binary, there's write binary, and those are really the only ones we're going to be worrying about for this video. The other ones you can look into in your spare time if you'd like, but we're just going to be using those for now. And then the encoding is just what type of character encoding we're using. That's pretty much it. We can open this as f, short for file, and we're going to go text = f.read(). We're just going to read this file and store it in a string variable. Then we can print some stuff about it. We can print the length of this text. Run that, and we get the length of the text. We could print the first 200 characters of the text. Great. So now we know how to play with characters, or at least see what the characters actually look like.

Now we can do a little bit more from this point, which is going to be encoders. Before we get into that, I'm going to put these into a little vocabulary list that we can work with. We're going to make a chars variable. chars is going to be all the characters in this text piece, so we're going to make a sorted set of text here, and we're going to print out chars. Look at that, we have a giant array of all these characters.

Now we can use something called a tokenizer. A tokenizer consists of an encoder and a decoder. What an encoder does is convert each character, or rather each element of this array, to an integer. So maybe a new line, or an enter, would be a zero, a space would be a one, an exclamation mark would be a two, etc., all the way to the length of them. We could even print the length of these characters so we can see how many there actually are. So there are 81 characters in the entire Wizard of Oz book.

I've written some code here that is going to do that job for us, the job of tokenizers. What we do is use some generator for loops here, and we make a little mapping from strings to integers and integers to strings, given the vocabulary. We just enumerate through each of these. We
have the first element assigned to a one, the second assigned to a two, etc. That's basically all we're doing here, and we have an encoder and a decoder.

Let's say we wanted to convert the string hello to integers. We go encode, and we do hello, just like that, and then we print this out. Perfect. Let's run that. Boom. So now we have a conversion from characters to integers. If we wanted to convert this back, so decode it, we could store this in encoded_hello, and then decoded_hello is equal to decode of encoded_hello. So we're going to encode this into integers and then decode the integers back to a character format. Let's print out decoded_hello. Perfect, so now we get that.

I'm going to fill you in on a little background information about these tokenizers. Right now we're using a character-level tokenizer, which takes each character and converts it to an integer equivalent. So we have a very small vocabulary and a very large amount of tokens to convert. If we have 40,000 individual characters, that means we have a small vocabulary to work with but a lot of characters to encode and decode. If we work with a word-level tokenizer, that means we have a ton, like every single word in the English language, and if you're working with multiple languages this could be a very large amount of tokens. So you're going to have maybe millions or billions or trillions if you're doing something weird, but in that case you're going to have a way smaller set to work with. You're going to have a very large vocabulary but a very small amount to encode and decode. If you have a subword tokenizer, that means you're going to be
somewhere in between a character-level and a word-level tokenizer, if that makes sense. In the context of language models, it's really important that we're efficient with our data, and just having a giant string might not work the best.

We're going to be using a machine learning framework called PyTorch, or torch. I've imported this right here, and pretty much what this is going to do is handle a lot of the math, a lot of the calculus for us, as well as a lot of the linear algebra, which involves a type of data structure called tensors. Tensors are pretty much matrices. If you're not familiar with those, that's fine, we'll go over them more in the course. Pretty much what we're going to do is put everything inside of a tensor so that it's easier for PyTorch to work with.

I'm going to delete these here, and what we can do is make our data element. This is going to be the entire text data of the entire Wizard of Oz. We make this data equals torch.tensor, and then encode, and we put the text inside of that. So we're going to encode this text right here, and we make sure that we have the right data type, which is torch.long: dtype=torch.long. This basically means we're going to have this as a super long sequence of integers. Let's go see what we can do with this torch.tensor element right here. I've written a little print statement where we can print out the first 100 characters, or 100 integers, of this data. It's pretty much the same thing as working with arrays. It's just a different type of data structure, in the context of PyTorch, that's sort of easier to work with in that way. PyTorch primarily revolves around tensors and modifying them: reshaping, changing dimensionality, multiplying, doing dot products. That sounds like a lot, but we're going to go over some of this stuff later in the course, just about how to do all this math. We're going to go over examples on how to multiply this matrix by this matrix, even if they're not the same shape, and even dot producting, that kind of stuff.

The next thing I'm going to talk about is something called validation and training splits. Why don't we just use the entire text document and only train on that, the entire text corpus? Well, the reason we split into training and validation sets I'm going to show you right here. We have this giant text corpus. It's a super long text file. Think of it as an essay, but a lot of pages. So this is our entire corpus, and we make our training set 80% of it, so maybe this much, and then the validation is this other 20% right here. If we were to train on the entire thing, after a certain number of iterations it would just memorize the entire text piece, and it would be able to simply write it out. It would have the entire thing memorized and it wouldn't really get anything useful out of that. It would only know this document. But the purpose of language modeling is to generate text that's like the training data, and this is exactly why we put it into splits. If we run our training split right here, it's only going to know 80% of that entire corpus.
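The steps so far (building the character vocabulary, encoding and decoding, and the 80/20 split) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the course's exact code: the short `text` string stands in for the book, the names `string_to_int` and `int_to_string` are chosen here for readability, and a plain list is used where the course wraps the encoded text in a `torch.tensor` with `dtype=torch.long`.

```python
# Stand-in for the book text; the course reads wizard_of_oz.txt instead.
text = "hello world\nhello oz!"

# Vocabulary: every distinct character, sorted so the mapping is stable.
chars = sorted(set(text))
vocab_size = len(chars)

# Lookup tables in both directions (names are illustrative).
string_to_int = {ch: i for i, ch in enumerate(chars)}
int_to_string = {i: ch for i, ch in enumerate(chars)}


def encode(s):
    """Turn a string into a list of integer token ids."""
    return [string_to_int[c] for c in s]


def decode(ids):
    """Turn a list of integer token ids back into a string."""
    return "".join(int_to_string[i] for i in ids)


# The course stores this as torch.tensor(encode(text), dtype=torch.long).
data = encode(text)

# 80/20 split: the first 80% is for training, the rest for validation.
n = int(0.8 * len(data))
train_data = data[:n]
val_data = data[n:]

print(vocab_size, len(train_data), len(val_data))
print(decode(encode("hello")))
```

Because `enumerate` starts at zero, the first character in sorted order (the newline here) gets id 0, which matches the "new line would be a zero" example in the talk.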
And it's only going to generate on that 80% instead of the entire thing. Then we have our other 20%, which only knows 20% of the entire corpus. The reason why we do this is to make sure that the generations are unique and not an exact copy of the actual document. We're trying to generate text that's like the document. For example, in Andrej Karpathy's lecture he trains on Shakespearean text, an entire piece of Shakespeare, and the point is to generate Shakespearean-like text, but not exactly what it looked like, not that exact 40,000 lines or few thousand lines of that entire corpus. We're trying to generate text that's like it. So that's the entire reason, or at least most of the reason, why we use train and val splits.

You might be wondering why this is even called the bigram language model, and I'm going to show you how that works right now. We go back to our whiteboard here. I've drawn a little sketch. We have this piece of content, the word hello. We don't have to encode it as any integers right now, we're just working with characters. Pretty much, we have two: bi means two, the bi- prefix means two, so we're going to have a bigram. There's nothing before an h in this content, so we just assume that's the start of content, and that's going to point to an h. So h is the most likely to come after the start. Then given an h, we're going to have an e. Given an e, we're going to have an l. Given an l, we're going to have another l. And then l leads to o. There are going to be some probabilities associated with these. That's pretty much how it's going to predict right now: it's only going to consider the previous character to predict the next. Given this one, we predict the next, so there are two, which is why it's called a bigram language model. Ignore my terrible writing here, but we're actually
going to go into how we can train the bigram language model to do what we want, how we can implement this in an artificial neural network and train it.

We're going to get into something called block size, which is pretty much just taking a random snippet out of this entire text corpus, just a small snippet, and we're going to make some predictions and some targets out of that. Our block is just a bunch of encoded characters, or integers, for which we have predictions and targets. Let's say we take a block size of five. So we have this tiny little tensor of five integers, and these are our predictions. Given some context right here, we're going to be predicting these. Then we have our targets, which would be offset by one. Notice how here we have a 5, and then here the 5 is outside, and this 35 is outside here and now it's inside. To get the targets, we just take that block from the predictions and offset it by one. So we're going to be accessing the same indices: at index 0 of the predictions is 5, and at index 0 of the targets is 67, so 67 is following 5 in the bigram language model. That's pretty much all we do. We look at how far the prediction is from the target, and then we can optimize for reducing that error.

The most basic Python implementation of this, with character-level tokens, would be simply this right here. We would take a little snippet, pretty much just from the start, all the way from the start of the snippet up to block size, so five. Ignore my terrible writing again. And then this one would be from one up to block size plus one, so five plus one. So it would be up to six, right?
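The two slices described above can be written out as a short sketch. The values 5, 67, and 35 come from the whiteboard example; the other integers in `data` are made up to fill out the block, and plain lists are used here in place of tensors.

```python
# Toy encoded data; only 5, 67 and 35 come from the whiteboard example,
# the remaining values are placeholders.
data = [5, 67, 21, 58, 40, 35, 12, 9]
block_size = 5

x = data[:block_size]        # inputs:  positions 0 .. block_size - 1
y = data[1:block_size + 1]   # targets: the same window shifted right by one

pairs = []
for t in range(block_size):
    context = x[t]   # a bigram model only looks at the single previous token
    target = y[t]
    pairs.append((context, target))
    print(f"when input is {context}, target is {target}")
```

Index 0 of `x` is 5 and index 0 of `y` is 67, so the first pair says that 67 follows 5, and the 35 that sat just outside the input window is the last target.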
And that's pretty much all we do. This is exactly what it's going to look like in the code. I've written some code here that does exactly what we just talked about, in Python. I've defined this block size equal to 8, just so you can see what this looks like on a slightly larger scale. And it's just what we wrote right there, in the Jupyter notebook: position zero up to block size, and then offset by one, so position one up to block size plus one. We wrote that in here, x as our predictions and y as our targets, and then a little for loop to show what the predictions and the targets are. So this is what this looks like in Python.

Great, we can do predictions, but this isn't really scalable yet. This is sequential. Sequential is another way of describing what the CPU does. A CPU can do a lot of complex operations very quickly, but it only happens sequentially: it's this task, and then this task, and this task. But with GPUs, you can do much simpler tasks, very quickly, in parallel. We can do a bunch of very small, not computationally complex computations on a bunch of different little processors that aren't as good, but there are tons of them. So what we can do is take each of these little blocks, stack them, and push these to the GPU to scale our training a lot.

I'm going to illustrate that for you right now. Let's say we have a block. A block looks like this, and we have some integers in between here. So this is a block. Now if we want to make multiple of these, we're just going to stack them. We make another one, another one, another one. So let's say we have four batches, or sorry, four blocks. We have four different blocks that are stacked on top of each other, and we can represent this as a new hyperparameter called batch size. This is going to tell us how many of these sequences we can process in parallel. So the block size is the length of each sequence, and the batch size is how many of these we're doing at the same time. This is a really good way to scale language models, and without these you can't really expect any fast training or good performance at all.

So we just went over how we can use batches to accelerate the training process, and it just takes one line to do this, actually. All we have to do is call this little function here, torch.cuda.is_available. We check if the GPU is available based on your CUDA installation, and if it's available, we set the device to cuda, else cpu. We print out the device here. That's going to run, and we get cuda. That means we can use the GPU for a lot of our processing here.

While we're here, I'm going to move this hyperparameter, block size, up to the top. And then we're going to use batch size, which is how many blocks we're doing in parallel, and we're just going to make this four for now. So these are our two hyperparameters that are very important for training, and you'll see why these become much more important later, when we scale up the data and use more complex mechanisms to train and learn the patterns of the language based on the text that we give.

If the new Jupyter notebook doesn't work right away, I'd recommend hitting Ctrl+C to cancel this. Hit it a few times, it might not work the first time. It'll shut down, and you just run jupyter notebook again and press enter. After this is done, you should be able to restart that and it will work, hopefully. There we go. So I can go ahead and restart and clear output, and we can run that. See, we get, boom. So, awesome.
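To make the two hyperparameters concrete, here is a hedged sketch of stacking random blocks into a batch and picking a device. The function name `get_batch`, its `rng` parameter, and the use of plain lists are assumptions made for illustration. The device check uses `torch.cuda.is_available()` as described above, wrapped in a `try` so the sketch still runs where PyTorch is not installed.

```python
import random

block_size = 8   # length of each sequence
batch_size = 4   # how many sequences are processed together


def get_batch(data, rng=random):
    """Sample batch_size random blocks and their shifted-by-one targets."""
    # The highest valid start leaves room for block_size inputs plus one target.
    last_start = len(data) - block_size - 1
    starts = [rng.randint(0, last_start) for _ in range(batch_size)]
    x = [data[i:i + block_size] for i in starts]
    y = [data[i + 1:i + block_size + 1] for i in starts]
    return x, y


# Pick the GPU when PyTorch and CUDA are both usable, else fall back to the CPU.
try:
    import torch
    device = "cuda" if torch.cuda.is_available() else "cpu"
except ImportError:
    device = "cpu"

data = list(range(100))  # stand-in for the encoded text
x, y = get_batch(data)
print(device)
print(len(x), len(x[0]))  # batch_size rows, each block_size long
```

With tensors instead of lists, the rows would be stacked into one `batch_size` by `block_size` tensor and moved to `device` so the GPU can process all four blocks in parallel.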
now let's try to do some actual cool pie torch stuff so we're going to go ahead and import uh torch here and then uh let's go ahead and try this uh Rand int feature so we go Rand in uh we'll do equals torch. randint and then let's say we go minus uh 100 100 to 100 and then in Brackets we go uh six just like that so if we want to print this out here uh or we can just go rant like that uh could run this block first good and boom so we get a tensor type and all these numbers are we have we have six of them so one 2 3 4 5 6 and they're between 100 and 100 so we're going to have to keep this in mind right here when we're getting our random batches from this giantex Corpus so let's try out a new one let's just try uh we can make we can make tensors we've done this before so we could do tensor equals torch. tensor and we could go um 0.1 [Music] 1.2 uh here I'll just copy and paste one right here here so we do this boom and we can just do tensor and we'll get exactly this so boom we get a uh 3x 2 Matrix now we're going to try a different one called zeros so zeros is just torch. zeros and then inside of here we can just do uh the dimensions or the shape of this so uh 2x3 and then we can just do zeros and then go ahead and run that so we get a 2x3 of zeros and these are all floating Point numbers by the way um maybe we could try ones now I know ones is pretty fun on so we go torch torch. On's it's pretty much the same as zeros uh we could just do like maybe 3 by four and then print that ones out so you have a 3x4 of ones sweet so what if we do input equals torch. empty uh can make this 2x3 but so these are interesting these are these are pretty much uh a bunch of very either very large or very small numbers um haven't particularly found a use case for this yet but just another feature that P torch has uh we have arrange so we go arrange equals torch. 
arrange and we can do like five for example just do rrange so now we have a tensor um just sorted uh zero or rather starting at zero up to four so five just just like that um do line space equals uh torch. line line space spelling is weird do three 10 and then steps for example equals 5 this all makes sense in a second here go run and we get a line space so steps equals five so we have five different ones boom boom boom boom boom and we go all the way from 3 to 10 so pretty much getting all of the constant increments from three all the way up to 10 over five steps so uh you're doing you're basically adding the same amount every time so 3 + 1.75 is 4.75 then plus another uh 1.75 is 6.5 and then 8.25 and then 10 right so just over five steps uh we want to find what that constant increment is so that's a pretty cool one um and then we have we'll do log space which is interesting log space equals torch. logspace and then we'll go uh start uh start equals -10 end equals 10 uh the these are these are both start and end so you can either put these here uh you can either put the start with them start equals or you don't have to uh it's honestly up to you and then uh we can put our steps again so steps equals maybe five let's go and run that or oops to to put log space there so we get that so we start at um 1 the -10 and then we just do this in little increments here so it goes 10 5 0 + 5 10 just over five steps so that's pretty cool um what else do we have here so we have I torch. 
eye. I just have all these on my second screen here, a bunch of examples written out, and we're just visualizing what these can do. You might even have your own creative little sparks of thought and find something else you can use these for in your own personal projects, or whatever you want to do. We're just experimenting with what we can do with the basics of PyTorch and some of its very basic functions.

So we do eye and print it out, and we get pretty much just a diagonal line of ones. We passed in five, so you get a 5x5 matrix: the identity matrix. It's reduced row echelon form; I don't know how to pronounce it, but that's pretty much what it looks like. Pretty cool stuff.

Let's see what else we have. We have empty_like: torch.empty_like(a), and we'll make a equal to torch.empty with shape 2 by 3 and data type torch.int64, so 64-bit integers. Let's see what happens here. Boom. So that's pretty cool.

What else do we have? Yes, we can do timing as well. I'm going to erase all of these; you can scroll back in the video, look at them, and experiment a little more than I've done, maybe modify them a bit. But I'm actually going to delete all of these here. Then we can set the device: device = 'cuda', and we're going to switch this over to the CUDA GPT environment, 'cuda' if torch.
cuda.is_available(), and then else 'cpu'. Print out our device, run this: cuda. Sweet. So we're going to try doing stuff with the GPU now compared to the CPU, and really see how much of a difference CUDA, or the GPU, makes in comparison to the CPU when we change the shape and dimensionality and run different experiments with a bunch of different tensors.

In order to actually measure the difference between the GPU and the CPU, I just imported a library called time. This comes with Python, so you don't have to install it manually. Whenever we call time.time() with parentheses, it takes a snapshot of the current time. So start_time will be right now, and end_time, maybe 3 seconds later, will be right now plus 3 seconds. If we subtract start_time from end_time, we get a 3-second difference, and that's the total elapsed time. This little number here, this four, is just how many decimal places we print.

I can go ahead and run this. time is not defined; let's run that import first. Boom, it takes almost no time at all. We can increase the decimal places to 10 if we want and run that again. Again, we're pretty much making a 1 by 1 matrix, it's just a zero, so we're not really going to get anything significant from that.

Anyway, for actually testing the difference between the GPU and the CPU, what we care about is the iterative process: the forward pass and backpropagation through the network. That's primarily what we're trying to optimize. Actually pushing all these parameters and model weights to the GPU isn't really going to be the problem. It'll take maybe a few seconds, at most maybe 30 seconds, and that's not going to be any time at
all in the entire training process. So what we want to do is see which is better: NumPy on the CPU, or torch using CUDA on the GPU.

I have some code for that right here. We're going to initialize a bunch of matrices, or sorry, tensors. They're basically random ones: 10,000 x 10,000, all random floating-point numbers. We push these to the GPU, we have two of them, and we do the same thing for NumPy. To multiply matrices with PyTorch, we use this @ symbol here. We multiply these and get a new random tensor, then we stop the timer. We do the same thing over here, except we use np.multiply [which is element-wise; np.matmul is NumPy's counterpart to @, so the two timings are not strictly like for like].

If I run these, it takes a few seconds to initialize, or not even a few seconds, and then, see, look at that: the GPU took a little while, and the CPU didn't take as long. This is because the shape of these matrices is not really that big; they're just two-dimensional. This is something the CPU can do very quickly because there's not that much to do.

But let's say we want to bump it up a notch. If we go to 100, 100, 100, and then maybe throw in another 100 there, hopefully that works, and we do the same thing, we'll just paste this. Now if we run this again, you'll see that the GPU actually took less than half the time the CPU did. This is because there's a lot more going on here; there's a lot more simple multiplication to do.

The reason this is so significant is that when we have millions or billions of parameters in our language model, we're not going to be doing very complex operations between all these tensors. They're going to be very similar to what we saw here. The dimensionality and shape are going to be very similar to what we're seeing
right now, maybe three or four dimensions, and it's going to be very easy for a GPU to do this. They're not complex tasks that we need the CPU for; they're not very hard at all. So when we give this task to parallel processing, it's going to be a ton quicker. You're going to see why this matters later in the course with some of the hyperparameters we're going to use, which I'm not going to get into quite yet. Over the next little bit you'll see why the GPU matters a lot for increasing the efficiency of that iterative process. So this is great: now you know a little more about why we use the GPU instead of the CPU for training efficiency.

There's actually another tool we can use, the notebook cell magic %%time. I call it "percentage percentage time"; I don't know if that's exactly what you're supposed to call it, but that's what it is. Pretty much what it does is time how long it takes to execute a block. We can see here there's CPU times, 0 ns (the n is for nano; a nanosecond is a billionth of a second), and then wall time. CPU time is how long it takes to execute on the CPU, the time it's actually doing operations for. Wall time is how long it takes in real time: how long you have to wait until it's finished. The only thing CPU time doesn't include is waiting. In an entire process there's going to be some operations and some waiting; wall time includes both, and CPU time is just the execution.

Let's continue with some of the basic PyTorch functions. I've written some stuff down here, so we're going to go over torch.stack, torch.multinomial, torch.tril, triu (I don't think that's how you pronounce it, but we'll get into that more), transposing, linear, concatenating, and the softmax function. Let's start off with torch.
multinomial. This essentially samples from a probability distribution over the indices you give it. We have probabilities here, 0.1 and 0.9. These numbers add up to one to make 100%; 100% is one whole. So I have 10% and 90%: the 10% is at index zero, so there's a 10% chance we get a zero and a 90% chance we get a one. If I run these cells up here and give this a second to do its thing, you can see that we have num_samples set to 10, so it gives us 10 draws, and all of them are ones. If we run it again we might get slightly different results. See, now we have some zeros in there, but the zeros have a very low probability of happening, as a matter of fact exactly a 10% probability. We're going to use this later in predicting what word comes next.

Let's move on to torch.cat, short for concatenate. This will essentially concatenate two tensors into one. I initialize this tensor here, torch.
tensor with 1, 2, 3, 4. It's one-dimensional, and we have another tensor here that just contains 5. If we concatenate 1, 2, 3, 4 and 5, we get 1, 2, 3, 4, 5. We just combine them together, and this is what comes out in the end. I run that: 1, 2, 3, 4, 5, perfect.

We're actually going to use this when we're generating text given a context. It's going to start from zero. We use our probability distribution to pick the first one, and then, based on the first one, we predict the next character. Once we have predicted that, we concatenate the new one with the ones we've already predicted. So we have maybe 100 characters over here, and the next character we're predicting is over here; we just concatenate these, and by the end we have all of the integers that we've predicted.

Next up we have torch.tril. What tril stands for is triangle lower. It's going to be in a sort of triangle formation like this, with the diagonal going from top left to bottom right. You're going to see a little more of why later in this course, but this is important because when you're trying to predict integers, or next tokens in the sequence, you only know what's in the current history. We're trying to predict the future, so giving away the answers from the future isn't what we want to do at all. Maybe we've just predicted one and haven't predicted the rest yet, so we set all of those to zero. Then we've predicted another one, and the remaining ones are still zero. These are talking to each other in history, and as our predictions add up, we have more and more history to look back to and less future. Basically, the premise of this is making sure we can't communicate with the answer: we can't predict while knowing what the answer is,
just like when you write an exam: they don't give you the answer sheet, so you have to know, based on your history of knowledge, which answers to predict. That's all that's going on here.

And we have, I mean you could probably guess this, triu, triangle upper. So we have all the upper ones: these are on the lower side, and these are on the upper side. Same concept there.

Then we have masked_fill. This one's going to be very important later, because in order to actually get to this point, all we do is exponentiate every element in here. If you exponentiate zero, it becomes one; if you exponentiate negative infinity, it becomes zero. All that's going on is that we take approximately 2.71, the constant e that the exp function uses, and raise it to whatever power is in that current slot. We have a zero here, so 2.71 to the zeroth is equal to 1; 2.71 to the 1 is equal to 2.71; and 2.71 to the negative infinity is of course zero. That's pretty much how we get from this to this; we're simply masking these over. So that's great. I sort of showcase what exp does: we take this output right here and plug it into here, so negative infinity goes to 0 and 0 goes to 1. That's how we get from here to here.

Now we have transposing. Transposing is when we sort of flip or swap the dimensions of a tensor. In this case I initialize a torch.
zeros tensor with dimensions 2 x 3 x 4, and we can use the transpose function to flip any two dimensions we want. What we're doing is looking at the zeroth dimension (it sounds weird not to say first dimension) and swapping the zeroth position with the second. So 0, 1, 2: we're swapping this one with this one. As you would probably guess, the resulting shape is going to be 4, 3, 2 instead of 2, 3, 4. You can take a look at this and see which ones are being flipped; those are the dimensions, and that's the output. Hopefully that makes sense.

Next up we have torch.stack. We're actually going to use torch.stack very shortly, when we're getting our batches. Remember before, when I was talking about batch size and how we take a bunch of these blocks together and stack them? We have a giant length of integers, or tokens, and all we're doing is stacking blocks together to make a batch. That's pretty much what we're going to end up doing, and that's what torch.stack does. We can take something that's one-dimensional and stack it to make it two-dimensional. We can take something that's two-dimensional and stack it a bunch of times to make it three-dimensional. Or, say, three-dimensional: we have a bunch of cubes and we stack those on top of each other, and now it's four-dimensional. Hopefully that makes sense. All we're doing is passing in each tensor that we're going to stack, in order. This is our little output here, and that's pretty much all it is.

The next function is going to be really important for our model, and we're going to be using it the entire time, from start to finish. It's called nn.Linear. It's pretty much a layer built on the nn.
Module, and this is really important because, as you're going to see later on, nn.Module contains anything that has learnable parameters. When we apply a transformation to something, when we apply a weight and a bias (in this case the bias will be False), under nn.Module it will learn those parameters. It becomes better and better, and it basically trains based on how accurate they are and how close certain parameters bring it to the desired output. So pretty much anything with nn.Linear is going to be very important, and it's going to be learnable.

We can see over here the torch.nn page in the docs. We have containers and a bunch of different layers, like activations; pretty much just layers, that's all it is. These are important, and we're basically going to learn from them. You're going to see why we use something called keys, values, and queries later on, and why those are important. If that doesn't make sense yet, let me illustrate it for you right now.

I drew this out here. If we look back at our example, we initialize a tensor that is 10, 10, and 10. What we're going to do is a linear transformation; this "linear" stands for linear transformation. So we're just going to apply a weight and a bias through each of these layers. We have an input and an output: x is our input, y is our output, and both are of size three. We just need to make sure that these line up. For more context, nn.Sequential is sort of built off nn.Linear, so let's search that up right now. This will make sense in a second, and it's also good prerequisite knowledge for machine learning in general. So let's see, nn.
Sequential. It doesn't show it here, but pretty much: let's say you have two input neurons and maybe one output neuron, with a bunch of hidden layers in between. Let's say the hidden layers have 1, 2, 3, 4 neurons and then 1, 2, 3. You need to make sure that the input aligns with this hidden layer, this hidden layer aligns with this one, and this one aligns with the output. So you're going to have a transformation of 2 to 4, then one of 4 to 3, and then a final one of 3 to 1. You just need to make sure that these line up: we have 2 and 4, and then this 4 is carried on from that output into the next layer. This makes sure our shapes are consistent, and of course if the shapes don't work out, the math simply won't work. So we need to make sure that our shapes are consistent.

If that didn't make sense, I know I'm not super great at explaining the architecture of neural nets, but if you're really interested you could use ChatGPT, of course; that's a really good learning resource. ChatGPT, GitHub discussions maybe, or just looking at documentation. If you're not good at reading documentation, you could take some little keywords from there, like "a sequential container". Well, what is a sequential container? You can ask ChatGPT those types of questions, sort of reverse engineer the documentation, and figure things out step by step. It's really hard to
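
To make the shape-matching rule from the transcript concrete, here is a minimal sketch, not the course's own code. It assumes PyTorch is installed and chains three nn.Linear layers with the sizes from the drawing (2 to 4, 4 to 3, 3 to 1) inside nn.Sequential. The bias=False setting mirrors the remark that the bias is off, and the input values and the deliberately broken second model are made up for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # reproducible random weights

# Each layer's in_features must equal the previous layer's out_features.
model = nn.Sequential(
    nn.Linear(2, 4, bias=False),  # 2 inputs -> 4 hidden
    nn.Linear(4, 3, bias=False),  # 4 hidden -> 3 hidden
    nn.Linear(3, 1, bias=False),  # 3 hidden -> 1 output
)

x = torch.tensor([[10.0, 10.0]])  # one sample with 2 features (illustrative values)
y = model(x)
print(y.shape)  # torch.Size([1, 1])

# A deliberately mismatched chain: 4 outputs feed a layer that expects 3 inputs.
broken = nn.Sequential(nn.Linear(2, 4), nn.Linear(3, 1))
try:
    broken(x)
    mismatch_raised = False
except RuntimeError as err:
    mismatch_raised = True
    print("shape mismatch:", err)
```

The second model fails at the forward pass rather than at construction, which is why shape errors of this kind tend to show up only once data is pushed through the network.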

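The exponentiation step described in the transcript (zero becomes one, negative infinity becomes zero) can be checked without PyTorch. The sketch below is an illustration in plain Python using math.exp, not the course's code. It builds by hand the lower-triangular pattern that the transcript attributes to torch.tril and masked_fill, for an assumed size of 3 x 3.

```python
import math

n = 3  # illustrative sequence length (assumption)
NEG_INF = float("-inf")

# Lower-triangular pattern: position (row, col) is visible only when col <= row.
# Visible slots hold 0.0; "future" slots are masked with negative infinity.
scores = [[0.0 if col <= row else NEG_INF for col in range(n)] for row in range(n)]

# Exponentiate every element: e**0 is 1 and e**(-inf) is 0.
weights = [[math.exp(value) for value in row] for row in scores]

for row in weights:
    print(row)
# [1.0, 0.0, 0.0]
# [1.0, 1.0, 0.0]
# [1.0, 1.0, 1.0]
```

Each row can only "see" itself and earlier positions, which is the exam analogy from the transcript: later answers are hidden by forcing their weight to zero.
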
Original Description

Learn how to build your own large language model, from scratch. This course goes into the data handling, math, and transformers behind large language models. You will use Python. ✏️ Course developed by @elliotarledge 💻 Code and course resources: https://github.com/Infatoshi/fcc-intro-to-llms Join Elliot's Discord server: https://discord.gg/pV7ByF9VNm Elliot on X: https://twitter.com/elliotarledge ❤️ Try interactive Python courses we love, right in your browser: https://scrimba.com/freeCodeCamp-Python (Made possible by a grant from our friends at Scrimba) ⭐️ Contents ⭐️ (0:00:00) Intro (0:03:25) Install Libraries (0:06:24) Pylzma build tools (0:08:58) Jupyter Notebook (0:12:11) Download wizard of oz (0:14:51) Experimenting with text file (0:17:58) Character-level tokenizer (0:19:44) Types of tokenizers (0:20:58) Tensors instead of Arrays (0:22:37) Linear Algebra heads up (0:23:29) Train and validation splits (0:25:30) Premise of Bigram Model (0:26:41) Inputs and Targets (0:29:29) Inputs and Targets Implementation (0:30:10) Batch size hyperparameter (0:32:13) Switching from CPU to CUDA (0:33:28) PyTorch Overview (0:42:49) CPU vs GPU performance in PyTorch (0:47:49) More PyTorch Functions (1:06:03) Embedding Vectors (1:11:33) Embedding Implementation (1:13:06) Dot Product and Matrix Multiplication (1:25:42) Matmul Implementation (1:26:56) Int vs Float (1:29:52) Recap and get_batch (1:35:07) nnModule subclass (1:37:05) Gradient Descent (1:50:53) Logits and Reshaping (1:59:28) Generate function and giving the model some context (2:03:58) Logits Dimensionality (2:05:17) Training loop + Optimizer + Zerograd explanation (2:13:56) Optimizers Overview (2:17:04) Applications of Optimizers (2:18:11) Loss reporting + Train VS Eval mode (2:32:54) Normalization Overview (2:35:45) ReLU, Sigmoid, Tanh Activations (2:45:15) Transformer and Self-Attention (2:46:55) Transformer Architecture (3:17:54) Building a GPT, not Transformer model (3:19:46) Self-Attention Deep Dive (3:25:

This video tutorial teaches you how to build a large language model from scratch using Python, covering the basics of large language models, machine learning, and deep learning. You will learn how to design and implement a large language model, optimize it for performance, and apply deep learning concepts to natural language processing.

Key Takeaways

Create a project directory and initialize everything
Set up a data pipeline and use logic analogies and step-by-step examples
Install PyTorch and CUDA for GPU acceleration
Create a tensor to represent the text data of the Wizard of Oz
Encode text into a torch tensor for easier PyTorch processing
Split data into training and validation sets
Use batches to accelerate the training process
Apply linear transformation to input and align input and output shapes
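
As a rough illustration of the batching takeaway, the sketch below combines two functions the transcript says will be used for batches: torch.randint to pick random start positions and torch.stack to pile blocks into a batch. The name get_batch appears in the course contents, but this body, the stand-in data, the sizes, and the one-position shift between inputs and targets are assumptions for illustration rather than the course's implementation.

```python
import torch

torch.manual_seed(0)  # reproducible sampling

data = torch.arange(100)  # stand-in for an encoded text corpus (assumption)
block_size = 8            # tokens per block (illustrative value)
batch_size = 4            # blocks per batch (illustrative value)

def get_batch(data, block_size, batch_size):
    # Random start offsets; the upper bound leaves room for the shifted target block.
    starts = torch.randint(0, len(data) - block_size, (batch_size,)).tolist()
    # Stack 1-D blocks into a 2-D tensor of shape (batch_size, block_size).
    x = torch.stack([data[i:i + block_size] for i in starts])
    # Targets are the same blocks shifted one position to the right.
    y = torch.stack([data[i + 1:i + block_size + 1] for i in starts])
    return x, y

x, y = get_batch(data, block_size, batch_size)
print(x.shape, y.shape)
```

Stacking turns several one-dimensional blocks into a single two-dimensional tensor, so one call processes the whole batch instead of looping over blocks.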

💡 Building a large language model from scratch requires a deep understanding of machine learning, deep learning, and natural language processing concepts, as well as the ability to design and implement a model that can effectively process and generate human-like language.
