LLM from Scratch Tutorial – Code & Train Qwen 3

freeCodeCamp.org · Beginner ·🧠 Large Language Models ·10mo ago

Skills: LLM Foundations 90% · LLM Engineering 80%

Key Takeaways

This video tutorial demonstrates how to create a Large Language Model (LLM) from scratch, specifically building Qwen 3, using Google Colab and GitHub for code setup and training.

Full Transcript

Qwen 3 is the cutting-edge series of large language models developed by Alibaba Cloud's Qwen team. The LLM is known for its advanced reasoning, multilingual support, and efficient hybrid thinking and non-thinking modes. In this course, you'll learn to build intelligence from the ground up, training Qwen 3 from scratch, one line at a time. You'll see how gradients flow, models learn, and AI comes alive in real time, gaining raw, unfiltered machine learning mastery. This comprehensive course will guide you through the details of Qwen 3's architecture and implementation. By the end, you'll have an understanding of how these advanced models function. Vuk Rosić developed this course.
We will code and train the Qwen 3 large language model from scratch, and you will understand all of it: the entire architecture, the code, and the logic and intuition behind it. Everything is explained step by step, and there are some animations that will make it easier to understand. So we have this prompt in interactive mode, "Today I will have a walk and", and let's see what it generated. There is a bit of coherence. I just trained it for 15 minutes on the Google Colab free GPU, so that's not enough compute, but if you train this for a few hours or use better GPUs, it will be a lot better. "Today I will have a walk and discuss the following strict rules of the following command. Number one IK contra affair happen during the law rather than anything else." I think if you let this train for a few more hours it will actually make a lot more sense. This will just train quickly for a few minutes.
If you look at the Qwen technical report, you can see its architecture, and you can also find the architecture in the Hugging Face Transformers library. So this is going to be the most advanced transformer plus all of the specifics that Qwen has, for example grouped-query attention and SwiGLU activation for the feed-forward layers.
One thing I will also add is the Muon optimizer for 2D matrices. I didn't see this in the technical report, but this is the new best optimizer. It trains better, it trains faster, and I expect it to be in the next iterations of Qwen because everybody will start using it. So let's scroll down. I will be focusing on Qwen-specific things. If you are not 100% comfortable with the basic attention mechanism or tokenizers, I recommend you watch this course. If you click here and here, you can go to the beginning, where I explain both tokenizers and the attention mechanism 100% visually, with these numbers and these vectors, so you will understand everything. I've been making this video for a long time to explain it well, and the code as well. You can even watch this entire course because it's very well explained. We will use the entire thing in our Qwen, so by understanding this you will understand almost everything that we will do in Qwen, because language models are very similar, but this course is definitely more beginner friendly.
You can also go below and click this Andrej Karpathy course. This is a very famous course. You can watch it if you want. Some people find it a bit dense. I recommend you watch my course first because I think it's more beginner friendly, and then you can watch this one as well. It's very good, but you will need some time to understand it. So you can watch all three.
We will use a GPU, and we can start with the imports. You can go here, change the runtime type, and select this GPU. Just one little warning: this notebook will ask you for a Hugging Face token. You can just press cancel if it asks, because you don't need it. So I commented all of the imports.
For example, this one is important. It will automatically change the precision of your variables, of your data in memory. Now this GPU is a bit older, so it only supports float16. There is no BF16, but this will work well on this GPU or on other GPUs, even newer ones. You can read this if you want, but I will skip to the next one.
We will set random seeds. We want to control the randomness of torch and of CUDA. When we have different training runs and we want to compare different architectures, we don't want randomness to influence the result. There will still be some randomness that comes from the GPU hardware, which is very difficult or impossible to fix, but it's very minimal and it doesn't matter.
Below we go on to the model configuration. Here I would expect you to understand basic terms such as the attention mechanism. You can go and watch these other courses, although I will try to explain this as well. Inner dimension of the model: this is the token embedding vector dimension. Number of heads in the self-attention mechanism.
Number of decoder layers. I will be using these technical terms, and if you don't understand them, I recommend you watch those courses that I mentioned, or you can just search on YouTube for the self-attention mechanism; there are countless courses. Number of decoder layers: one layer is attention and feed-forward, and then the second layer is again attention and feed-forward, so they just stack. And this is the inner dimension of the feed-forward network. It should be three times the model embedding, or no, four times. Usually it's four times, but it can be other sizes, not just four times.
I can show you here. A decoder-only transformer is only this part, and you can disregard this middle multi-head cross-attention. So we just have this multi-head attention and feed-forward, and that's one layer, and then you stack them onto each other. In the end, after all of the layers, we have linear and softmax. So this feed-forward layer is actually this feed-forward here. You've got expansion up to four times, and then you've got contraction back to exit the feed-forward layer.
Batch size is 24. This is just a bit lower for this Google Colab free GPU. And then maximum steps is five. This is very low; I was just testing whether my whole setup works. You can put this to 50, but if you want this to learn a little bit, we can put it to 2,000. Maybe this will take 1 or 2 minutes on this GPU. Of course, the time it takes depends on the model size, not just the number of steps, so a bigger model will take longer for every step. Once you want to seriously train it, you can add maybe two zeros here. That would train for too long on Google Colab because this GPU is a T4, which is very old, but it's an option if you have a newer GPU. I think 20,000 will also give good generation. But even 2,000 will be somewhat legible. The text will make some sense.
So we will just keep it like this, for faster training and to test it. Maybe later we can increase it, but for this tutorial just 2,000. It will take 3 or 4 minutes to train.
Now, a few Qwen 3 specific parameters. There are eight heads, but there are just four key and value heads. These eight are query heads, so two query heads will share one key and value head. This is for faster computation. It was figured out a long time ago that we don't need eight key and value heads; we can share some. During the computation, these four key and value heads will be duplicated, or repeated, to match the number of query heads, eight. So it will be like this: key head A, key head A repeated, so the first two query heads each get a copy of the same key head, and then key head B, key head B, and so on. Query head one and query head two are different, but they attend to the same key head A and value head A, which are just repeated.
This is going to reduce KV cache memory massively, by half. When we have a sequence of tokens, we calculate the key and value heads for every token and we save them. But we calculate the query only for the last token; we don't save queries for the other tokens. We use that query to check against all the other keys and then get the values. So we need to save the keys and values of every other token in this cache, and we reduce all of this memory by half.
Qwen can also have sliding window attention. This is for very long sequences. You can have full attention over the entire sequence in one transformer layer, then in the next layer attend to just the last 4,000 tokens, then use full attention in the next layer, and so on. That's one way to do it.
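The "half the KV cache" claim above can be checked with simple arithmetic. The sketch below is plain Python, not the notebook's code; the function name, the layer count, the head dimension, and the 2-byte (float16) value size are assumptions chosen only for illustration.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Bytes needed to cache keys and values for every token in every layer."""
    # Two cached tensors (K and V) per layer, each shaped
    # (batch, n_kv_heads, seq_len, head_dim).
    return (2 * n_layers * batch_size * n_kv_heads
            * seq_len * head_dim * bytes_per_value)


# Hypothetical sizes, for illustration only.
full_mha = kv_cache_bytes(n_layers=6, n_kv_heads=8, head_dim=48, seq_len=512)
grouped = kv_cache_bytes(n_layers=6, n_kv_heads=4, head_dim=48, seq_len=512)

print(f"8 KV heads: {full_mha} bytes")
print(f"4 KV heads: {grouped} bytes")
print(f"ratio: {grouped / full_mha}")
```

The cache size is linear in the number of key/value heads, so going from eight to four halves it regardless of the other sizes.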
But in this case we will just put a larger value than our actual sequence length, which disables it. I think it would require more compute and it's not necessary because we have a very small sequence length. But you can look into this. If you want to implement it, you can just use this function from Hugging Face Transformers, create_sliding_window_causal_mask, so you can do it quickly and easily. These functions are usually very well implemented. They have custom kernels and they're very fast; it's something that the best engineers implemented. So I recommend using PyTorch functions or Hugging Face functions instead of coding your own, unless you know exactly how to make it fast and good, or you know some trick to make it faster.
Attention bias. When you are initializing these query, key, and value linear layers, you can have a bias there as well. To the best of my understanding, biases are not used in most cases in the original transformer, but I've heard some people say that in some specific cases, maybe in pre-training, it helps and then you disable it later. I'm not sure about this, so I will keep it false. If anybody knows, you can comment below. Maybe the new OpenAI open-source model GPT-OSS used attention bias, but I'm not sure. And then this small RMS norm epsilon is a very small number, just so we can later prevent division by zero.
I think we can speed up now. Gradient accumulation steps: instead of processing one batch and then updating the weights, you process four batches, you accumulate all of the gradients, all of the weight changes, you add them, and then you update the weights. This simulates a larger batch size. But in my experiments, it's better if you can crank up the batch size. It's faster.
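To make the gradient accumulation idea concrete, here is a minimal sketch in plain Python with a one-parameter model. The model, the data, and the function name are made up for illustration; the notebook does this with PyTorch tensors and an optimizer instead. It shows that four accumulated micro-batch gradients, each scaled by 1/4, match the gradient of one batch that is four times larger.

```python
def grad_mse(w, batch):
    """Gradient of the mean squared error for the toy model y_hat = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


# Toy (x, y) pairs, roughly y = 2x. Purely illustrative data.
data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2),
        (5.0, 9.8), (6.0, 12.1), (7.0, 14.0), (8.0, 15.9)]

w = 0.5
accumulation_steps = 4
micro_batch_size = len(data) // accumulation_steps

accumulated = 0.0
for step in range(accumulation_steps):
    start = step * micro_batch_size
    micro_batch = data[start:start + micro_batch_size]
    # Scale each micro-batch gradient so the running sum is an average.
    accumulated += grad_mse(w, micro_batch) / accumulation_steps

full_batch = grad_mse(w, data)
print(accumulated, full_batch)

# A single weight update then uses the accumulated gradient.
learning_rate = 0.01
w = w - learning_rate * accumulated
```

The equality holds here because all micro-batches have the same size; the cost is four forward and backward passes per weight update, which is why a genuinely larger batch is faster when it fits in memory.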
But if your GPU is too small or does not have enough memory, you can simulate a larger batch size by processing four batches and adding up all of the gradients.
Muon optimizer learning rate. We need a somewhat higher learning rate. This will make it learn faster, because Muon is good at fast learning. It already solves problems with some unstable big numbers, and it orthonormalizes update matrices. I'll talk about that a bit later.
Maximum number of tokens: if you have a better GPU you can increase this. Increasing this is maybe the best thing you can do to improve your training. You want to increase the maximum sequence length, but you also want to increase the batch size. You want to keep the max sequence length at a power of two, and you want to keep all of these numbers at powers of two if you can. They are not right now because the GPU is small, but if you can, you should. Also know that as you move up through these powers of two, your memory use will increase very quickly, so even 2,048 would be too much here. Then you can look at how many batches are processed at the same time and what the sequence length is. You can reduce these two parameters if you don't have enough memory, and if you cannot double the sequence length, you can increase the batch size a little bit. I think that once you decide on the architecture, the best improvement to training usually comes from increasing the maximum sequence length, but it's more complex and nuanced than that.
Anyway, we will return this to 512. You want to get this up and get the batch size up until you fill the memory, so you don't leave GPU memory wasted. It's difficult to know in advance how much memory this will take. So you just load this, start the training, check the memory, and then increase these parameters until you fill the memory.
And then these are not so important. Starting from here, this is how many documents we download from the dataset. You can increase this, but this is already a lot of documents, so you don't need to worry. Only if you are going to increase the number of steps by a lot do you maybe want to look into downloading more documents. You can also ask AI. I'm going to leave this whole code in one Python file, so you can copy all of the code or just run that file; it will do everything and you don't need to change anything. You can use Cursor or ask AI to help you with more data and explanations.
This is the evaluation interval. You can increase this if you don't want to see evaluations so often, but I'm training for only a few steps, so I want these evaluations to check the training progress.
Weight decay is just punishing large weights. We don't want weights to be large. We want weights to be maybe between zero and one, or up to 10 maybe, but that's also a bit large, because with a large weight a small input change will have a huge impact, since you are multiplying the weight by the input. You don't want a single weight to carry so much importance. You want a bunch of different weights to carry a similar amount of importance, and then the neural network can learn different patterns. Dropout and gradient clipping: dropout is for regularization, so the training does not overfit onto some data, and gradient clipping keeps the updates stable.
And let's look at this. We just need the model dimension to be divisible by the number of heads, and the number of heads to be divisible by the number of KV heads, and so on. Now, this is the function that's going to repeat key and value heads.
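The configuration walked through so far, including the divisibility checks just mentioned, could be collected in a small dataclass. This is a sketch, not the notebook's class: the field names are hypothetical, and the values marked "assumed" (model dimension, layer count, window size, epsilon) are not stated in the video. The other values are the ones mentioned above.

```python
from dataclasses import dataclass, field


@dataclass
class ModelConfig:
    d_model: int = 384                    # assumed: embedding dimension
    n_heads: int = 8                      # query heads
    n_kv_heads: int = 4                   # key/value heads shared by query heads
    n_layers: int = 6                     # assumed: number of decoder layers
    d_ff: int = 1536                      # assumed: 4 * d_model expansion
    batch_size: int = 24
    max_steps: int = 2000
    max_seq_len: int = 512
    gradient_accumulation_steps: int = 4
    sliding_window: int = 4096            # assumed: larger than max_seq_len disables it
    attention_bias: bool = False
    rms_norm_eps: float = 1e-6            # assumed: small constant against division by zero
    head_dim: int = field(init=False)
    n_kv_groups: int = field(init=False)

    def __post_init__(self):
        if self.d_model % self.n_heads != 0:
            raise ValueError("d_model must be divisible by n_heads")
        if self.n_heads % self.n_kv_heads != 0:
            raise ValueError("n_heads must be divisible by n_kv_heads")
        self.head_dim = self.d_model // self.n_heads
        # How many query heads share each key/value head.
        self.n_kv_groups = self.n_heads // self.n_kv_heads


config = ModelConfig()
print(config.head_dim, config.n_kv_groups)
```

Failing early in the constructor gives a readable error instead of a shape mismatch deep inside the attention code.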
As I said, you have four, and the first key-value head gets repeated two times because we have eight query heads. So the first and second query heads get the same key-value head, just repeated. After repeating, you have eight of both. I added some comments here. We have these four dimensions: batch, number of key-value heads, sequence length, and head dimension. If we add this None here, it will inject a dimension of size one. So this creates a dimension, and then we can replace that one with the number of repetitions. In this case that's two, because we have eight of these and four of these, so we need to repeat the four two times, and then we can multiply here to merge the repeats.
Here I have some optional exercises for you to understand this. They are not necessary for the language model. If you want, you can run them, or you can copy these cells into ChatGPT and ask it to explain them to you. You can use the new study mode in ChatGPT. It's very good.
Now we go to the Muon optimizer. This is going to be tough to understand. This is pure mathematics, but I'll try my best to explain. You can also check my channel and search for Muon optimizer; there are a bunch of videos, three or four more. So these are the neural network weights. You do the backpropagation and you get what's called the update matrix. The update matrix just gets subtracted from the weights, element-wise: each number in this matrix gets subtracted from the corresponding number here. The goal is to change the weights so the loss goes down, so the error of the neural network goes down. This is how a neural network learns: by updating the weights, making them better.
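The key/value head repetition described at the start of this part can be illustrated without tensors. The sketch below uses plain Python lists with letters standing in for whole heads; the function names are hypothetical. In PyTorch the same effect is typically obtained by inserting a size-one dimension, expanding it, and reshaping, which matches `torch.repeat_interleave` along the heads dimension.

```python
def repeat_kv(heads, n_rep):
    """Repeat every key/value head n_rep times, keeping copies adjacent."""
    return [head for head in heads for _ in range(n_rep)]


def kv_head_for_query(query_head_index, n_rep):
    """Index of the original key/value head that a query head attends to."""
    return query_head_index // n_rep


n_heads, n_kv_heads = 8, 4
n_rep = n_heads // n_kv_heads

kv_heads = ["A", "B", "C", "D"]          # stand-ins for four key/value heads
repeated = repeat_kv(kv_heads, n_rep)
print(repeated)

for q in range(n_heads):
    print(f"query head {q} -> key/value head {kv_heads[kv_head_for_query(q, n_rep)]}")
```

The copies stay adjacent (A, A, B, B, ...) rather than cycling (A, B, C, D, A, ...), which is what lets query heads one and two share head A.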
But we have some issues with this raw update matrix that gets subtracted. Some of the numbers there could be arbitrarily high, not because those high numbers would reduce the loss better, but simply because they are getting multiplied by some arbitrarily large input number. Remember, you are multiplying inputs by weights to get some output. So when you are calculating the gradient, you are calculating how much this particular weight changes the output. But this particular weight might be changing the output a lot just because some input number is arbitrarily large. For example, say you have cats, dogs, and fish, and you assign number 1 to cats, number 2 to dogs, and number 10,000 to fish. The number 10,000 multiplied by some weight will have a huge influence. So this weight will have a huge influence on the output just because of this randomly assigned 10,000. When we go through the backpropagation, we see that this weight gets a much larger update than the other weights. We actually want to normalize that. We want to squish that. We want to prevent that. This is what Newton-Schulz will do, what the Muon optimizer will do. I recommend you watch this backpropagation from scratch tutorial. It will explain this and help you a lot. I put the link above this cell.
But I can show you this. When you multiply some input vectors with weights, it does a linear transformation. You are multiplying a matrix with a matrix. Let's say these arrows are the inputs. You see how multiplying with a matrix will stretch, rotate, reflect; it will do everything. What we want to do is prevent this stretching, because this stretching happens when some weight is huge.
So we just want rotating vectors, not stretching. It's getting complex; this is from the tutorial backpropagation from scratch, but I will explain. Sometimes stretching is good for reducing loss. But the way I understand Muon, they figured out that most of the time stretching is happening not for good reasons, for loss-reduction reasons, but just because these input numbers are arbitrarily high. That's my understanding. So it's better to remove stretching altogether: even if stretching in some cases will improve the loss, it does a lot more damage. So they transform matrices just to rotate and not to stretch. That's the Muon optimizer, to the best of my understanding. It is a somewhat complex mathematical concept, and in future videos I will understand it even better and have better animations and ways of explaining it.
You can also watch this video on orthonormal matrices. An orthogonal matrix is one that rotates the vectors, and the normal part is about the norm, the length, the magnitude: when you square, sum, and take the root of the elements, you get something close to one. If you don't understand that, just remember that an orthonormal matrix will only rotate vectors, and that's what Muon does: just rotating instead of stretching. I'll leave a URL to this video as well. This is what an orthonormal matrix does in this other tutorial. You see, it just rotates. It doesn't squish or stretch.
So how do we know a matrix will just rotate? We look at the rows of the matrix separately (or the columns, but in this case rows). If we plot each row as a vector, all of the rows will be perpendicular to each other, 90° between every pair of rows. That matrix will just rotate. That's an orthogonal matrix.
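A quick numeric check of the "rotate, don't stretch" idea, in plain Python. The matrices and the vector are arbitrary examples, not values from the notebook: a 2D rotation matrix has perpendicular, unit-length rows and preserves the length of any vector it multiplies, while a scaling matrix changes it.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def norm(vector):
    """Euclidean length of a vector."""
    return math.sqrt(sum(a * a for a in vector))


theta = math.radians(30)
rotation = [[math.cos(theta), -math.sin(theta)],
            [math.sin(theta), math.cos(theta)]]
stretch = [[3.0, 0.0],
           [0.0, 0.5]]

v = [2.0, 1.0]

# Rows of the rotation matrix: dot product 0 (perpendicular), length 1.
row_dot = sum(a * b for a, b in zip(rotation[0], rotation[1]))
row_lengths = [norm(row) for row in rotation]

print("row dot product:", row_dot)
print("row lengths:", row_lengths)
print("length before:", norm(v))
print("length after rotation:", norm(matvec(rotation, v)))
print("length after stretch:", norm(matvec(stretch, v)))
```

Both conditions matter: perpendicular rows alone still allow stretching if the rows are not unit length, which is why the transcript talks about orthonormal matrices.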
I will be making more tutorials on this, and I will explain it even better in the next courses, when I figure out how to explain it better and have more animation tools.
So what is this Newton-Schulz? Imagine you have any matrix and some function that transforms this matrix into an orthogonal matrix. It will just look at each row and make the rows perpendicular to each other. There are a few different ways to compute this theoretical function, but each of them is too computationally complex. This function, as well as many other functions, or maybe even every function, I'm not sure, can be approximated by a polynomial. A polynomial is just something like x² + x + 3 or x³ + x². It's just multiplying and adding, instead of computationally complex functions like roots and inverses. So whatever this theoretical function is, we can approximate it with a polynomial. This is the polynomial that approximates the theoretical function that takes any matrix and orthogonalizes it. First we calculate X times X transposed, and we get this A. Then we have a term with A squared, a term with A, and a term without any A. That's the polynomial, if you look at it carefully. I'm still trying to get the intuition behind this part, and a lot of videos on my channel are about it, so in the next courses I will understand it better.
This will not immediately transform the matrix into an orthogonal one. It will just move it a bit closer to orthogonal. So you need to apply this function several times, say five times, and every time the matrix will be a bit more orthogonal. Usually five to ten times is orthogonal enough.
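The iteration described above (compute A = X Xᵀ, then combine a term with A², a term with A, and a term without A) can be sketched in plain Python on small matrices. This is an illustration, not the notebook's implementation: the helper names are hypothetical, and the default coefficients are the tuned values found in widely circulated Muon implementations, so whether the notebook uses exactly these is an assumption. Those tuned values trade exactness for speed, so the result is only approximately orthogonal; the second call uses the textbook coefficients (15/8, -10/8, 3/8), which converge more slowly for small singular values but settle on an exactly orthogonal matrix.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def newton_schulz(g, steps=5, coeffs=(3.4445, -4.7750, 2.0315)):
    """Push matrix g toward an orthogonal matrix with a quintic polynomial."""
    a, b, c = coeffs
    # Normalize by the Frobenius norm so all singular values are at most 1.
    fro = math.sqrt(sum(v * v for row in g for v in row)) + 1e-7
    x = [[v / fro for v in row] for row in g]
    for _ in range(steps):
        big_a = matmul(x, transpose(x))                 # A = X X^T
        big_a2 = matmul(big_a, big_a)                   # A^2
        poly = [[b * p + c * q for p, q in zip(r1, r2)]
                for r1, r2 in zip(big_a, big_a2)]       # b*A + c*A^2
        poly_x = matmul(poly, x)
        x = [[a * p + q for p, q in zip(r1, r2)]
             for r1, r2 in zip(x, poly_x)]              # a*X + (b*A + c*A^2) X
    return x


update = [[3.0, 1.0],
          [1.0, 2.0]]  # an arbitrary "update matrix" for illustration

approx = newton_schulz(update)  # tuned coefficients: roughly orthogonal
exact = newton_schulz(update, steps=5, coeffs=(15 / 8, -10 / 8, 3 / 8))

print("tuned, X X^T:   ", matmul(approx, transpose(approx)))
print("textbook, X X^T:", matmul(exact, transpose(exact)))
```

For an orthogonal matrix, X Xᵀ is the identity, so printing that product is a convenient way to see how close each variant gets after five steps.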
A neural network usually doesn't need 100% perfect weights, initialization, or matrices; close enough is good. So I think we do it five times here, steps five, and that transforms our matrix into an orthogonal one, and orthonormal because we have normalization later as well.
Then we have this class that defines the Muon optimizer. We have some learning rate. This is a high learning rate, which is very good, because Muon can handle a higher learning rate and converge faster. What people found is that you can train the model with almost two times less data. You might see iterations per second, I mean steps per second when you train the model, go down a bit, but every iteration will teach the model a lot more. So don't think that just because the Muon optimizer has slightly slower steps per second it's worse than AdamW, for example. It's actually learning more per step than AdamW.
Here it's going to use momentum, either standard momentum or Nesterov momentum. It calculates the orthogonalized update matrix here and then updates the weights, using the momentum as well. And this is just some scaling that depends on the shape of the weight matrix, so it applies the update properly. You can copy this line of code and ask ChatGPT learn mode to teach you step by step. We will not focus on this right now, because it's simple for ChatGPT to teach you this, and it's good for those viewers who want to understand it better. You can just look at the shapes. Let's say this is left to the viewer as an exercise.
Then we have data loading. It's very simple. We will use an existing tokenizer. Tokenizers are not so important here and I don't want to focus on that. You can watch my course that I showed you. It's simple, you will understand everything there, and we train and build a tokenizer there as well.
So we will use this tokenizer and this data. It's the SmolLM corpus by Hugging Face. This dataset is very good for training small language models. It's very clean and simple, and the tokenizer is not too big. It's perfect for small language models. We will load this dataset, this part of the dataset, and use this tokenizer. We don't want to load everything at once because there is too much, so we load the number of documents determined by the hyperparameters that I explained above. When it asks you for the Hugging Face token, you can just press cancel. These datasets are public, so you don't need it.
This part will process the dataset, the corpus. The dataset is stored as documents. A document is just a JSON record: it has text, metadata, a title, and a bunch of other stuff. We just want to extract the text from each document, append all of the texts, and train our model on that big text. So our model will see the combined text of every document. Here I'm not appending more than 3,000 characters of each document, maybe to keep the diversity up, because we are only going to train for a bit. And we have some logging showing how many tokens we have and so on. That's the prepared dataset.
Then we have this dataset class that manages the downloaded text and prepares it for training. Given all of the text, we create a sliding window. We put this window into the model, and then it moves along like this. This is the code; it's very small. I recommend you copy these things into ChatGPT learn mode and ask it for some examples, and maybe for some exercises, because coding is something you need to get intuition for by playing with it a little bit.
So I will leave this ChatGPT conversation that will explain this to you. It's this one. Let's say we have some tokens; each of them is like a word. And then a sequence length of four tokens. The length of the dataset is going to be the number of tokens minus the sequence length, which is three. This is the input and this is the output. For each token, the model is predicting the next token. So you have these first two tokens, you predict this one. You have these tokens, you predict this one. Then you can use the same thing moved by one, and again. That's how you get X and Y: X is a sequence, and Y is just X shifted by one.
Then we go on to RoPE, rotary positional embeddings. When we have a sequence of tokens, how does the LLM know which token is at which position? Well, we pass every key and every query of each token through this forward of RoPE. This is again a rotation matrix that rotates the key and query vectors, like the rotation in the animation I showed you earlier, and the neural network will learn from the rotation where this token is in the sequence. It will also learn, for example, that the 100th token is far away from the 10th token but close to the 95th token. I will leave this link below.
This is what RoPE looks like. We have positions 0 to 512 in this case, and you see how we use sine; this is cosine, actually, and we have sine as well. So we use both components. Half of the RoPE vector is going to be cosine values and half will be sine. You see the difference: every position has its own pattern of values. Position zero has a lot of values at one. By the way, these are embedding dimensions going from 0 to 30 or 32 or so.
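The sliding-window dataset described at the start of this part (sequence length four, dataset length equal to number of tokens minus sequence length, targets shifted by one) can be sketched as follows. The class name and the word tokens are hypothetical, and the notebook's version works on token IDs in a PyTorch `Dataset`; this plain-Python version only shows the indexing.

```python
class SlidingWindowDataset:
    """Yield (input, target) windows where the target is shifted by one token."""

    def __init__(self, tokens, seq_len):
        self.tokens = tokens
        self.seq_len = seq_len

    def __len__(self):
        # The last window still needs one extra token for its final target.
        return max(0, len(self.tokens) - self.seq_len)

    def __getitem__(self, index):
        x = self.tokens[index:index + self.seq_len]
        y = self.tokens[index + 1:index + self.seq_len + 1]
        return x, y


# Words stand in for token IDs to keep the example readable.
tokens = ["today", "I", "will", "have", "a", "walk", "and"]
dataset = SlidingWindowDataset(tokens, seq_len=4)

print("number of samples:", len(dataset))
for i in range(len(dataset)):
    x, y = dataset[i]
    print(x, "->", y)
```

With seven tokens and a sequence length of four there are three samples, matching the count in the walkthrough. At every position inside a window, the target is simply the next token.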
So a lot of the dimensions have value one, but as we go down you see that higher dimensions get lower values as we go further along the sequence. What I'm trying to say is that every position has its own hard-coded vector. This is hard-coded, not learned by the network. But the network will see that this vector rotates the key and query vectors, and it will know from the rotation which position that key and query belong to. You can just watch the first part of this video that explains this. It's maybe four minutes or a bit more. I recommend you first watch this video. I have a good explanation there of how RoPE works; I drew everything. I will leave the URLs here. This one is the first video, the theory, that you should watch. You can watch the beginning part of this video and this video as well.
First it defines these frequencies. As I showed you in the picture, the sine and cosine waves become longer and longer. As you go to the right, the waves become huge, and in the beginning they are very short. It will split your embedding, be it query or key, into two equal parts, and then apply cosine to the first part and sine to the second part. Then it will apply minus sine and cosine to the same parts, and then it will concatenate y1 and y2. So this is the entire rotation matrix that rotates your vectors. This rotation is the same kind as the one I showed you for the Muon optimizer. It's just rotating the vectors. Depending on how much a vector is rotated, the LLM will know its position, because the sine and cosine have different values as you go down the positions. Check this ChatGPT URL for interactive one-on-one coaching by ChatGPT. You can turn on learn mode, or you can just copy this and tell it to teach you one-on-one interactively.
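The split-in-half rotation described above (y1 from cosine on the first half plus sine on the second, y2 from minus sine on the first half plus cosine on the second, then concatenate) can be sketched in plain Python. The function name, the base of 10000, and the example vectors are assumptions for illustration; the notebook precomputes the cosine and sine tables as tensors instead of calling `math.cos` per element.

```python
import math


def rope(vector, position, base=10000.0):
    """Rotate a query or key vector by angles that depend on its position."""
    half = len(vector) // 2
    x1, x2 = vector[:half], vector[half:]
    y1, y2 = [], []
    for i in range(half):
        # Early dimension pairs rotate fast, later pairs rotate slowly.
        angle = position * base ** (-i / half)
        cos, sin = math.cos(angle), math.sin(angle)
        y1.append(x1[i] * cos + x2[i] * sin)
        y2.append(-x1[i] * sin + x2[i] * cos)
    return y1 + y2


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


query = [1.0, 2.0, 3.0, 4.0]
key = [0.5, -1.0, 2.0, 0.0]

# Same distance between query and key (2 positions), different absolute positions.
score_near_start = dot(rope(query, 3), rope(key, 1))
score_further_on = dot(rope(query, 10), rope(key, 8))

print(score_near_start, score_further_on)
print("length before:", math.sqrt(dot(query, query)))
print("length after: ", math.sqrt(dot(rope(query, 7), rope(query, 7))))
```

Two properties fall out of the rotation: the vector's length never changes, and the query-key dot product depends only on how far apart the two positions are, which is how attention can tell that token 100 is close to token 95 and far from token 10.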
Then we go on to the self-attention mechanism, the grouped-query attention mechanism. This is where RoPE will be applied, by the way. Here we are defining the RoPE, and here we are putting the query and key through it to rotate them. But let's go back and look at this again. As I said, I recommend you watch my course on Llama 4. It has a very good explanation of the self-attention mechanism, and you will understand the basics, the mathematics, and the visualization completely. After you understand that, this will be quite simple, I must say. Although if you are seeing self-attention for the first time, it can take a few weeks. It took me some time to understand it. Just try practicing it, drawing it, explaining it. You can try to explain it the same way I did in that course, with all of the numbers and all of the drawings.
So we define these variables that I was explaining at the beginning of the video, and then we create these projection matrices. This takes our token embedding, multiplies it with some weights, and converts it to the key; the same embedding multiplied with different weights gives the query, and likewise the value. So each of these takes the model dimension, the token embedding vector, and projects it into the key, query, or value vector. And this is interesting: the dimension of these vectors is going to be the number of query heads times the dimension of each head. They are using the key dimension here, I guess because the dimension is the same for query, key, and value in each head, so you can use one parameter. What this linear layer produces is one long embedding that contains all of the heads. Then, in the self-attention mechanism, all of the heads are split, and each head interacts with its counterparts separately.
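As a small worked example of the projection sizes discussed above, the sketch below computes the input and output widths of each projection and shows how one long projected vector is cut into heads. The layer names, the function names, and the model dimension of 384 are assumptions for illustration; only the head counts (eight query heads, four key/value heads) come from the video.

```python
def projection_shapes(d_model, n_heads, n_kv_heads):
    """(input width, output width) of each attention projection."""
    head_dim = d_model // n_heads  # same per-head size for query, key, value
    return {
        "q_proj": (d_model, n_heads * head_dim),
        "k_proj": (d_model, n_kv_heads * head_dim),
        "v_proj": (d_model, n_kv_heads * head_dim),
        "o_proj": (n_heads * head_dim, d_model),
    }


def split_heads(flat, n_heads):
    """Cut one long projected vector into equal per-head chunks."""
    head_dim = len(flat) // n_heads
    return [flat[i * head_dim:(i + 1) * head_dim] for i in range(n_heads)]


shapes = projection_shapes(d_model=384, n_heads=8, n_kv_heads=4)
for name, shape in shapes.items():
    print(name, shape)

# A toy projected vector of width 8 split into 4 heads of width 2.
print(split_heads(list(range(8)), n_heads=4))
```

With grouped-query attention the key and value projections are narrower than the query projection, which is where the parameter and cache savings come from.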
So the first head of the query goes with the first head of the key and value. As I said, this attention bias is usually not used, but maybe there are some tricks to use it in some layers or in training. I'm not sure exactly. Maybe in the future I will understand this better and tell you. Then we normalize all of the heads with RMS norm. RMS norm is very simple to understand. I recommend you just ask ChatGPT in learn mode to explain it. It's very simple. For every dimension in the vector, you square that dimension. Then you sum up all of the squares, and then you divide the sum of the squares by the number of squares. Intuitively, what you get here is the average squared value of the dimensions. Then you take the square root, and now you get the average magnitude of a dimension. So that's the root mean square. But to apply the normalization, you divide every dimension by this average magnitude. You can continue the conversation with ChatGPT if you want and enable learn mode, because I cannot share learn mode, so you need to enable it manually. I'm going to give this URL here: practice RMS norm one-on-one with ChatGPT. Here we will also define rotary and dropout. In the forward method of this self-attention mechanism, we apply these query, key and value projection matrices. There are some tricks: you can combine all of this into a single computation. So instead of having three linear layers, you just concatenate them. You put the first linear, the second and the third into one big matrix, I should say, then apply that to x and then split the result: this part for query, this part for key and this part for value. You just split it later. But you can do it separately like this as well. DeepSeek combines it. A lot of people combine it, but I think in Qwen it was separate.
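The RMS norm steps above can be sketched in a few lines of plain Python. The function name, the optional `weight` (a learnable per-dimension scale that real layers usually have) and the `eps` value are assumptions for illustration, not taken from the video's code.

```python
import math

def rms_norm(x, weight=None, eps=1e-6):
    """Divide every dimension of `x` by the root mean square of all dimensions."""
    mean_square = sum(v * v for v in x) / len(x)  # average squared value
    rms = math.sqrt(mean_square + eps)            # average magnitude of a dimension
    if weight is None:
        weight = [1.0] * len(x)                   # no learned scale in this sketch
    return [w * v / rms for v, w in zip(x, weight)]

print(rms_norm([3.0, 4.0]))  # roughly [0.85, 1.13]
```

For the input `[3.0, 4.0]` the mean square is 12.5 and its square root is about 3.54, so each value is divided by about 3.54. After normalization the root mean square of the vector is approximately one.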
Then you separate every head and apply the normalization. This is our RMS norm, and then we apply rotary embeddings. But see, here we put the heads dimension first, then the sequence length, and the dimension of each head after, because we want to completely separate every head. We don't want to have the sequence length first and have things grouped by tokens. We want them grouped by heads. So for each token we just have the key, query and value of this particular head. I recommend you copy this into ChatGPT and ask it to give you some examples or exercises if you don't understand this. I'm telling you this because there are so many details here that this course would be 10 hours long, and it would be confusing for people who already know this. So I'm trying to make this course fit everybody, and you can get one-on-one learning from ChatGPT. That's how I learned this, so that's the best way. Hello, it's me from the future. I removed these transposes here, but we actually need to have them, so that was a mistake. This is in case you are wondering why this is going to look a bit different in the video for some time. I thought we were already transposing it here, but we are not. We are just permuting these dimensions, but we don't save this transpose into the query and key, so we need to actually transpose them again. I thought I was transposing two times, but I actually wasn't, so I need to fix it. And then we repeat the key and value heads, and then we just apply the attention. This is a very good, very fast attention function by PyTorch. Usually you want to use these functions by PyTorch. They're very optimized and very fast. It will be faster than if you code this yourself, unless you are an expert or you are coding something completely different. And in the end we just want to transform whatever the output of attention is back into the vector embedding.
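The step of repeating heads before attention can be illustrated without tensors. In this sketch each head is just a label, and the helper `repeat_kv` is a hypothetical stand-in for the tensor operation. It assumes the usual grouped query attention layout, where consecutive query heads share one key/value head.

```python
def repeat_kv(kv_heads, n_rep):
    """Repeat each key/value head n_rep times so it lines up with the query heads."""
    repeated = []
    for head in kv_heads:
        repeated.extend([head] * n_rep)
    return repeated

query_heads = ["q0", "q1", "q2", "q3"]
kv_heads = ["kv0", "kv1"]
n_rep = len(query_heads) // len(kv_heads)  # 2 query heads per key/value head

pairs = list(zip(query_heads, repeat_kv(kv_heads, n_rep)))
print(pairs)
# [('q0', 'kv0'), ('q1', 'kv0'), ('q2', 'kv1'), ('q3', 'kv1')]
```

After repeating, there is one key/value head per query head, so a standard attention function can be applied as if this were ordinary multi-head attention.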
So this is how we finish the attention mechanism, the attention transformation. Then we have this feed-forward network. This is a classic feed-forward. We have the up projection from the model dimension, the token dimension, to an inner hidden dimension that is four times bigger, and then the down projection from this four times bigger dimension back to the model dimension. This happens after the attention layer: the output of attention goes into this. So we just project up. This huge hidden layer has a lot of neurons that hold information on how to process the output of attention, how to mix it up to get some useful information. But the thing with SwiGLU is that we also have this gate. It's the same kind of transformation as the up projection. This gate will just activate and deactivate some of the neurons in this big hidden layer. The gate is passed through SiLU, and this is SiLU, the pink one. You see that if the input value is larger than zero, then SiLU returns almost the same value, but if the value is smaller than zero, then SiLU returns some value very close to zero. So we basically have on and off switches. Neurons that are larger than zero will remain, and lower neurons will become almost zero. Just look at the pink one and ignore the blue one, that's ReLU. Then we multiply the output of the SiLU with the output of the up projection matrix, so with these hidden layer neurons. All of the neurons that get multiplied by values close to zero will get deactivated, or reduced, or suppressed, and the other neurons will be multiplied by gate numbers that do not suppress them and might even amplify them. So this is an additional level of control. It can learn to suppress and amplify these inner neurons based on the input x.
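A small plain-Python sketch of the gated feed-forward described above. The weights are tiny made-up matrices (2 inputs, 3 hidden neurons) chosen only to show the mechanics; the names `w_gate`, `w_up` and `w_down` are assumptions, and no biases are used.

```python
import math

def silu(x):
    """SiLU: x * sigmoid(x). About x for large positive x, about 0 for large negative x."""
    return x / (1.0 + math.exp(-x))

def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]

def swiglu_ffn(x, w_gate, w_up, w_down):
    gate = [silu(g) for g in matvec(w_gate, x)]  # the learned on/off switches
    up = matvec(w_up, x)                         # the hidden neurons
    hidden = [g * u for g, u in zip(gate, up)]   # gate times value
    return matvec(w_down, hidden)                # back to the model dimension

# Hypothetical weights for illustration only.
w_gate = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
w_up = [[1.0, 1.0], [1.0, -1.0], [0.5, 0.5]]
w_down = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]

x = [2.0, -3.0]
print([round(silu(g), 3) for g in matvec(w_gate, x)])  # gate values per hidden neuron
print(swiglu_ffn(x, w_gate, w_up, w_down))
```

In this example the second hidden neuron has a gate input of -3, so its gate value is close to zero and its contribution is mostly suppressed, even though its up projection value is the largest of the three.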
So instead of always processing the neurons in the same way, we have some amplification and reduction. You can think of this gate times value as brightness control times light source. The light source is the up projection, which is the source of information, and the brightness control amplifies or suppresses it. And then the transformer block. We will just define all of our things. I can show you the forward method. So we have the input, we normalize the input, we pass it through attention, and then we have some dropout. I think we don't have it in this case, we just set it to zero. But we also have this residual connection, so we are adding the input to the transformed attention output. This keeps some of the initial information. Then we pass through the norm and the feed-forward network, and then another residual connection, and that's our transformer block. So you see, here we have multi-head attention but the norm is before it, and here feed-forward but the norm is before it. It works better that way. And then we have the complete language model. We start by defining the token embeddings, which is vocab size: for each token we have the token embedding, so it's the d_model dimension. Then position dropout, which I think we are not using, we are just setting it to zero. Then a list of these transformer blocks, which is equal to our number of layers, which was six if I remember correctly. Then normalization and output dropout, and then we tie the input weights to the output weights. So we have this output head that's going to take the token embedding and project it into the whole vocabulary size. It will create a probability for every token in the vocabulary to be the next token, based on whatever got processed here, the output of the transformer. So it's this linear layer here. Now, it will not give probabilities. It will just give logits, numbers, and then we will convert those logits into probabilities with softmax, as I explained.
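The order of operations in the block (norm first, then the sublayer, then the residual addition) can be sketched with stand-in functions. Everything here is a placeholder: `attention`, `feed_forward` and the norms are passed in as plain callables on lists, and dropout is left out since it is set to zero in the video.

```python
def add(a, b):
    """Element-wise sum, used for the residual connections."""
    return [x + y for x, y in zip(a, b)]

def transformer_block(x, attention, feed_forward, norm1, norm2):
    """Pre-norm block: normalize, transform, then add the untouched input back."""
    x = add(x, attention(norm1(x)))     # residual around attention
    x = add(x, feed_forward(norm2(x)))  # residual around feed-forward
    return x

# Stand-in sublayers, for illustration only.
identity = lambda h: h
double = lambda h: [2.0 * v for v in h]
ones = lambda h: [1.0 for _ in h]

print(transformer_block([1.0, 2.0], double, ones, identity, identity))
# [4.0, 7.0]
```

With these stand-ins the first residual gives `[1, 2] + [2, 4] = [3, 6]` and the second gives `[3, 6] + [1, 1] = [4, 7]`. If both sublayers returned zeros, the block would return its input unchanged, which is the sense in which the residual path keeps the initial information.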
But you can use the same weights for this layer and for the input token embedding layer. It will work the same, we reduce memory, and it will work well. This example will illustrate it better. In the beginning we have this lookup table. The first token has this vector embedding, the second token has this vector embedding. You can actually take this entire matrix and use it as the weights in the final linear layer. So you multiply the output of the transformer with the entire lookup table to get a vector that is vocab size in length. Let's say we have this lookup table. In the beginning every token has its own vector embedding, token one, token two, and then after the processing of the transformer, in the end, we get this vector. This should also represent some kind of token, but it may not be any specific token in the lookup table. It's just some vector. So in the final output head we multiply this vector with this one and get some number. It's a dot product, so we get some number, and then with the next one we get another number. If some token's embedding is very close to this vector, the dot product will be very big. So for every multiplication we get some number, and then we can apply softmax to find the probabilities. The resulting vector will have the length of the vocabulary, so the number of tokens, and that's how we can use the same weights to save memory. I will leave this ChatGPT chat here if you want to check it yourself. Below that we have the way of initializing weights. There are some different, better ways to initialize weights than just random initialization, or you can use random or any of the other ways. And then the forward pass for the transformer goes like this. You first convert the input into these token embeddings and then you scale them by the square root of the model dimension.
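Here is the weight-tying idea as a toy example in plain Python. The three-token, two-dimensional `embedding` table and the `hidden` vector are made up for illustration. The same table is used once as a lookup on the way in and once as the output matrix on the way out.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# One shared table: row i is the embedding of token i (hypothetical values).
embedding = [
    [1.0, 0.0],   # token 0
    [0.0, 1.0],   # token 1
    [-1.0, 0.0],  # token 2
]

def embed(token_id):
    """Input side: look up the row for a token."""
    return embedding[token_id]

def tied_logits(hidden):
    """Output side: dot product of the hidden vector with every row of the same table."""
    return [sum(h * e for h, e in zip(hidden, row)) for row in embedding]

hidden = [0.9, 0.1]           # pretend this came out of the last transformer block
logits = tied_logits(hidden)  # one score per vocabulary entry
probs = softmax(logits)
print(logits)
print(probs)
```

The hidden vector points almost in the same direction as the embedding of token 0, so token 0 gets the largest dot product and therefore the highest probability, while token 2, whose embedding points the opposite way, gets the lowest.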
You pass that through the transformer blocks, normalize in the end and get the logits. So that's the LLM. We can then apply softmax after. During training we need to evaluate the model's performance. We don't need to worry too much about this. It is just going to calculate the average loss, accuracy and perplexity, or confusion. We have our logits, and we can use y to see which tokens are the correct next tokens, and then we can calculate the loss, the difference between the correct next token and what our LLM predicted. I just added some comments to help you, and you can copy this into an AI and do some exercises while learning. We can go below, and this is just going to set up Muon and AdamW, because the Muon optimizer is just for 2D matrices. So if the number of dimensions is two and it's not token embeddings and it's not a norm, then the parameter is going to be handled by the Muon optimizer, otherwise by the AdamW optimizer. You can see how many of the parameters are handled by each, this will just log it, and then we initialize both optimizers. The AdamW optimizer usually needs a much lower learning rate than Muon, 10 times lower in this case. And finally we have the training loop: setting the seed and initializing all of these good things, the optimizers and the learning rate schedulers. You don't want to have a constant learning rate. You want to change the learning rate as the training goes. In the beginning the learning rate is low, but it will quickly become big and then slowly drop off. As the training progresses the model will know more and more, so we don't want to adjust the weights too much, and we want to make learning slower and more stable. So this is the training loop. This is just initializing and running everything.
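The learning rate shape described above (low at first, quickly rising, then slowly dropping) can be sketched as a multiplier on the base learning rate. The exact shape is an assumption here: linear warmup followed by cosine decay down to a floor of 10% is a common choice, and `lr_multiplier`, `warmup_steps` and `min_ratio` are hypothetical names and values.

```python
import math

def lr_multiplier(step, max_steps, warmup_steps, min_ratio=0.1):
    """Factor to multiply the base learning rate by at a given step."""
    if step < warmup_steps:
        return step / warmup_steps  # linear warmup from 0 to 1
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 down to 0
    return min_ratio + (1.0 - min_ratio) * cosine

max_steps, warmup_steps = 1000, 50
for step in (0, 25, 50, 500, 1000):
    print(step, round(lr_multiplier(step, max_steps, warmup_steps), 3))
```

In PyTorch a function like this is typically passed to `torch.optim.lr_scheduler.LambdaLR`, one scheduler per optimizer, so both Muon and AdamW follow the same shape while keeping their own base learning rates.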
So we pass the input through the model, get the loss, then calculate the gradients with backpropagation, here, calling backward on the loss, and then update the weights with the optimizers. I will scroll through this. We also have this printing or logging, and we have this evaluation that's going to happen every certain number of steps, and this will show the progress bar. Then we have a bunch of printing and saving the model in the end. So that was defining the training loop, and now we will run the training loop. We start with just some printing, setting the seed, loading the config and printing everything, then loading the data and the tokenizer, then splitting into training and validation sets, and then this is the training data loader and the validation loader. It will take this loaded text and feed it into the model. Then in train model we just pass the training loader and the validation loader, so the model can use them, and this is where it downloads this stuff and starts the training. Wait, maybe this was not a mistake. I got an error. I'm not sure. Let's see again. It looks like it was not a mistake. So, it was a bit weird. Our training will take 13 minutes. I'll just wait for it. Look at the loss here. It should be going down. It will be going down, especially as our learning rate is warming up and becoming bigger. It will start to go down rapidly soon, and then perplexity should go down. You see the learning rate is becoming bigger and bigger here very quickly, and accuracy should go up. So now the learning rate is big, and the loss should go down rapidly. You see it now, it's going down very rapidly. The loss is already 4.1. And guys, if you know how I can fix this, tell me: I don't want this to log my progress bar in every row, but just keep it in one row. We need to wait seven more minutes. And now the learning rate is slowly going down.
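The three numbers in the training log are closely related. The sketch below computes them in plain Python for a list of logit vectors and target token ids; the function names and the return format are assumptions. It uses the standard definitions: loss is the average cross-entropy in nats, and perplexity is the exponential of that loss.

```python
import math

def cross_entropy(logits, target):
    """Negative log-probability of the correct next token."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[target]

def evaluate(batch_logits, targets):
    losses = [cross_entropy(l, t) for l, t in zip(batch_logits, targets)]
    correct = sum(1 for l, t in zip(batch_logits, targets) if l.index(max(l)) == t)
    avg_loss = sum(losses) / len(losses)
    return {
        "loss": avg_loss,
        "accuracy": correct / len(targets),
        "perplexity": math.exp(avg_loss),
    }

# A loss of 4.1, as in the log above, corresponds to a perplexity of about 60.
print(round(math.exp(4.1), 1))
```

A perplexity of about 60 means the model is, on average, roughly as uncertain as if it were choosing uniformly among 60 tokens, which is why perplexity falls as the loss falls.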
It first went up very quickly, and then it slowly goes down as we train more. We don't want to update the weights too much as the model learns more. Here we have validation, and this is checking on different data than the training data. But all of the data is always different anyway, because the model is just going to see the training data once. So this validation loss is very similar to the training loss. It's even a bit lower. It's always confusing to me how validation loss can be lower, but maybe it's just randomly a bit lower. And then validation accuracy is even bigger than the training accuracy for some reason, and perplexity is better, it's lower. So I don't know how validation can be better than training. But now, before this finishes training, I want to show you how we run inference with the model. The model will be saved as best model.pt for now, but in the end it's going to be named final model.pt. So we need to load the model here. It will load, we will initialize the model and everything, and then we will use it to generate text. We have a few parameters. Temperature just means that if we increase the temperature, the probabilities will all get more uniform. If there was a token with high probability, it will get squished down, and low probabilities will get pushed up. So the probabilities will be more uniform and the generation will be more random. But if the temperature is close to zero, the probabilities get very different from each other, so the most likely token dominates. If you want random things, for example to generate poetry, you use a higher temperature, but for math and coding you use a lower temperature, so there is not so much randomness. Top K just means we will sample from the K most likely tokens, for example the 50 most likely tokens. And top P means we sample from the smallest set of tokens whose cumulative probability exceeds P, for example the fewest tokens whose combined probabilities add up to at least 90%. And that's how it samples.
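The three sampling controls above can be combined in a short plain-Python sketch. This is one common way to order the steps (temperature, then top K, then top P), not necessarily the order used in the video's code; the function names and default values are assumptions, and `temperature` must be greater than zero.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def filter_candidates(probs, top_k=None, top_p=None):
    """Return token ids allowed for sampling, most likely first."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]  # keep only the K most likely tokens
    if top_p is not None:
        kept, cumulative = [], 0.0
        for i in ranked:         # smallest set whose probabilities reach top_p
            kept.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:
                break
        ranked = kept
    return ranked

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    probs = softmax([v / temperature for v in logits])  # higher temperature flattens
    candidates = filter_candidates(probs, top_k, top_p)
    weights = [probs[i] for i in candidates]            # choices() renormalizes these
    return rng.choices(candidates, weights=weights, k=1)[0]

logits = [2.0, 1.0, 0.0]
print(softmax([v / 0.1 for v in logits]))    # low temperature: almost all mass on token 0
print(softmax([v / 100.0 for v in logits]))  # high temperature: nearly uniform
print(sample_next_token(logits, temperature=0.8, top_k=50, top_p=0.9))
```

Generation then repeats this one-token step, appending each sampled token to the input, until a maximum length or an end-of-sequence token is reached.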
So the way it stops sampling is if it reaches the max length or an end-of-sequence token. I think this inference code is not so important, and it's not so difficult once you understand the training. We also have this interactive inference, where you can just talk to the model and prompt it and it will talk back to you. Then I have a few of these prompts to show what it's going to generate, and this function will run everything, and I'll show you. Looks like our training is complete, and now we have the final model. So I will load the model. I will define all of these inference functions, top P, top K, temperature and all of this cool stuff, which calculate the softmax of these next-token logits. That gives you probabilities, and then you do some sampling. We just want to sample one token based on these sampling methods here. Then I want to define the interactive chat, then I want to define this demo, and then I want to run it. So, "the future of artificial intelligence": this is the beginning of the prompt, and then it continues. Okay, so it's a bit messy here: "engineering engineering and computer science and engineering the systems allow scientists to design which can help solve complex systems". Actually, it does make some sense. And then we also have some Python code that is funny. But we just trained it for 15 minutes on a very small GPU and it already makes some sense. It might look a bit more impressive than it is, though, because each token here is a whole word. So it can just put a bunch of words here, and because a whole word is a token, the output can seem more organized than it is, even though it's maybe more random. These whole-word tokens make it seem like it's outputting coherent words. But I think it's still good for 15 minutes on this small GPU. So you can check the interactive mode. You can train this for a longer period of time. You can check other videos and courses on my channel.
I'm making videos about AI research, AI science, AI engineering, LLMs, diffusion, everything. And see you next

Original Description

Learn how to create an LLM from scratch. In this tutorial you will build Qwen 3, one line at a time. Watch gradients flow, models learn, and AI come alive in real-time.

Code on Google Colab - https://colab.research.google.com/drive/12ndGn_mI7R1GTbGS8I2EvajW50esJRRk?usp=sharing
GitHub - https://gist.github.com/vukrosic/94dc965a22b0892042f44fed25918598

⭐️ Contents ⭐️
⌨ (0:00:00) Intro & Demo
⌨ (0:01:46) Qwen 3 Architecture
⌨ (0:02:36) Prerequisites
⌨ (0:04:01) Code Setup & Imports
⌨ (0:05:26) Model Configuration
⌨ (0:08:26) Qwen 3 Specifics
⌨ (0:12:24) Training Hyperparameters
⌨ (0:17:18) Grouped Query Attention Logic
⌨ (0:18:56) Muon Optimizer Explained
⌨ (0:29:02) Data Loading & Tokenization
⌨ (0:32:37) RoPE Positional Embeddings
⌨ (0:36:56) Self-Attention Code
⌨ (0:44:28) Feed-Forward & SwiGLU
⌨ (0:47:36) Building the Final Model
⌨ (0:52:34) Evaluation & Optimizer Setup
⌨ (0:54:08) The Training Loop
⌨ (0:55:43) Running the Training
⌨ (0:58:38) Inference & Text Generation
⌨ (1:00:51) Final Results

❤️ Support for this channel comes from our friends at Scrimba – the coding platform that's reinvented interactive learning: https://scrimba.com/freecodecamp

🎉 Thanks to our Champion and Sponsor supporters: 👾 Drake Milly 👾 Ulises Moralez 👾 Goddard Tan 👾 David MG 👾 Matthew Springman 👾 Claudio 👾 Oscar R. 👾 jedi-or-sith 👾 Nattira Maneerat 👾 Justin Hual

Learn to code for free and get a developer job: https://www.freecodecamp.org
Read hundreds of articles on programming: https://freecodecamp.org/news

This tutorial teaches how to create a Large Language Model (LLM) from scratch, covering the architecture, code setup, training, and evaluation of the model. By following this tutorial, viewers can gain hands-on experience in building and training a language model.

Key Takeaways

Set up the code environment using Google Colab
Import necessary libraries and modules
Define the model architecture
Configure the model hyperparameters
Load and preprocess the data
Train the model
Evaluate the model performance

💡 Building a language model from scratch requires a deep understanding of the underlying architecture, hyperparameters, and training procedures.
