How does AI actually work? Transformers explained

AI Search · Beginner ·🧠 Large Language Models ·3mo ago

Skills: LLM Foundations 90% · LLM Engineering 60% · Fine-tuning LLMs 50%

Key Takeaways

The video explains the transformer architecture and its role in modern AI models like GPT, covering topics such as tokenization, input embedding, positional encoding, and multi-head attention. It also discusses the attention mechanism, skip connections, and the use of decoder blocks to refine the meaning of input sentences.

Full Transcript

Today, AI can answer almost any question you throw at it. It can write poems, essays, or even generate a full research report for you in seconds. This seems way too good to be true. Have you ever paused and wondered how exactly this works? What crazy stuff is going on under the hood to make this happen? Today, we're going to break this all down. We're going to go over how modern AI models like GPT, Gemini, DeepSeek, and others actually work. We're going to go deep into the technical details, but don't worry, I'm going to explain this in simple terms so that anyone can understand. You see, the foundation behind all these models is the transformer architecture, which was first introduced in this legendary paper called Attention Is All You Need by researchers at Google. And the architecture used by most AI models today is nearly identical to this original paper. Yes, the models have gotten much larger. They've been trained on way more data, and researchers have made various improvements along the way to make them better, but the fundamental structure is still the same. Note that the original transformer model had two halves, an encoder and a decoder. Think of it like a translator sitting between two people who speak different languages. Now, this setup was originally meant for translation, but it turns out that for chatbots like GPT and Gemini, we don't actually need this encoder component. So we can simplify it to look like this. And that's what we're going to go over in this video. We're going to go over how a decoder-only transformer works. First, let me give you a high-level overview of how this works. This transformer is basically given a sentence fragment which is broken down into data that flows through the transformer. At the end, it spits out the next most probable word. This word gets added back to the original sentence and then that process repeats again and again. It spits out the next most probable word one at a time until it finishes its response.
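The loop described above can be sketched in a few lines of Python. This is only an illustration: `predict_next` and its lookup table are hypothetical stand-ins for a trained transformer, which would instead score every token in its vocabulary.

```python
# Hypothetical stand-in for a trained model: maps the last word to the next one.
NEXT_WORD = {"by": "bus", "bus": "every", "every": "day", "day": "<end>"}


def predict_next(words):
    """Return a guess for the next word given the words so far."""
    return NEXT_WORD.get(words[-1], "<end>")


def generate(prompt, max_new_words=10):
    """Append one predicted word at a time until the model signals it is done."""
    words = prompt.split()
    for _ in range(max_new_words):
        next_word = predict_next(words)
        if next_word == "<end>":
            break
        words.append(next_word)  # the new word becomes part of the input
    return " ".join(words)


print(generate("I go to work by"))  # I go to work by bus every day
```

The point of the sketch is the feedback: each predicted word is appended to the input before the next prediction is made.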
You might be thinking, "That's way too simple. How can this model that just predicts the next most likely word be able to write a full detailed essay or a medical research report for me?" But as crazy and bizarre as it sounds, that's exactly how it works. All right, let's get into the real stuff. I'm going to walk you through this entire architecture piece by piece. It sounds intimidating, but I promise each part is surprisingly intuitive once you understand it. Let's start from the beginning. What happens if you feed the sentence fragment I go to work by through this transformer model? First, the model can't actually read English or any language. It only understands numbers. So, we need a way to turn the text into numbers that the AI model actually understands. And this process is called tokenization. Now, if you needed to think of a way to do this, how would you do so? One way would be to just take each word separately in your vocabulary and give that a number. So for example, for the sentence, he is unhappy with the redesign, each unique word would have a labeled number. Plus, each space and each punctuation would also have a number. However, the problem with this is if we label each unique word in the English language, including conjugations, tenses, etc., it would be enormous. That's a ridiculous amount of labels needed, and it's very inefficient. Now, conversely, you can also take just each letter of the alphabet and assign a token. In that case, the data or vocabulary that we need to work with is super small. It's just 26 labels corresponding to the 26 letters plus spaces and other punctuation marks. However, the problem with doing this is you lose the semantic meaning of the words. Each letter by itself doesn't really have any meaning. It's like if you tried to read a book, but each letter at a time, it doesn't make much sense. It's the combination of letters which become words that give meaning to a sentence. 
So this method of tokenizing each letter also doesn't really work well. It turns out the ideal approach is actually somewhere in between where we don't label all the possible words in the English vocabulary nor do we separate each individual letter but instead we separate words into meaningful subparts. Going back to our sentence here, for example, unhappy might get broken down into un and happy and redesign would be separated into re and design. So, it kind of learns common prefixes, suffixes, and root parts of words. And this actually does a couple of amazing things. First, it keeps the vocabulary size manageable, much smaller than assigning a label to each unique word. Second, it handles unknown words really gracefully. If the AI gets a brand new word it has never seen before, say webinarification, instead of just saying I don't know, it would break this down into webinar and ification, and it can use its understanding of those parts to guess the meaning. So this way it can sort of handle novelty and words it has never seen before. This makes AI much more adaptable to new words, slang, and typos. And often these subwords capture meaning better than just the whole word or single letters. For example, after training, it would understand that un often means negation and re often means doing something again. All right, so after this tokenization step, the AI can identify each word or subword by a token, but it still doesn't know the meaning of each word. How do we give these numerical tokens a deeper meaning? We need a way to represent these words with numbers, but it needs to actually capture meaning, semantic relationships, and how words relate to each other. Well, this brings us to the input embedding step here. The model converts each token ID into a much bigger vector or basically a long list of numbers. In my demo here, for simplicity, note that I'm ignoring spaces just so it doesn't become too messy. 
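The subword idea described above can be sketched with a tiny hand-written vocabulary and a greedy longest-match split. Both are assumptions made for illustration: real tokenizers learn their vocabulary from data (for example with byte-pair encoding), and the names `VOCAB`, `tokenize_word`, and `encode` are hypothetical.

```python
# Hypothetical toy vocabulary: subword piece -> token ID.
VOCAB = {"un": 0, "happy": 1, "re": 2, "design": 3,
         "he": 4, "is": 5, "with": 6, "the": 7}


def tokenize_word(word, vocab):
    """Split one word into the longest known pieces, scanning left to right."""
    pieces = []
    start = 0
    while start < len(word):
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No known piece starts here: fall back to a single character.
            pieces.append(word[start])
            start += 1
    return pieces


def encode(sentence, vocab):
    """Return the subword pieces and their token IDs (None if unknown)."""
    pieces = []
    for word in sentence.lower().split():
        pieces.extend(tokenize_word(word, vocab))
    return pieces, [vocab.get(piece) for piece in pieces]


pieces, ids = encode("He is unhappy with the redesign", VOCAB)
print(pieces)  # ['he', 'is', 'un', 'happy', 'with', 'the', 're', 'design']
print(ids)     # [4, 5, 0, 1, 6, 7, 2, 3]
```

Spaces and punctuation are ignored here to keep the sketch short; a real tokenizer assigns tokens to them as well.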
But in reality, you also need to account for spaces and punctuation as well. Now, in my demo here, the vector length is 10. In other words, there are 10 numbers in each list. But in real AI models, these vectors are crazy long. For example, for GPT-3, there are 12,288 numbers for each vector. I'd guess the later versions of GPT have even more numbers. You can think of each vector as giving each word coordinates in a high-dimensional space, like a map with hundreds or thousands of dimensions instead of just two or three. And the key here is that words with similar meanings end up being closer in this multi-dimensional space. For example, let's take the words man, woman, boy, and girl. Let's say they have vectors like this. You can think of each number as one dimension that represents a concept. For example, the first number in the vector could represent gender. So here you can see that man and boy are close together whereas woman and girl are at the opposite end. Or the second number might represent the age dimension, in which case man and woman would be close together and boy and girl would be closer together at the other end because they are younger. At least conceptually, that's how you can kind of think of this. Next, you might be wondering, how do we figure these numbers out? Well, actually, we don't. The model learns these values during training. I'll talk more about training later in the video. But for now, let's assume that we already have a fully trained model where the numbers of these vectors are already optimized. All right. So, going back to our input sentence, I go to work by, it gets converted into these embedding vectors that capture the meaning of each word. That's this part here. But that's not enough. There's a subtle but crucial problem. Unlike your brain, which naturally reads words left to right, the transformer processes all words simultaneously.
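To make the man, woman, boy, girl example concrete, here is a minimal sketch of an embedding lookup. The two-number vectors are invented for illustration (first number for gender, second for age); a trained model learns its own values, and its dimensions rarely map onto such clean concepts.

```python
import math

# Hypothetical token IDs and a hand-made embedding table (one row per token).
TOKEN_IDS = {"man": 0, "woman": 1, "boy": 2, "girl": 3}
EMBEDDING_TABLE = [
    [1.0, 1.0],    # man:   dimension 0 = gender, dimension 1 = age
    [-1.0, 1.0],   # woman
    [1.0, -1.0],   # boy
    [-1.0, -1.0],  # girl
]


def embed(word):
    """Look up the vector for a word via its token ID."""
    return EMBEDDING_TABLE[TOKEN_IDS[word]]


def distance(word_a, word_b):
    """Straight-line distance between two words in embedding space."""
    return math.dist(embed(word_a), embed(word_b))


print(distance("man", "boy"))   # 2.0 (differ only in age)
print(distance("man", "girl"))  # larger: differ in both gender and age
```

In this toy table, words that share a concept sit closer together than words that share none.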
It sees the whole sentence at once, which is great for speed, but it means the model has no idea what order the words are in. Take the sentences the dog bit the cat and the cat bit the dog. To us, these are obviously different things. But to the transformer, they would look identical unless there's some way to give it the position of each word. And this brings us to the next step, which is positional encoding. Going back to our sentence I go to work by, we need a way to add the position of each word, 0 1 2 3 4, to each vector. But here's an important thing you need to remember. We can't just add a single number 1 2 3 or 4 to each vector. We need a way to represent 0 1 2 3 4 as a list of numbers of the same size, in our case 10 numbers, to add them to our existing embeddings. Now, the original transformer paper solved this using sine and cosine waves, which was quite genius. If you're curious, here's the original formula. And I know it sounds random, but it turns out that if you use alternating sine and cosine functions at different frequencies, you get a unique list of numbers at every position. No two positions ever produce the same combination. So, zero would look like this, one would look like this, two would look like this, and so on. Each number can be turned into a unique pattern of numbers just like a fingerprint. Now for our example, since our vector length is only 10, we only need to use the first 10 numbers here. So after applying this positional encoding formula, the positions of 0 1 2 3 4 would look like this. And we just need to simply add these to our original embeddings to get a new vector. Now these new numbers basically store information about not only the meaning of each word, but also its position in the sentence. And that sums up the positional encoding step here. Afterwards, we can finally plug the data into this transformer. All right. So the next component is called masked multi-head attention.
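For reference, the formula from the original paper is PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)), where d_model is the vector length. A minimal sketch for the demo's vector length of 10 follows; the function names and the embedding values are made up for illustration.

```python
import math


def positional_encoding(position, d_model):
    """Sinusoidal encoding: sine on even indices, cosine on odd indices."""
    encoding = []
    for index in range(d_model):
        pair = index // 2  # each sine/cosine pair shares one frequency
        angle = position / (10000 ** (2 * pair / d_model))
        encoding.append(math.sin(angle) if index % 2 == 0 else math.cos(angle))
    return encoding


def add_position(embedding, position):
    """Add the positional pattern to a word embedding of the same length."""
    pattern = positional_encoding(position, len(embedding))
    return [e + p for e, p in zip(embedding, pattern)]


print(positional_encoding(0, 10))  # [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0]

# Hypothetical embedding for the word at position 2 ("to").
embedding = [0.1] * 10
print(add_position(embedding, 2))
```

Position 0 always gives alternating zeros and ones, since sin(0) = 0 and cos(0) = 1; later positions produce different patterns.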
The job of this is to build a deeper understanding of the sentence. This part is like the most important idea in the entire transformer architecture. It figures out which words in the sentence are most important to each other. In other words, it tries to understand context. For example, if you take the sentence, "The cat didn't cross the street because it was too tired," the word "it" could refer to the street or the cat. But when we read it, we know that it's referring to the cat. Now, before transformers, older AI models had a really hard time understanding context like this. That's because they mostly processed text one word at a time. By the time such a model reaches the word "it", the connection to earlier words could already be weak or blurry. The transformer approaches this differently. Instead of only looking at the previous word, the model can look at all words in the sentence at once. Think of this as like for each word, it would look at all the other words in the sentence before it and also itself and try to figure out which ones are the most important or relevant. So how does this work? Well, it does this by using three special vectors that are computed for each word. These are called the query, the key, and the value vectors, or QKV for short. Think of it like this. The query is like the current word asking the question, what information out there is relevant to me? And then the key K is like every word in the sentence raising its hand and saying, here's what I represent. Here's my label. See if I match your query. So K kind of offers itself up and then the value V is basically the value associated with each word. It says if you find my key relevant to your query, here's the actual substance. Here's the information or meaning that I can provide you. So Q asks, K provides a label to check against and V provides the actual content if there's a match. All right. Next, you might be wondering how are these QKV vectors even created.
Well, for the Q vector, you basically take your input embeddings for each word and then you multiply that by a matrix or a grid of numbers like this, which we can call WQ by convention. And then after multiplying it, it would spit out your Q vectors like this. You might be wondering how do we figure out the values of this matrix. Well, the nice thing is you don't. These values are learned during training, which I'll talk about later in the video. So that's how we get the Q vectors for the sentence. Next, to get the K vectors, it's the same thing, but this time we multiply our embeddings by another matrix. This time it's called WK to get our K vectors. And finally, it's the same thing for the V vectors. This is also calculated by multiplying our embeddings by another matrix called WV. And this would give us all our V vectors. All right, so we have all these Q, K, and V vectors. What do we do next? Well, here's the formula from the original transformer paper. In order to calculate attention, we simply plug it through this formula. So over here, we start by comparing the query vector for each word against the key vector for each word, including itself. You can imagine a table where every Q vector is multiplied by every K vector of each word. The result is a table of scores computed with dot products, which are a mathematical way of measuring similarity. I'll talk about what this means conceptually in a second, but first let's go through the entire formula. Now there's one really important thing we need to do. When the model is generating text, it can't look at future words, right? So if it's currently processing the word "to", it can only use information from "I", "go", and "to". Words like "work" or "by" are hidden. So that's why this component is called masked attention. The mask prevents the model from cheating and seeing the future. So going back to this grid, we actually need to apply a mask to blank out all the cells of future words.
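The projection and masking steps above can be sketched with plain Python lists. Everything numeric here is invented for illustration: three words with two-number embeddings, and small hand-picked `W_Q` and `W_K` matrices where a real model would use learned values. Masked cells are set to negative infinity so that the softmax applied later in the formula gives them zero weight.

```python
def matmul(a, b):
    """Multiply matrix a (n x m) by matrix b (m x p), both lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(matrix):
    return [list(row) for row in zip(*matrix)]


def masked_scores(q, k):
    """Dot product of every query with every key, hiding future words."""
    scores = matmul(q, transpose(k))
    for i in range(len(scores)):
        for j in range(len(scores[i])):
            if j > i:  # word j comes after word i, so word i may not see it
                scores[i][j] = float("-inf")
    return scores


# Hypothetical embeddings (one row per word) and projection matrices.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_Q = [[1.0, 0.0], [0.0, 1.0]]
W_K = [[0.0, 1.0], [1.0, 0.0]]

Q = matmul(X, W_Q)
K = matmul(X, W_K)
for row in masked_scores(Q, K):
    print(row)
# [0.0, -inf, -inf]
# [1.0, 0.0, -inf]
# [1.0, 1.0, 2.0]
```

Each row belongs to one word; the first word can only see itself, while the last word sees the whole fragment.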
So the current word can only look at itself and past words. All right. Now going back to the formula here, it turns out that just multiplying Q by K isn't very stable for training neural networks. So the authors also divided the scores by this square root term and added this softmax function, which keeps the scores in a manageable range and turns each row into weights that add up to one. So let's also apply the formula back to our table. And finally, let me explain what exactly is going on conceptually so you can understand this. You see here, each value is basically answering a simple question. How relevant is this word to the one I'm currently processing? Higher scores mean higher relevance. So just to make it more intuitive, we can simplify this table down to just dots where bigger dots mean higher relevance. And if we look at the word go, it's probably important to figure out who is going. So we can see that the word I is quite relevant. Similarly, for the word work, it's probably important to know who is doing the work, in which case the word I would have quite a strong relevance to the word work. At least conceptually, that's how you can think of it. Now going back to our original formula, finally we need to multiply everything by the V vector of each word. So let's apply this multiplication and then all these values are added together like this. The result is a new vector for each word. But this new vector isn't just the meaning of the word anymore. It's now enriched with context. It's blended with information from all the other words it decided to pay attention to in the sentence based on relevance. And that's the core idea behind attention. That's how large language models understand context in a sentence. If you're looking for an AI that can actually get work done for you, you need to check out GenSpark, the sponsor of this video. Think of it less like a basic chatbot and more like your dedicated AI employee. GenSpark is an all-in-one AI workspace from Silicon Valley that reached 200 million in annual revenue in just 11 months.
It integrates top-tier AI models and delivers finished ready-to-use results, presentations, websites, data analysis, automated outreach and more. Version 3 takes automation to a whole new level. My favorite new feature is GenSpark workflows. It can fully automate your repetitive daily tasks. You can set it to scan industry news every morning, then deliver a concise daily briefing straight to your inbox. Or you can have it monitor customer feedback around the clock and share the highlights with your team. It connects seamlessly to over 20 popular tools, including Google Workspace, Slack, Notion, Salesforce, and more. And then there's GenSpark Claw, your personal AI agent that runs entirely on your own dedicated cloud computer. Each user gets their own preconfigured always-on cloud server with Claw pre-installed. Your data stays private in an isolated cloud instance with full access control. Simply deploy Claw to your frequently used apps and afterwards you can talk directly to GenSpark Claw within your messaging app. Plus, GenSpark includes a complete productivity suite. Use AI slides for stunning presentations, AI sheets for automated data analysis, and AI docs to write reports or scripts. Here's the best part. Paid users get unlimited use of top models like Nano Banana, GPT Image, Gemini 3.1 Pro, GPT 5.4, Opus 4.6, and more. If you want an entire AI team working for you, click the link in the description below and discover GenSpark today. Now, all the calculations we just went over, from calculating the QKV vectors to calculating this dot product table and summing everything up, that's what happens in one attention head. Now going back to our diagram here, note that it says multi-head attention. So instead of just one attention head, it's actually better if we have multiple attention heads. Now why do we need multiple heads? You see, there could be many different ways to define context or relevance.
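Pulling together the steps recapped above, one attention head can be sketched as follows, using the paper's formula softmax(Q·Kᵀ / √d_k)·V with the mask on future words. The `q`, `k`, and `v` values are invented for illustration; in a real model they come from multiplying the embeddings by learned matrices.

```python
import math


def softmax(row):
    """Turn a row of scores into weights that add up to one."""
    top = max(row)
    exps = [math.exp(x - top) for x in row]  # exp(-inf) is 0.0 for masked cells
    total = sum(exps)
    return [e / total for e in exps]


def attention_head(q, k, v):
    """Masked scaled dot-product attention for one head."""
    d_k = len(k[0])
    outputs = []
    for i, query in enumerate(q):
        scores = []
        for j, key in enumerate(k):
            if j > i:
                scores.append(float("-inf"))  # future word: hidden by the mask
            else:
                dot = sum(a * b for a, b in zip(query, key))
                scores.append(dot / math.sqrt(d_k))
        weights = softmax(scores)
        # Weighted sum of the value vectors: the context-enriched vector.
        outputs.append([sum(w * value[c] for w, value in zip(weights, v))
                        for c in range(len(v[0]))])
    return outputs


# Hypothetical Q, K, V vectors for a three-word fragment.
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

for vector in attention_head(q, k, v):
    print(vector)
```

The first output equals the first value vector, because the first word can only attend to itself; later words get a blend of the value vectors of everything up to and including themselves.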
For example, for the word work, one attention head might be focusing on what verbs are related, in which case the word go would have a higher relevance. For another attention head, maybe the context could be focusing on what nouns or subjects are most relevant to work, in which case the word I would be more relevant. So, one attention head is often not enough because there could be multiple ways to define context between different words. And that's why in the original Transformer paper, they used multiple attention heads, as you can see from this diagram, each focusing on defining a different context. Each head performs its entire attention calculation independently. It gets its own QKV vectors. It goes through this dot product table, goes through this entire process, and sums up the values to give us its own results. You can think of this multi-head attention as like having different analysts looking at the sentence at once. Each head or analyst learns to focus on different types of relationships or different aspects of the sentence. For example, one head could look at subject-verb agreement. Another one might track pronoun references and so on. At least conceptually, that's how you can think about it. But at the end of the day, this is just a ton of math under the hood. So, going back to their original diagram, notice that after these multiple heads, there's this concat step. So, here's how it works. To simplify things for my illustration, let's just go with two attention heads. They each calculate their own attention vectors, each representing different ways to look at context. And at the end we get this. Next, the resulting vectors from each head would basically be concatenated or glued together into a single vector like this. So that's the concat step in this diagram from the original transformer paper. Next, this also needs to go through a linear step. In other words, the vectors are then multiplied by another matrix which we can call WO by convention.
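The concat and linear steps described above can be sketched as follows, using two heads as in the illustration. The head outputs and the `W_O` matrix are made-up values; `W_O` is set to the identity purely so the result is easy to check by eye, whereas a trained model would have learned values there.

```python
def concat_heads(head_outputs):
    """Glue the per-head vectors together, word by word."""
    n_words = len(head_outputs[0])
    combined = []
    for i in range(n_words):
        vector = []
        for head in head_outputs:
            vector.extend(head[i])
        combined.append(vector)
    return combined


def linear(vectors, weights):
    """Multiply each vector (a row) by the weight matrix."""
    return [[sum(x * weights[k][j] for k, x in enumerate(row))
             for j in range(len(weights[0]))] for row in vectors]


# Hypothetical outputs of two heads for a two-word fragment.
head_a = [[1.0, 2.0], [3.0, 4.0]]
head_b = [[5.0, 6.0], [7.0, 8.0]]

# Hypothetical output matrix W_O (identity here; learned in a real model).
W_O = [[1.0, 0.0, 0.0, 0.0],
       [0.0, 1.0, 0.0, 0.0],
       [0.0, 0.0, 1.0, 0.0],
       [0.0, 0.0, 0.0, 1.0]]

concatenated = concat_heads([head_a, head_b])
print(concatenated)               # [[1.0, 2.0, 5.0, 6.0], [3.0, 4.0, 7.0, 8.0]]
print(linear(concatenated, W_O))  # unchanged, because W_O is the identity here
```

Each head works on a slice of the full vector size, so gluing the slices back together restores one vector per word at the original length.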
So we take our concatenated vectors and we multiply them by this matrix in parallel. And finally we are done with this multi-head attention component, which is basically this component here. We've done so many calculations. We calculated the QKV vectors, computed this dot product table, concatenated the results, then ran it through another matrix. To simplify things, let's just wrap this entire process into a block like this. And after going through this, it spits out five vectors with the same dimensions as before. In our case, each vector should have 10 numbers. Now, if this is your first time learning about the transformer architecture, by now your head is probably overwhelmed with numbers and calculations. Just from this one component, we've gone through so many steps. It has almost completely transformed the original vectors into something else. And that's not good. You see, the original vectors contain information about the meaning and the position of each word. After going through so many steps in this masked multi-head attention component, we risk losing that information. So, how do we prevent this? Well, what the researchers designed next was also quite genius. Right after this multi-head attention block, they added this add and norm step. So, let's go over each of these really quickly. Let's go over the add step first. Like I said, after the original vectors go through this multi-head attention block, the resulting vectors can be very different. Well, this add step basically takes the original vectors and adds them back in with these new vectors. This way the original information of meaning and position is still kind of retained. You can think of this as like a shortcut for information to flow past or skip this multi-head attention step. In fact, this part is also called a skip connection or a residual connection. This is also one of the most important components of this transformer architecture.
Without it, the model would keep transforming its data so much that it could forget what the sentence meant in the first place. But with these skip connections, we can stack many more layers, build much deeper models, and still keep the original signal intact. Now, that was just the add step. Next, we also have this norm step, which stands for normalization. Remember that AI models prefer data to be in a standard range. So, this normalization step basically adjusts all the numbers in each of these vectors so that they have a mean of zero and a standard deviation of one. Normalization prevents the values from becoming too large or too small, which can mess up the training. So, it basically helps the transformer learn more smoothly. All right, that was a ton of math and a ton of steps, but that covers these parts of the decoder. We've gone through this multi-head attention block, then this add and normalization step. We are almost done. Next, we plug the results into this feed-forward neural network, which looks like this. If you're not familiar with a neural network, think of this as like a series of dials and knobs which determine how much data flows through to the next layer. And specifically in this feed-forward network, it contains an input layer which has the same number of dials and knobs as our vector size, which in our case is 10 numbers. And then it contains a middle layer which has a lot more dials and knobs, often two to four times the input layer. And then afterwards all of this is shrunk down to an output layer which is the same size as our input layer, which in our case is 10 outputs. So this outputs some new vectors of the same size as our input vectors. Now you might be wondering why on earth would we do this? Why take our vectors and then expand them through this neural network and then contract them back down again to the same size? This seems like extra work, right?
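The add and norm step and the expand-then-contract feed-forward network described above can be sketched like this. Several things are assumptions for illustration: the weights are invented, the middle layer uses a ReLU activation as in the original paper, and the learned scale and shift that layer normalization normally includes are left out.

```python
import math


def add_and_norm(inputs, sublayer_outputs, eps=1e-5):
    """Skip connection plus normalization, applied to each word's vector."""
    results = []
    for original, new in zip(inputs, sublayer_outputs):
        summed = [a + b for a, b in zip(original, new)]  # the "add" step
        mean = sum(summed) / len(summed)
        variance = sum((s - mean) ** 2 for s in summed) / len(summed)
        results.append([(s - mean) / math.sqrt(variance + eps) for s in summed])
    return results


def feed_forward(vector, w1, b1, w2, b2):
    """Expand to a wider middle layer (with ReLU), then shrink back down."""
    hidden = [max(0.0, sum(x * w1[i][j] for i, x in enumerate(vector)) + b1[j])
              for j in range(len(b1))]
    return [sum(h * w2[j][k] for j, h in enumerate(hidden)) + b2[k]
            for k in range(len(b2))]


# Hypothetical numbers: vector size 2, middle layer size 4.
x = [[1.0, 3.0]]
attention_out = [[0.0, 2.0]]
normed = add_and_norm(x, attention_out)
print(normed)  # roughly [[-1.0, 1.0]]: mean 0, standard deviation about 1

W1 = [[0.5, -0.5, 1.0, 0.0], [0.0, 1.0, -1.0, 0.5]]
B1 = [0.0, 0.0, 0.0, 0.0]
W2 = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [1.0, -1.0]]
B2 = [0.0, 0.0]
print(feed_forward(normed[0], W1, B1, W2, B2))  # same length as the input
```

The small `eps` only guards against dividing by zero when every number in a vector happens to be identical.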
Well, conceptually you can think of this step as attempting to capture deeper features or meaning from the sentence. After all, a neural network is great for capturing underlying patterns. So, think of this as like giving each word some extra thinking time or processing capacity based on the context it just gathered via attention. Now, after plugging the resulting vectors for each word through this neural network in parallel, we get some new vectors which again might be a bit different from our inputs. So we also have this add and normalization step. So like before, we just add our inputs back into the outputs via a residual connection so that some information from the input is preserved. And then next we also normalize this to make the data neat and tidy. All right. Finally, that sums up one decoder block of the transformer model. Let's quickly summarize what we've gone through so far. We first convert our words into embeddings which contain the meaning of each word. Then we apply positional encoding to also add in the order of the words. Then we plug them through this multi-head attention block so that it can understand the context of the entire sentence. Then we add the inputs back in with the results to preserve some of the original information. Plus we normalize it to make the data neat and tidy. Then this goes through a neural network which gives it more time to analyze the context of everything. And then again we add the input back in with the outputs and then we normalize it again. So that's just one decoder block. But as you can see from this diagram, it turns out that you can link multiple decoder blocks together for even better results. You see, the point of doing multiple decoder blocks is you can think of each block as refining the meaning or representation of your input sentence a little more. For example, for your first decoder block, it might detect basic meanings and relationships between words.
But for the second block, it might be able to detect some deeper meaning within your sentence. And then for the third block, it might understand even deeper relationships and so on and so forth. So you kind of need more of these decoder blocks to build a full high-level understanding of your entire sentence. Now after going through all the decoder blocks, the resulting vectors are then passed onto this linear layer followed by this softmax function. What on earth are these? Think of this as like the final boss of the game. So this is essentially another neural network where the number of inputs is your vector size in our case 10 numbers and then the number of outputs corresponds to the total vocabulary size of the AI model. In other words, how many total words and subwords does it have in its vocabulary? The first node should be the first word or subword of its vocabulary. In our case, the letter A. And then the second word would be the next word in alphabetical order all the way down to the last word, which in our case would be zygote. The goal of plugging our vectors through this step is to produce a probability distribution over all the possible words in its vocabulary. So the output of each neuron in this neural network is basically how likely that word is to be the next word of your sentence. And this softmax function at the end here makes sure that all these probabilities add up to one. Next, the model just randomly samples from this distribution based on these probabilities to select the next word in the sequence. So going back to our example, I go to work by here's what happens during inference. In other words, when you're using the model to generate an answer, well, we only care about predicting the next word. 
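The final linear layer, softmax, and sampling step described above can be sketched with a four-word toy vocabulary. The vocabulary, the `W_OUT` matrix, and the input vector are all invented for illustration; a real model has tens of thousands of tokens and learned weights.

```python
import math
import random

# Hypothetical tiny vocabulary in alphabetical order.
VOCAB = ["a", "bus", "car", "zygote"]

# Hypothetical output matrix: one row per input number, one column per word.
W_OUT = [[0.1, 2.0, 1.0, -1.0],
         [0.0, 1.0, 0.5, -0.5]]


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def next_word_distribution(final_vector, w_out):
    """Linear layer (one score per vocabulary word) followed by softmax."""
    scores = [sum(x * w_out[i][j] for i, x in enumerate(final_vector))
              for j in range(len(VOCAB))]
    return softmax(scores)


def sample_word(probabilities, rng):
    """Pick one word at random, weighted by its probability."""
    return rng.choices(VOCAB, weights=probabilities, k=1)[0]


probabilities = next_word_distribution([1.0, 0.5], W_OUT)
for word, p in zip(VOCAB, probabilities):
    print(f"{word}: {p:.3f}")

rng = random.Random(0)  # seeded only so the run is repeatable
print(sample_word(probabilities, rng))
```

Because the choice is sampled rather than always taking the top word, a lower-probability word such as car can occasionally be picked, which is one reason the same prompt can give different answers.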
So even though the transformer produces an output embedding for every word in the sequence, we only take the embedding of the last word, in our example, the word by, and pass that through the final linear layer and softmax to get the probability distribution for the next word. Let's say it samples from this and happens to select the word with the highest probability, bus. Well, that's the word that it outputs, but its response might not be complete yet. So, the entire process, which I've explained in this video, basically loops again and again, outputting word after word until the AI model completes its response. And that's essentially how large language models like ChatGPT work. Now, so far, what I've gone over is what happens when you use the model, which is called inference. Next, let's also go over what happens when we train a model. After all, how does it know which values to use for all these calculations in the transformer? Well, let's go back to the very beginning. Let's assume we're training a model from scratch, and it has no prior knowledge about the English language. Well, to train it, we basically feed it a ton of different examples of the English language, like all the data from the internet. For one round of training, just as a quick example, we can feed it the sentence I go to work by, and it needs to guess the most probable word, which is bus. Now, because it has no prior knowledge about the language, it doesn't actually know the meaning of each word. So actually the initial embedding vectors, which are supposed to hold the meaning of each word, are just random values. It then gets plugged through this multi-head attention like we've gone over before. And remember, the first step is we need to calculate the QKV vectors by multiplying our input embeddings by these matrices called WQ, WK, and WV. Well, at the start, the values of these matrices are also random. It doesn't know the best configuration of values to use yet.
Now, after going through this multi-head attention, we also need to multiply the results by this WO matrix. And at the start, this matrix is also just random values. Next, after add and normalization, it goes through this feed-forward neural network. And the values of these dials and knobs are also random at the start. And then after another add and normalization, and then after going through multiple decoder blocks where the values of all of these are also random, we end up at the final linear layer. And the values of these dials and knobs are also random at the start. So before you train it, you can think of this entire transformer as just a giant pile of random numbers. Now at the end, here's what happens. If somehow it outputs the word bus, which is the correct answer, well, it might be that this configuration of values is actually pretty good. We don't actually need to change anything. But chances are at the start, it's probably going to output a random word like banana. Because that's not the correct answer, it incurs an error or a loss. But knowing that it was wrong isn't enough. We also need to figure out what parts of the model caused this error. And this is where something called backpropagation comes in. Backpropagation works by sending the error backward through the entire network. It looks at every random value that was used in the calculation and determines how much each one contributed to the mistake. And once we know that, we use a method called gradient descent to update these values. Gradient descent simply nudges each weight slightly in the direction that reduces this error. Not a huge change, just a tiny adjustment. Then we repeat the whole process again for the next round of training. If it gets the word right, we don't need to change the values of the model. But when it gets a word wrong, then it incurs a loss and the model's values are adjusted slightly to fix the mistake. And this loop basically happens again and again.
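The loss and gradient descent loop described above can be sketched on a deliberately tiny problem. The assumptions are significant: the only trainable values here are three scores (one per word in a made-up vocabulary), the loss is the standard cross-entropy, and the learning rate is picked arbitrarily. In a real transformer, backpropagation carries the same kind of gradient back through every matrix in the network.

```python
import math

# Hypothetical three-word vocabulary; the correct next word is "bus".
VOCAB = ["banana", "bus", "car"]
TARGET = VOCAB.index("bus")


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def loss_and_gradient(scores, target):
    """Cross-entropy loss and its gradient with respect to each score."""
    probabilities = softmax(scores)
    loss = -math.log(probabilities[target])
    gradient = [p - (1.0 if i == target else 0.0)
                for i, p in enumerate(probabilities)]
    return loss, gradient


def train(scores, target, learning_rate=0.5, steps=20):
    """Repeatedly nudge each value a little in the direction that lowers the loss."""
    losses = []
    for _ in range(steps):
        loss, gradient = loss_and_gradient(scores, target)
        losses.append(loss)
        scores = [s - learning_rate * g for s, g in zip(scores, gradient)]
    return scores, losses


# Start with values that wrongly favor "banana".
final_scores, losses = train([1.0, 0.0, 0.0], TARGET)
print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.3f}")
print(VOCAB[final_scores.index(max(final_scores))])  # bus
```

One nuance this sketch makes visible: with a loss based on probabilities, the values are still nudged a little even when the top guess is already right, until the model is confident in the correct word.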
We feed it sequence after sequence. We train it millions or even billions of times using different samples of natural language. And slowly but surely, the model's dials and knobs get better and better at correctly predicting the next word in the sentence. The model begins to learn that certain words tend to follow other words. It starts to learn grammar, then sentence structure and meaning, and then even more subtle relationships across long pieces of text. Eventually, after enough training, these random numbers turn into a system that can actually correctly generate natural language. And that's how a transformer learns. All right, we've gone through a ton of math, a ton of calculations, and a ton of steps, but in essence, that is how the large language models that you and I use today work. It's all based on the transformer model from this legendary paper, Attention Is All You Need. Without this breakthrough, ChatGPT or any other large language model would not exist today. And the reason that this paper changed everything comes down to one key idea: this attention mechanism, which allows the model to look at every word in the sentence at the same time and figure out context and relationships between words. Previous language processing models could not do this. So it turns out that indeed attention is all you need. I hope after watching this video, you have a deeper understanding of how large language models work. It's a ton of math, but hopefully I've made it easy enough for you to understand conceptually and visually. This video actually took way longer than I expected. So, if you enjoyed it, please do share with as many people as possible to boost the algo. Shout out to the Google team for making such an insane breakthrough that changed the world. As always, I will be on the lookout for the top AI news and tools to share with you. So, if you enjoyed this video, remember to like, share, subscribe, and stay tuned for more content.
Also, there's just so much happening in the world of AI every week, I can't possibly cover everything on my YouTube channel. So, to really stay up to date with all that's going on in AI, be sure to subscribe to my free weekly newsletter. The link to that will be in the description below. Thanks for watching, and I'll see you in the next one.

Original Description

How GPT and other large language models (LLMs) work. Transformers deep dive. #ai #llm #machinelearning #datascience #agi

Thanks to our sponsor Genspark. Try it for free https://bit.ly/4uM3PLS

Attention is all you need https://arxiv.org/html/1706.03762v7

0:00 Intro
0:33 The transformer model
1:30 Predicting the next word
2:30 Tokenization
5:06 Representing meaning
7:17 Positional encoding
9:17 Attention head
14:49 Genspark
16:35 Multiple heads
19:30 Add and norm
21:45 Feed forward neural net
24:08 Multiple decoder blocks
24:50 Final layer
27:03 Training the model

Newsletter: https://aisearch.substack.com/
Find AI tools & jobs: https://ai-search.io/
Support: https://ko-fi.com/aisearch

Here's my equipment, in case you're wondering:
Lenovo Thinkbook: https://amzn.to/4jWeKwH
Dell Precision 5690: https://www.dell.com/en-us/dt/ai-technologies/index.htm?utm_source=AISearchTools&utm_medium=youtube&utm_campaign=precisionai#tab0=0
GPU: Nvidia RTX 5000 Ada https://nvda.ws/3zfqGqS
Mic: Shure SM7B https://amzn.to/3DErjt1
Audio interface: Scarlett Solo https://amzn.to/3qELMeu

This video provides a comprehensive introduction to the transformer architecture and its role in modern AI models like GPT. It covers the key components of the transformer model, including tokenization, input embedding, positional encoding, and multi-head attention. By the end of this video, viewers will understand how LLMs work and be able to explain the attention mechanism and its importance in LLMs.

Key Takeaways

Feed a sentence fragment into the transformer model
Break down the sentence fragment into data that flows through the transformer
Spit out the next most probable word
Repeat the process until the transformer finishes its response
Tokenize the text into numbers that the AI model understands
Apply the positional encoding formula to add each word's position to its vector
Use masked multi-head attention to build a deeper understanding of the sentence
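The first four takeaways describe a loop: feed in a fragment, get the most probable next word, append it, and repeat. A minimal Python sketch of that loop is below. It assumes a hypothetical lookup table (`NEXT_WORD_PROBS`) standing in for the transformer, a crude whitespace split standing in for the tokenizer, and a made-up `<end>` marker to signal the end of the response. A real model would condition on the whole sequence rather than only the last word, would work on sub-word tokens, and often samples from the probabilities instead of always taking the top one.

```python
END = "<end>"  # hypothetical end-of-response marker

# Hypothetical stand-in for a trained model: last word -> next-word probabilities.
NEXT_WORD_PROBS = {
    "i": {"like": 0.7, "drink": 0.3},
    "like": {"green": 0.6, "tea": 0.4},
    "drink": {"tea": 1.0},
    "green": {"tea": 0.9, END: 0.1},
    "tea": {END: 1.0},
}

def toy_model(tokens):
    """Return next-token probabilities given the tokens so far."""
    last = tokens[-1] if tokens else None
    return NEXT_WORD_PROBS.get(last, {END: 1.0})

def generate(prompt, max_new_tokens=10):
    """Greedy decoding: repeatedly append the most probable next word."""
    tokens = prompt.lower().split()  # crude whitespace "tokenizer"
    for _ in range(max_new_tokens):
        probs = toy_model(tokens)
        next_token = max(probs, key=probs.get)  # pick the most probable word
        if next_token == END:
            break  # the model has finished its response
        tokens.append(next_token)  # feed the word back in and repeat
    return " ".join(tokens)

print(generate("I"))
```

The `max_new_tokens` cap is a design choice for the sketch: it guarantees the loop stops even if the stand-in model never produces the end marker.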

💡 The attention mechanism is a crucial component of the transformer architecture, allowing the model to focus on different parts of the input sentence and capture complex relationships between words.
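To make the attention idea a little more concrete, here is a minimal single-head sketch in plain Python of scaled dot-product attention with a causal mask, meaning each position only looks at itself and earlier positions. It is a simplification under stated assumptions: the token vectors are made up, and the same vectors are reused as queries, keys, and values, whereas a trained model would first multiply them by learned projection matrices. The function names are hypothetical.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def causal_attention(queries, keys, values):
    """Scaled dot-product attention where position i only sees positions <= i."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for i, q in enumerate(queries):
        # Masking: score only the keys at or before position i.
        scores = [dot(q, keys[j]) / math.sqrt(d_k) for j in range(i + 1)]
        weights = softmax(scores)
        # Future positions get zero weight.
        weights += [0.0] * (len(keys) - len(weights))
        # Output is the weighted average of the value vectors.
        out = [sum(w * v[c] for w, v in zip(weights, values))
               for c in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hypothetical token vectors
outputs, weights = causal_attention(x, x, x)
for row in weights:
    print([round(w, 3) for w in row])
```

Skipping the future positions here has the same effect as setting their scores to negative infinity before the softmax. Multi-head attention runs several such heads in parallel, each with its own learned projections, concatenates their outputs, and multiplies the result by the W_O matrix mentioned in the transcript.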

Chapters (14)

0:00 Intro

0:33 The transformer model

1:30 Predicting the next word

2:30 Tokenization

5:06 Representing meaning

7:17 Positional encoding

9:17 Attention head

14:49 Genspark

16:35 Multiple heads

19:30 Add and norm

21:45 Feed forward neural net

24:08 Multiple decoder blocks

24:50 Final layer

27:03 Training the model
