How Attention Got So Efficient [GQA/MLA/DSA]
Attention mechanisms have been a key driver behind the recent AI boom. But what happened after the multi-head attention introduced in the seminal 2017 Transformer paper?
In this video, we break down several core ideas that make attention efficient and scalable.
00:00 Introduction
00:35 Tokenization
01:21 Attention (vector form)
04:26 Attention (matrix form)
07:07 Key-Value caching
09:42 Multi-Query Attention (MQA)
11:03 Grouped Query Attention (GQA)
13:32 Multi-head Latent Attention (MLA)
15:37 MLA at inference time
18:15 Applying RoPE to MLA (decoupled RoPE)
22:18 DeepSeek Sparse Attention (DSA)
23:57 Quantization and rotation in DSA
27:44 DSA training
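The chapters above walk from vanilla multi-head attention to its cheaper variants. As a rough companion, here is a minimal NumPy sketch (not the video's code; all names and shapes are illustrative) of one decode step of Grouped Query Attention with a KV cache. MHA and MQA fall out as the special cases `n_kv_heads == n_heads` and `n_kv_heads == 1`.

```python
# Illustrative sketch of GQA with a KV cache, in plain NumPy.
# MHA: n_kv_heads == n_heads.  MQA: n_kv_heads == 1.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gqa_step(q, k_new, v_new, k_cache, v_cache):
    """One decode step of grouped-query attention.

    q:       (n_heads, d)        queries for the current token
    k_new:   (n_kv_heads, d)     keys for the current token
    v_new:   (n_kv_heads, d)     values for the current token
    k_cache: (n_kv_heads, t, d)  cached keys for previous tokens
    v_cache: (n_kv_heads, t, d)  cached values

    Returns the attention output (n_heads, d) and the updated caches.
    """
    n_heads, d = q.shape
    n_kv_heads = k_new.shape[0]
    assert n_heads % n_kv_heads == 0
    group = n_heads // n_kv_heads  # query heads sharing one KV head

    # Append this step's K/V to the cache (the "Key-Value caching" chapter).
    k_cache = np.concatenate([k_cache, k_new[:, None, :]], axis=1)
    v_cache = np.concatenate([v_cache, v_new[:, None, :]], axis=1)

    out = np.empty_like(q)
    for h in range(n_heads):
        kv = h // group                            # which shared KV head
        scores = k_cache[kv] @ q[h] / np.sqrt(d)   # (t+1,)
        out[h] = softmax(scores) @ v_cache[kv]     # (d,)
    return out, k_cache, v_cache
```

Sharing each KV head across `n_heads // n_kv_heads` query heads shrinks the KV cache by that same factor, which is the main inference-time memory saving MQA and GQA buy; MLA pushes further by caching a compressed latent instead of full K/V.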
DeepCamp AI