Transformers explained | The architecture behind LLMs

AI Coffee Break with Letitia · Beginner ·🧠 Large Language Models ·2y ago

Key Takeaways

The video explains the transformer architecture, which powers recent AI breakthroughs, including how to structure inputs, attention mechanisms, positional embeddings, and residual connections. It also compares transformers to Recurrent Neural Networks (RNNs) and discusses their differences.

Full Transcript

[Music]

The Transformer architecture powers most of the impressive recent breakthroughs in AI. The Transformer is behind systems like ChatGPT, Vision Transformers, image generators, AlphaFold 2 for predicting protein folding, and many others. So if you're interested to know about the Transformer, this is the right video for you. We already made a video explaining the Transformer, but it was one of our first videos and I can do it so much better now. Also, there we did not spend enough time explaining self-attention, which we will do better this time. So here we go with the remastered explanation of the Transformer architecture.

Transformers can work with any kind of data, and by that I mean text, images, speech, and so on, as long as we represent the data as a set of vectors. However, it is not always straightforward to do this, as, for example, text does not naturally come as a sequence of vectors. That means before we can look at the inner workings of the Transformers, we need to understand how to represent inputs as vectors. So let's look at two examples: text and images.

For text, we do the so-called tokenization, where we take a sequence of words and decompose it with the tokenizer into subwords from a predefined vocabulary, for example by following white spaces and breaking down compound words into their components. If you want to know more about tokenization, check out our previous video on this. Then the subwords all get assigned a unique vector. The vectors could be initialized randomly or, even better, with word embeddings. Word embeddings work on the idea that distances between embeddings represent word similarity: a word is defined by the company it keeps, and words that are semantically more similar are initialized with vectors close in the high-dimensional vector space. You can easily download such word embeddings, as they are precomputed by counting how often words appear next to other words in text on the Internet or other large corpora, and neural networks learn to assign to words
similar embeddings if they both have the same neighbors. You can learn more about word embeddings in our previous video.

Now that we know how to represent text, let's think how to represent images. Images are more naturally represented as vectors, or at least matrices, which are high-dimensional vectors. An image is composed of three matrices, where each matrix tells us, for the red, green, and blue channels, what the light intensity of that color is in the corresponding pixel. One could take the rows of each matrix and write them one after the other to get vectors, but this would result in a lot of vectors, and Transformers are much, much slower with many vectors, as will become clearer later in this video. So what people do instead is to divide images into patches and apply to each patch the same linear neural network layer, which trains together with the Transformer to find the right weights that sensibly change the dimensionality of p × p patches to a d × 1 matrix, which is a d-dimensional vector.

To summarize, the prerequisite of Transformers is that, whatever the input, we must first decide on a way to represent this input with vectors. All neural networks, including the Transformer, process these vector representations into better and better representations with each layer, until the solution for the task is obvious in the final layer, or linearly separable if we want to use jargon. But compared to other neural networks, the Transformer does this processing in a specific way, as follows.

Let's suppose we have an input sequence, here of text, and the task is, for example, to predict which token comes next, or whether the sentence expresses a positive or a negative sentiment, or any other kind of classification task we can think of. We take our input sequence represented as vectors with word embeddings, and one Transformer takes in this sequence, updates the vectors, and outputs as many vectors as it had in the input, and preserves the dimensionality of the vectors. But to do something meaningful
with these Transformers, we need to add special tokens, for example a classification token at the end of the sequence. This special token goes through the Transformer in the same way as the other tokens do, but it is special because to its output representation we usually append a linear classification layer that classifies, from a list of words called the vocabulary, which tokens come next; and if we are trying to classify, it assigns probabilities to the classes from the classification task. Note that this is a simple classification layer, or mathematically it is just a matrix multiplication that happens here, which geometrically corresponds to drawing a separation line in the high-dimensional space the word vectors live in. In other words, the solution here should already be obvious, as prepared by the Transformer, such that we can tell fitting classes from unfitting classes just by drawing a line.

During training, the Transformer processes the input, gives output vectors, and we run the classification layer on the special tokens and get the assigned class. We compare the assigned prediction to the expected one from the data set, compute the loss value, backpropagate the loss value, and update the internal parameters of the classification layer and the Transformer layer to values that minimize the loss, thus giving better classification results next time.

Okay, but what happens in this mysterious box we call Transformer? Well, it is composed of multiple Transformer layers. One Transformer layer contains two things. One of them is not so much: it is just the same feed-forward network, also called MLP sublayer, acting on every input token. Such an MLP sublayer takes the input representation, applies a dense layer with GELU activation that doubles the dimension, then another dense layer with GELU activation scales down the dimension again. And it is the same MLP layer, with the exact same weights, that we apply to each input token embedding.

Okay, let's see what we have: a bunch of MLP layers
processing each token independently of the others. This is suboptimal, because see this word representation? Well, it does not even know that there are other words next to it. And it's even worse for the classification token, which should aggregate and summarize the sentence information if we are to use it for classification, but it has no connection to the sentence tokens at all. While the Transformer layer saves a lot of compute time because all of these MLP layers compute their output in parallel, we need a way to communicate information in the context of the sequence, so that the word "works" is informed of the existence and semantics of its neighbor "attention", for example.

Luckily, this is what the self-attention layer is for: to let information flow within the context of the sequence from one embedding to its neighbors. In a nutshell, the attention layer computes how much of the representation of each of the neighbors we need to add to compute a new token representation, which is the outcome of the self-attention layer.

By the way, we will be using attention and self-attention here synonymously, but if you're wondering what the difference between them is: self-attention is when we compute importances of the elements of a sequence with respect to the elements in the same sequence. Attention is more general, because we compute the importance of the elements of one sequence to the elements in another sequence. For example, you can see here the self-attention of "it" on the left and the attention of "in" on the right; "in" is an element from a sequence different to the one above it.

Now, how does the attention layer compute these importances exactly? Well, it is a bit complicated, in the sense that it is a pile of linear algebra that uses the loss function to adapt the entries of weight matrices during training to make them work well in inference. But neural networks are never anything other than huge piles of linear algebra, so strap yourself onto your chair, because we will try to explain the
attention computation as clearly as possible.

Self-attention does the following: it takes the input vectors and applies three different linear transformations to produce the key, query, and value vectors. This means that for the queries, it multiplies the query matrix with the input vector, and this results in a query vector. This query matrix is randomly initialized before training, and gradient descent adapts its values during backpropagation to make them the right ones that reduce the loss on the training data. And the same query matrix applies to all inputs to get query vectors for all of them. As for the keys, we simply have another matrix, called the key matrix, which is initialized differently from the query matrix and which also multiplies with the input vector to produce a key vector. And to produce the value vectors, we multiply a value matrix with the input. So in summary, we have three different matrices, all initialized randomly, that linearly transform the input in different ways.

Now, what is self-attention further doing to these different vectors it has just produced? Let's suppose we are calculating the attention for the input token "works" to all other tokens in the sequence, including itself. It works the same for the other tokens too, but we will just show it for "works". First, we compute the scalar product between the query vector of the token of interest and the keys of every other vector. Then we divide by the square root of the dimension of the key vectors, so the square root of three. Then we apply the softmax over all these values. We can interpret these softmax scores to be measuring how important each token in the input is for the token "works". So the token "attention" is 133% important for "works", "works" is 78% important to itself, and the CLS token is 7% important.

Now it gets interesting: to get the final representation of "works", we take the sum over all value vectors, weighted or multiplied with the softmax result. So this is what we meant before by saying the attention combines the
representation of the input, the value vectors, weighted by the importance scores.

Empirically, it turns out that one set of attention values in each layer is not enough to capture the complexity of relationships in our data. Think of it this way: the attention importance scores define a graph that tells us, for each token, how important it is to all other tokens. But one graph is not enough to model all existing relationships, in the same way that you can define your social network graph based on how many friends you have, but you can also think of other types of connections, like with whom of these people you work together, or with whom you share the same city. There are multiple relationships and importances to be modeled given a set of tokens.

Therefore, the idea of multi-head attention is to let the network learn three or eight or 12 attention patterns instead of just one. So we do not use just one set of query, key, and value matrices, but three of them, and each set is called an attention head. As we initialize the key, query, and value matrices all randomly, they will start with different values, and in their training process they will produce different query vectors, and they will usually capture different patterns that they detect in your data. One head might focus on one pattern, such as coreference resolution, and another one on identifying the subject in sentences.

If you wonder how many attention heads you need, the answer is that you are free to choose: it is a hyperparameter. The more the better, but often you cannot use very many, as you quickly run out of GPU memory, especially because attention scales quadratically in time and memory. So if you process a sequence of double the size, you will need four times as much time to run and four times as much memory. It is an active area of research to approximate attention with other operations that scale linearly instead of quadratically, or to replace it altogether with other operations that do the job of mixing information between
tokens. If you're interested in this topic, please watch our previous videos on this. But in a nutshell, it's fake news that attention is all you need: you can replace it with other token-mixing procedures too.

Now let's recap what we have so far and what we still need for a full Transformer. We have our input embeddings. They go through the self-attention layer, which gives us representations that are informed about the fellow embeddings in the sequence. Then they go through the MLP layer, all in parallel. But so far, this Transformer layer behaves as if our input sequence weren't a sequence but a set. If we were to reorder the tokens, the Transformer would not change its outputs: the results of the attention would still be the same, as all operations there are commutative (please check to convince yourself), and the feed-forward network acts independently of all the other tokens anyway. This is not great, as so far the Transformer gives us the same output independently of the order of the input. Because images, text, and sound are sequences where order matters, we need a way to tell the Transformer layer that this is the first token in the sequence, and this is the second, and so on.

And this is what positional embeddings do. They are vectors that uniquely identify each position, and we add these vectors to the input embeddings. They work like house numbers that identify the specific position of each house in a street address. How do we come up with the values for position embeddings? Well, with certain rules, or we can simply learn these vectors as well during the training process of the Transformer. If you want more details about positional embeddings and the numerous ways to implement them, you can watch one of our previous videos on this.

Okay, now that we got this figured out, there is one more thing missing and the architecture is complete. The missing ingredient is the residual connections, which after the self-attention layer add the input of the self-attention layer to its output. A normalization
operation reduces the values back again to the 0-to-1 range after the sum, because otherwise, after each residual connection, with each layer the values would get larger and larger and larger. And the same thing of adding the input back to the output happens around the MLP layer, here in green.

The intuition behind residual connections is to make the learning job easier for each layer. To arrive at the solution, the network needs to transform the inputs, but since it is allowed to keep the input through the residual connections, each layer is forced to learn not the whole transformation but just the difference it needs to add to arrive at the output. It's kind of breaking down the problem a bit. And residual connections become even more important because, as usual with deep neural networks, we do not just use one Transformer layer but append another Transformer layer to the output of the previous one, and another layer, and so on. How many? It's a hyperparameter, and of course we are limited by the amount of memory our GPUs have. The more the better, because the Transformer gets more attempts to break down the problem and arrive at the solution, which is easier than getting the solution in one go with just one layer. And residual connections also help when training such a long stack of layers, because during backpropagation gradient signals can get lost by propagating from the end to the beginning, very much like a whisper in the telephone game.

Now, this is most of what you need to know about Transformer basics, since you now know the principles by which they predict the next word, like GPT, or classify the whole sequence. Another training procedure we left for the end is the so-called masked language modeling procedure, used for Transformers of the BERT family. There we have a classifier token that we use to classify whether two sentences belong together or not. But there's more: 15% of tokens in the sequence are chosen randomly, masked out, and replaced with a special mask token.
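The masking step just described can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions: the helper name `mask_tokens`, the `[MASK]` spelling, and the whitespace-split toy sentence are illustrative choices, not details given in the video, and real BERT-style pipelines work on subword tokens and include refinements that are left out here.

```python
import random

MASK = "[MASK]"  # assumed spelling of the special mask token


def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Replace roughly `mask_rate` of the tokens with MASK.

    Returns the masked sequence and a dict {position: original token},
    which is what a masked-language-modeling head would be trained to predict.
    """
    rng = random.Random(seed)
    n_mask = max(1, round(mask_rate * len(tokens)))  # mask at least one token
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]
        masked[pos] = MASK
    return masked, targets


# Toy input: whitespace tokens instead of real subwords (an assumption).
tokens = "the transformer reads every token of the sentence in parallel".split()
masked, targets = mask_tokens(tokens)
print(masked)
print(targets)
```

The `targets` dictionary plays the role of the expected output: during training, the loss would compare the classification head's prediction at each masked position with the original token stored there.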
The training objective of BERT is then to adapt its weights such that a linear mask classification head can choose from the vocabulary the word that we masked out in the input. This masked language modeling procedure is great to train classification Transformers, or so-called Transformer encoders. Predicting the next word is something for GPT-like models, so for Transformer decoders.

And if you're wondering what the difference between Transformers and recurrent neural networks is, then let's look at this simplified view. While in Transformers we use attention to communicate information in parallel from each input token to every other token, RNNs process the first token and use that output as input, together with the second token, to process the second token. Then the output of the second token goes into the processing of the third token, and so on. And you see the problem: we need to wait for the second token to finish processing so we can start computing the third token. This means that RNNs train slower than Transformers. So when Transformers revolutionized NLP, it's because their architecture allowed them to read the entire internet, because they could process tokens in parallel, while with RNNs nobody got to train on the whole internet because it took so much time.

We hope you like this little introduction to the Transformer architecture and that you can impress your friends and family that you now know how ChatGPT works internally. And there are countless other great resources on this topic, such as The Illustrated Transformer blog post by Jay Alammar and the Transformer series of Luis Serrano. Also, I hope that my Patreon supporters who voted for the Transformer explained video as a topic for the next video will be happy, as I finally managed to finish this video. I really thank them for their patience. If you like this video, do not forget to like and subscribe, and we hope to see you next time. Okay, [Music] [Applause] bye.
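The attention computation walked through in the transcript (scalar products between one query and all keys, division by the square root of the key dimension, softmax, then a weighted sum of the value vectors) can be sketched as follows. This is a minimal sketch in plain Python: the three-dimensional key, query, and value vectors are made-up numbers chosen for illustration, not the values shown in the video, and the function names are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Attention for a single token: returns (importance weights, new representation)."""
    d_k = len(query)
    # Scalar product of the query with every key, scaled by sqrt(d_k).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in keys]
    weights = softmax(scores)
    # Weighted sum of the value vectors.
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output


# Made-up vectors for a three-token sequence (assumption, not from the video).
keys = [[1.0, 0.0, 1.0], [0.0, 2.0, 0.0], [1.0, 1.0, 0.0]]
values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
query_works = [0.0, 2.0, 0.0]  # query vector of the token of interest

weights, new_works = attend(query_works, keys, values)
print(weights)
print(new_works)
```

Reordering `keys` and `values` together leaves `new_works` unchanged up to floating-point rounding, which is the order-blindness that the positional embeddings in the transcript are meant to address.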

Original Description

All you need to know about the transformer architecture: How to structure the inputs, attention (Queries, Keys, Values), positional embeddings, residual connections. Bonus: an overview of the difference between Recurrent Neural Networks (RNNs) and transformers.

9:19 Order of multiplication should be the opposite: x1(vector) * Wq(matrix) = q1(vector). Otherwise we do not get the 1x3 dimensionality at the end. Sorry for messing up the animation!

Check this out for a super cool transformer visualisation! 👏 https://poloclub.github.io/transformer-explainer/

➡️ AI Coffee Break Merch! 🛍️ https://aicoffeebreak.creator-spring.com/

Outline:
00:00 Transformers explained
00:47 Text inputs
02:29 Image inputs
03:57 Next word prediction / Classification
06:08 The transformer layer: 1. MLP sublayer
06:47 2. Attention explained
07:57 Attention vs. self-attention
08:35 Queries, Keys, Values
09:19 Order of multiplication should be the opposite: x1(vector) * Wq(matrix) = q1(vector).
11:26 Multi-head attention
13:04 Attention scales quadratically
13:53 Positional embeddings
15:11 Residual connections and Normalization Layers
17:09 Masked Language Modelling
17:59 Difference to RNNs

Thanks to our Patrons who support us in Tier 2, 3, 4: 🙏 Dres. Trost GbR, Siltax, Vignesh Valliappan, @Mutual_Information , Kshitij

Our old Transformer explained 📺 video: https://youtu.be/FWFA4DGuzSc
📺 Tokenization explained: https://youtu.be/D8j1c4NJRfo
📺 Word embeddings: https://youtu.be/YkK5IKgxp-c
📽️ Replacing Self-Attention: https://www.youtube.com/playlist?list=PLpZBeKTZRGPM8PNRyv6fNMcAW3dMDq_A-
📽️ Position embeddings: https://www.youtube.com/playlist?list=PLpZBeKTZRGPOQtbCIES_0hAvwukcs-y-x
@SerranoAcademy Transformer series: https://www.youtube.com/watch?v=OxCpWwDCDFQ&list=PLs8w1Cdi-zva4fwKkl9EK13siFvL9Wewf

📄 Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. "Attention is all you need." Advances in neural inform
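The multiplication-order correction noted in the description (x1 as a row vector times Wq giving a 1x3 query) can be checked with a small shape example. This is a sketch under assumptions: the description only states the 1x3 result, so the input dimension of 4 and all the numbers below are made up, and `vec_mat` is a hypothetical helper.

```python
def vec_mat(x, W):
    """Row vector x (length d) times matrix W (d rows, n columns) -> row vector (length n)."""
    assert len(x) == len(W), "vector length must match the number of matrix rows"
    n_cols = len(W[0])
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(n_cols)]


# Made-up 1x4 input embedding and 4x3 query matrix (assumed sizes).
x1 = [1.0, 0.0, 2.0, 1.0]
Wq = [
    [0.5, 0.0, 1.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.5, 0.5],
    [1.0, 0.0, 0.0],
]

q1 = vec_mat(x1, Wq)  # (1x4) * (4x3) -> (1x3)
print(q1)  # [1.5, 1.0, 2.0]
```

With the factors in the other order, a 4x3 matrix cannot multiply a 1x4 row vector at all, which is why the order matters for arriving at a three-dimensional query.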

This video teaches the basics of the transformer architecture, including self-attention mechanisms, positional embeddings, and residual connections. It also compares transformers to RNNs and discusses their differences. By watching this video, viewers can gain a deep understanding of the transformer architecture and its applications.

Key Takeaways

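Initialize the key, query, and value matrices randomly and let training adapt them
Use several attention heads, each with its own query, key, and value matrices, so that different heads can capture different patterns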
Use positional embeddings to identify the position of each token in a sequence
Add residual connections after the self-attention layer and the MLP layer
Use masked language modeling as a training procedure
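The takeaways above can be combined into one sketch of a single Transformer layer. This is a minimal illustration in plain Python under stated assumptions, not the implementation from the video: it uses one attention head, random numbers in place of trained weights and learned positional embeddings, layer normalization as the normalization step, a GELU only after the first dense layer of the MLP, and hypothetical helper names throughout.

```python
import math
import random


def matmul(A, B):
    """Multiply matrix A (n x m) by matrix B (m x p), both given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def add(A, B):
    """Element-wise sum of two equally shaped matrices."""
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def layer_norm(x, eps=1e-5):
    """Rescale one vector to zero mean and (roughly) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def gelu(v):
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention over all token vectors in X."""
    Q, K, V = matmul(X, Wq), matmul(X, Wk), matmul(X, Wv)
    d_k = len(K[0])
    out = []
    for q in Q:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K])
        out.append([sum(w * v[i] for w, v in zip(weights, V)) for i in range(len(V[0]))])
    return out


def mlp(X, W1, W2):
    """Same two dense layers applied to every token: widen with GELU, then project back."""
    hidden = [[gelu(h) for h in row] for row in matmul(X, W1)]
    return matmul(hidden, W2)


def transformer_layer(X, p):
    # Residual connection around attention, followed by normalization.
    X = [layer_norm(row) for row in add(X, self_attention(X, p["Wq"], p["Wk"], p["Wv"]))]
    # Residual connection around the MLP, followed by normalization.
    return [layer_norm(row) for row in add(X, mlp(X, p["W1"], p["W2"]))]


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
n_tokens, d = 3, 4  # toy sizes (assumption)
params = {
    "Wq": random_matrix(d, d, rng),
    "Wk": random_matrix(d, d, rng),
    "Wv": random_matrix(d, d, rng),
    "W1": random_matrix(d, 2 * d, rng),  # first dense layer doubles the dimension
    "W2": random_matrix(2 * d, d, rng),  # second dense layer scales it back down
}
token_embeddings = random_matrix(n_tokens, d, rng)
position_embeddings = random_matrix(n_tokens, d, rng)  # stand-in for learned positions

X = add(token_embeddings, position_embeddings)  # positions are added to the inputs
Y = transformer_layer(X, params)
print(len(Y), len(Y[0]))
```

The layer returns as many vectors as it received, each with the same dimensionality, so several such layers can be stacked by feeding `Y` into the next one.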

💡 The transformer architecture uses self-attention mechanisms to communicate information in parallel from each input token to every other token, allowing it to process tokens in parallel and train on large datasets.
