Transformers explained | The architecture behind LLMs

AI Coffee Break with Letitia · Beginner ·🧠 Large Language Models ·2y ago

Key Takeaways

The video explains the transformer architecture, which powers recent AI breakthroughs, including how to structure inputs, attention mechanisms, positional embeddings, and residual connections. It also compares transformers to Recurrent Neural Networks (RNNs) and discusses their differences.

Full Transcript

[Music] The Transformer architecture powers most of the impressive recent breakthroughs in AI. The Transformer is behind systems like ChatGPT, Vision Transformers, image generators, AlphaFold 2 for predicting protein folding, and many others. So if you're interested to know about the Transformer, this is the right video for you. We already made a video explaining the Transformer, but it was one of our first videos and I can do it so much better now. Also, there we did not spend enough time explaining self-attention, which we will do better this time. So here we go with the remastered explanation of the Transformer architecture.

Transformers can work with any kind of data, and by that I mean text, images, speech, and so on, as long as we represent the data as a set of vectors. However, it is not always straightforward to do this, as, for example, text does not naturally come as a sequence of vectors. That means before we can look at the inner workings of the Transformer, we need to understand how to represent inputs as vectors. So let's look at two examples: text and images.

For text, we do the so-called tokenization, where we take a sequence of words and decompose it with the tokenizer into subwords from a predefined vocabulary, for example by splitting at white spaces and breaking down compound words into their components. If you want to know more about tokenization, check out our previous video on this. Then the subwords all get assigned a unique vector. The vectors could be initialized randomly or, even better, with word embeddings. Word embeddings work on the idea that distances between embeddings represent word similarity: a word is defined by the company it keeps, and words that are semantically more similar are initialized with vectors close together in the high-dimensional vector space. You can easily download such word embeddings, as they are precomputed by counting how often words appear next to other words in text on the internet or other large corpora, and neural networks learn to assign to words
similar embeddings if they both have the same neighbors. You can learn more about word embeddings in our previous video.

Now that we know how to represent text, let's think about how to represent images. Images are more naturally represented as vectors, or at least as matrices, which are high-dimensional vectors. An image is composed of three matrices, where each matrix tells us, for the red, green, and blue channels, what the light intensity of that color is in the corresponding pixel. One could take the rows of each matrix and write them one after the other to get vectors, but this would result in a lot of vectors, and Transformers are much, much slower with many vectors, as will become clearer later in this video. So what people do instead is to divide images into patches and apply to each patch the same linear neural network layer, which trains together with the Transformer to find the right weights that sensibly change the dimensionality of P by P patches to a d × 1 matrix, which is a d-dimensional vector.

To summarize, the prerequisite of Transformers is that, whatever the input, we must first decide on a way to represent this input with vectors. All neural networks, including the Transformer, process these vector representations into better and better representations with each layer, until the solution for the task is obvious in the final layer, or linearly separable if we want to use jargon. But compared to other neural networks, the Transformer does this processing in a specific way, as follows.

Let's suppose we have an input sequence, here of text, and the task is, for example, to predict which token comes next, or whether the sentence expresses a positive or a negative sentiment, or any other kind of classification task we can think of. We take our input sequence represented as vectors with word embeddings, and one Transformer takes in this sequence, updates the vectors, and outputs as many vectors as it had in the input, and preserves the dimensionality of the vectors. But to do something meaningful
with these Transformers, we need to add special tokens, for example a classification token at the end of the sequence. This special token goes through the Transformer in the same way as the other tokens do, but it is special because to its output representation we usually append a linear classification layer that classifies, from a list of words called the vocabulary, which token comes next; or, if we are trying to classify, it assigns probabilities to the classes of the classification task. Note that this is a simple classification layer; mathematically, it is just a matrix multiplication that happens here, which geometrically corresponds to drawing a separation line in the high-dimensional space the word vectors live in. In other words, the solution here should already be obvious, as prepared by the Transformer, such that we can tell fitting classes from unfitting classes just by drawing a line.

During training, the Transformer processes the input and gives output vectors, and we run the classification layer on the special token and get the assigned class. We compare the assigned prediction to the expected one from the dataset, compute the loss value, backpropagate the loss value, and update the internal parameters of the classification layer and the Transformer layers to values that minimize the loss, thus giving better classification results next time.

Okay, but what happens in this mysterious box we call Transformer? Well, it is composed of multiple Transformer layers. One Transformer layer contains two things. One of them is not so much: it is just the same feed-forward network, also called the MLP sublayer, acting on every input token. Such an MLP sublayer takes the input representation and applies a dense layer with GELU activation that doubles the dimension; then another dense layer with GELU activation scales down the dimension again. And it is the same MLP layer, with the exact same weights, that we apply to each input token embedding. Okay, let's see what we have: a bunch of MLP layers
processing each token independently of the others. This is suboptimal, because see this word representation? Well, it does not even know that there are other words next to it. And it's even worse for the classification token, which should aggregate and summarize the sentence information if we are to use it for classification, but it has no connection to the sentence tokens at all. While the Transformer layer saves a lot of compute time because all of these MLP layers compute their output in parallel, we need a way to communicate information in the context of the sequence, so that the word "works" is informed of the existence and semantics of its neighbor "attention", for example.

Luckily, this is what the self-attention layer is for: to let information flow within the context of the sequence from one embedding to its neighbors. In a nutshell, the attention layer computes how much of the representation of each of the neighbors we need to add to compute a new token representation, which is the outcome of the self-attention layer.

By the way, we will be using attention and self-attention here synonymously, but if you're wondering what the difference between them is: self-attention is when we compute importances of the elements of a sequence with respect to the elements in the same sequence. Attention is more general, because we compute the importance of the elements of one sequence to the elements in another sequence. For example, you can see here the self-attention of "it" on the left and the attention of "in" on the right; "in" is an element from a sequence different from the one above it.

Now, how does the attention layer compute these importances exactly? Well, it is a bit complicated, in the sense that it is a pile of linear algebra that uses the loss function to adapt the entries of weight matrices during training to make them work well in inference. But neural networks are never anything other than huge piles of linear algebra, so strap yourself onto your chair, because we will try to explain the
attention computation as clearly as possible.

Self-attention does the following: it takes the input vectors and applies three different linear transformations to produce the key, query, and value vectors. This means that for the queries, it multiplies the query matrix with the input vector, and this results in a query vector. This query matrix is randomly initialized before training, and gradient descent adapts its values during backpropagation to make them the right ones that reduce the loss on the training data. The same query matrix applies to all inputs to get query vectors for all of them. As for the keys, we simply have another matrix called the key matrix, which is initialized differently from the query matrix and also multiplies with the input vector to produce a key vector. And to produce the value vectors, we multiply a value matrix with the input. So in summary, we have three different matrices, all initialized randomly, that linearly transform the input in different ways.

Now, what does self-attention further do with these different vectors it has just produced? Let's suppose we are calculating the attention for the input token "works" to all other tokens in the sequence, including itself. It works the same for the other tokens too, but we will just show it for "works". First, we compute the scalar product between the query vector of the token of interest and the keys of every token. Then we divide by the square root of the dimension of the key vectors, so the square root of three. Then we apply the softmax over all these values. We can interpret these softmax scores as measuring how important each token in the input is for the token "works". So the token "attention" is 13% important for "works", "works" is 78% important to itself, and the CLS token is 7% important.

Now it gets interesting: to get the final representation of "works", we take the sum over all value vectors, weighted (or multiplied) with the softmax result. So this is what we meant before by saying the attention combines the
representations of the input, the value vectors, weighted by the importance scores.

Empirically, it turns out that one set of attention values in each layer is not enough to capture the complexity of relationships in our data. Think of it this way: the attention importance scores define a graph, where it tells us for each token how important it is to all other tokens. But one graph is not enough to model all existing relationships. In the same way, you can define your social network graph based on how many friends you have, but you can also think of other types of connections, like with whom of these people you work together, or with whom you share the same city. There are multiple relationships and importances to be modeled, given a set of tokens.

Therefore, the idea of multi-head attention is to let the network learn three or eight or 12 attention patterns instead of just one. So we do not use just one set of query, key, and value matrices, but three of them, and each set is called an attention head. As we initialize the key, query, and value matrices all randomly, they will start with different values and, in the training process, will produce different query vectors, and they will usually capture different patterns that they detect in your data. One head might focus on one pattern, such as coreference resolution, and another one on identifying the subject in sentences.

If you wonder how many attention heads you need, the answer is that you are free to choose: it is a hyperparameter. The more the better, but often you cannot use very many, as you quickly run out of GPU memory, especially because attention scales quadratically in time and memory. So if you process a sequence of double the size, you will need four times as much time to run and four times as much memory. It is an active area of research to approximate attention with other operations that scale linearly instead of quadratically, or to replace it altogether with other operations that do the job of mixing information between
tokens. If you're interested in this topic, please watch our previous videos on this. But in a nutshell, it's fake news that attention is all you need: you can replace it with other token-mixing procedures too.

Now let's recap what we have so far and what we still need for a full Transformer. We have our input embeddings; they go through the self-attention layer, which gives us representations that are informed about the fellow embeddings in the sequence. Then they go through the MLP layer, all in parallel. But so far, this Transformer layer behaves as if our input were not a sequence but a set. If we were to reorder the tokens, the Transformer would not change its outputs: the results of the attention would still be the same, as all operations there are commutative (please check to convince yourself), and the feed-forward network acts independently of all the other tokens anyway. This is not great, as so far the Transformer gives us the same output independently of the order of the input. Because images, text, and sound are sequences where order matters, we need a way to tell the Transformer layer that this is the first token in the sequence, this is the second, and so on.

This is what positional embeddings do. They are vectors that uniquely identify each position, and we add these vectors to the input embeddings. They work like house numbers that identify the specific position of each house in a street address. How do we come up with the values for position embeddings? Well, with certain rules, or we can simply learn these vectors as well during the training process of the Transformer. If you want more details about positional embeddings and the numerous ways to implement them, you can watch one of our previous videos on this.

Okay, now that we have this figured out, there is one more thing missing and the architecture is complete. The missing ingredient is the residual connections, which, after the self-attention layer, add the input of the self-attention layer to its output. A normalization
operation reduces the values back again to the 0 to 1 range after the sum, because otherwise, after each residual connection, with each layer, the values would get larger and larger and larger. The same thing of adding the input back to the output happens around the MLP layer, here in green.

The intuition behind residual connections is to make the learning job easier for each layer. To arrive at the solution, the network needs to transform the inputs, but since it is allowed to keep the input through the residual connections, each layer is forced to learn not the whole transformation but just the difference it needs to add to arrive at the output. It's kind of breaking down the problem a bit. Residual connections become even more important because, as usual with deep neural networks, we do not just use one Transformer layer but append another Transformer layer to the output of the previous one, and another layer, and so on. How many? It's a hyperparameter, and of course we are limited by the amount of memory our GPUs have. The more the better, because the Transformer gets more attempts to break down the problem and arrive at the solution, which is easier than getting the solution in one go with just one layer. Residual connections also help when training such a long stack of layers, because during backpropagation gradient signals can get lost by propagating from the end to the beginning, very much like a whisper in the telephone game.

Now, this is most of what you need to know about Transformer basics, since you now know the principles by which they predict the next word, like GPT, or classify the whole sequence. Another training procedure, which we left for the end, is the so-called masked language modeling procedure used for Transformers of the BERT family. There we have a classifier token that we use to classify whether two sentences belong together or not. But there's more: 15% of tokens in the sequence are chosen randomly, masked out, and replaced with a special mask token.
The training objective of BERT is then to adapt its weights such that a linear mask classification head can choose from the vocabulary the word that we masked out in the input. This masked language modeling procedure is great for training classification Transformers, or so-called Transformer encoders. Predicting the next word is something for GPT-like models, so for Transformer decoders.

If you're wondering what the difference between Transformers and recurrent neural networks is, then let's look at this simplified view. While in Transformers we use attention to communicate information in parallel from each input token to every other token, RNNs process the first token and use that output as input, together with the second token, to process the second token. Then the output of the second token goes into the processing of the third token, and so on. And you see the problem: we need to wait for the second token to finish processing before we can start computing the third token. This means that RNNs train more slowly than Transformers. So when Transformers revolutionized NLP, it was because their architecture allowed them to read the entire internet, because they could process tokens in parallel, while with RNNs nobody got to train on the whole internet because it took so much time.

We hope you liked this little introduction to the Transformer architecture and that you can impress your friends and family now that you know how ChatGPT works internally. There are countless other great resources on this topic, such as The Illustrated Transformer blog post by Jay Alammar and the Transformer series of Luis Serrano. Also, I hope that my Patreon supporters who voted for the Transformer explained video as the topic for the next video will be happy, as I finally managed to finish this video. I really thank them for their patience. If you liked this video, do not forget to like and subscribe, and we hope to see you next time. Okay. [Music] [Applause] Bye!
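
The attention computation described in the transcript (queries, keys, values, scaled scalar products, softmax, weighted sum of values) can be sketched in a few lines of NumPy. This is a minimal illustration, not the code behind the video: the function names, the random weights, and the sizes (3 tokens, embedding size 4, key size 3) are assumptions chosen to mirror the spoken example. It uses the row-vector order x * Wq that the video description gives as the correct order of multiplication.

```python
import numpy as np

def softmax(scores):
    # Subtract the row maximum before exponentiating for numerical stability.
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """x: (n_tokens, d_model); w_q, w_k, w_v: (d_model, d_k)."""
    q = x @ w_q  # one query vector per token (row vector times matrix)
    k = x @ w_k  # one key vector per token
    v = x @ w_v  # one value vector per token
    d_k = k.shape[-1]
    # Scalar product of every query with every key, scaled by sqrt(d_k).
    scores = q @ k.T / np.sqrt(d_k)
    # Row i holds the importance of every token for token i.
    weights = softmax(scores)
    # New representation of each token: importance-weighted sum of values.
    return weights @ v, weights

# Hypothetical sizes: 3 tokens, 4-dimensional embeddings, 3-dimensional keys.
rng = np.random.default_rng(0)
n_tokens, d_model, d_k = 3, 4, 3
x = rng.normal(size=(n_tokens, d_model))
w_q = rng.normal(size=(d_model, d_k))
w_k = rng.normal(size=(d_model, d_k))
w_v = rng.normal(size=(d_model, d_k))

out, weights = self_attention(x, w_q, w_k, w_v)
print(weights.round(2))  # each row sums to 1
print(out.shape)         # (3, 3)
```

The `weights` matrix has one row and one column per token, which is where the quadratic cost in sequence length mentioned in the transcript comes from. A multi-head variant would run this function once per head with separate weight matrices.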
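
The masking step of masked language modeling can be sketched as follows. This is a simplified illustration of what the transcript describes (about 15% of tokens chosen at random and replaced by a mask token); the function name, the `[MASK]` string, and the example sentence are assumptions, and real BERT training adds further details that the transcript does not cover.

```python
import random

MASK_TOKEN = "[MASK]"  # assumed spelling of the special mask token

def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Replace roughly mask_prob of the tokens with MASK_TOKEN.

    Returns the masked sequence and a dict {position: original token},
    which is what the mask classification head is trained to predict.
    """
    rng = random.Random(seed)
    n_to_mask = max(1, round(len(tokens) * mask_prob))
    positions = sorted(rng.sample(range(len(tokens)), n_to_mask))
    masked = list(tokens)
    labels = {}
    for pos in positions:
        labels[pos] = tokens[pos]
        masked[pos] = MASK_TOKEN
    return masked, labels

tokens = ("the transformer architecture powers most of the impressive recent "
          "breakthroughs in ai and many other systems we use every day").split()
masked, labels = mask_tokens(tokens, seed=0)
print(masked)
print(labels)
```

With 20 input tokens and a masking rate of 15%, three positions are masked; the model sees `masked` as input and is trained to recover the entries of `labels`.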

Original Description

All you need to know about the transformer architecture: How to structure the inputs, attention (Queries, Keys, Values), positional embeddings, residual connections. Bonus: an overview of the difference between Recurrent Neural Networks (RNNs) and transformers.

9:19 Order of multiplication should be the opposite: x1(vector) * Wq(matrix) = q1(vector). Otherwise we do not get the 1x3 dimensionality at the end. Sorry for messing up the animation!

Check this out for a super cool transformer visualisation! 👏 https://poloclub.github.io/transformer-explainer/

➡️ AI Coffee Break Merch! 🛍️ https://aicoffeebreak.creator-spring.com/

Outline:
00:00 Transformers explained
00:47 Text inputs
02:29 Image inputs
03:57 Next word prediction / Classification
06:08 The transformer layer: 1. MLP sublayer
06:47 2. Attention explained
07:57 Attention vs. self-attention
08:35 Queries, Keys, Values
09:19 Order of multiplication should be the opposite: x1(vector) * Wq(matrix) = q1(vector).
11:26 Multi-head attention
13:04 Attention scales quadratically
13:53 Positional embeddings
15:11 Residual connections and Normalization Layers
17:09 Masked Language Modelling
17:59 Difference to RNNs

Thanks to our Patrons who support us in Tier 2, 3, 4: 🙏 Dres. Trost GbR, Siltax, Vignesh Valliappan, @Mutual_Information , Kshitij

Our old Transformer explained 📺 video: https://youtu.be/FWFA4DGuzSc
📺 Tokenization explained: https://youtu.be/D8j1c4NJRfo
📺 Word embeddings: https://youtu.be/YkK5IKgxp-c
📽️ Replacing Self-Attention: https://www.youtube.com/playlist?list=PLpZBeKTZRGPM8PNRyv6fNMcAW3dMDq_A-
📽️ Position embeddings: https://www.youtube.com/playlist?list=PLpZBeKTZRGPOQtbCIES_0hAvwukcs-y-x
@SerranoAcademy Transformer series: https://www.youtube.com/watch?v=OxCpWwDCDFQ&list=PLs8w1Cdi-zva4fwKkl9EK13siFvL9Wewf

📄 Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. "Attention is all you need." Advances in neural inform


This video teaches the basics of the transformer architecture, including self-attention mechanisms, positional embeddings, and residual connections. It also compares transformers to RNNs and discusses their differences. By watching this video, viewers can gain a deep understanding of the transformer architecture and its applications.
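
The MLP sublayer with its residual connection and normalization can be sketched as below. This is a minimal NumPy illustration under stated assumptions: the function names and sizes are hypothetical, the hidden layer doubles the dimension and both dense layers use GELU as the transcript describes (many implementations apply the activation only after the first layer), and the normalization shown is a plain layer normalization without learned scale and shift, which rescales each token vector to zero mean and unit variance.

```python
import numpy as np

def gelu(x):
    # Widely used tanh approximation of the GELU activation.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def layer_norm(x, eps=1e-5):
    # Normalize each token vector separately (no learned scale or shift here).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def mlp_sublayer(x, w1, b1, w2, b2):
    hidden = gelu(x @ w1 + b1)      # expand: d_model -> 2 * d_model
    return gelu(hidden @ w2 + b2)   # shrink back: 2 * d_model -> d_model

def residual_mlp_block(x, w1, b1, w2, b2):
    # Residual connection: add the sublayer input back to its output,
    # then normalize so values do not grow from layer to layer.
    return layer_norm(x + mlp_sublayer(x, w1, b1, w2, b2))

# Hypothetical sizes: 3 tokens with 4-dimensional embeddings.
rng = np.random.default_rng(0)
n_tokens, d_model = 3, 4
d_hidden = 2 * d_model
x = rng.normal(size=(n_tokens, d_model))
w1, b1 = rng.normal(size=(d_model, d_hidden)), np.zeros(d_hidden)
w2, b2 = rng.normal(size=(d_hidden, d_model)), np.zeros(d_model)

out = residual_mlp_block(x, w1, b1, w2, b2)
print(out.shape)  # (3, 4)
```

Because the same weights act on every row of `x` separately, changing one token leaves the outputs of the other tokens untouched, which is why the attention sublayer is needed to mix information between tokens.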

Key Takeaways
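  1. Initialize the query, key, and value matrices randomly and let training adapt them
  2. Use several attention heads, each with its own query, key, and value matrices, so that different heads can capture different patterns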
  3. Use positional embeddings to identify the position of each token in a sequence
  4. Add residual connections after the self-attention layer and the MLP layer
  5. Use masked language modeling as a training procedure
💡 The transformer architecture uses self-attention mechanisms to communicate information in parallel from each input token to every other token, allowing it to process tokens in parallel and train on large datasets.
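
The role of positional embeddings can be checked numerically. The sketch below is an illustration under assumptions, not code from the video: it uses the fixed sinusoidal embeddings from the original Transformer paper as one example of the rule-based option, with random weights and hypothetical sizes. Without positional embeddings, shuffling the tokens merely shuffles the outputs in the same way; once positions are added, the outputs depend on the order.

```python
import numpy as np

def softmax(scores):
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    weights = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    return weights @ v

def sinusoidal_positions(n_tokens, d_model):
    # One vector per position: sines on even dimensions, cosines on odd ones.
    positions = np.arange(n_tokens)[:, None]       # (n_tokens, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]  # (1, d_model / 2)
    angles = positions / (10000.0 ** (even_dims / d_model))
    pe = np.zeros((n_tokens, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Hypothetical sizes: 4 tokens with 4-dimensional embeddings.
rng = np.random.default_rng(0)
n_tokens, d_model = 4, 4
x = rng.normal(size=(n_tokens, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
perm = np.array([2, 0, 3, 1])  # a reordering of the tokens

# Without positions: shuffled input gives the same vectors, just shuffled.
plain = self_attention(x, w_q, w_k, w_v)
plain_shuffled = self_attention(x[perm], w_q, w_k, w_v)
order_ignored = np.allclose(plain_shuffled, plain[perm])

# With positions: the same token gets a different output in a different slot.
pe = sinusoidal_positions(n_tokens, d_model)
with_pos = self_attention(x + pe, w_q, w_k, w_v)
with_pos_shuffled = self_attention(x[perm] + pe, w_q, w_k, w_v)
order_matters = not np.allclose(with_pos_shuffled, with_pos[perm])

print(order_ignored, order_matters)
```

Learned positional embeddings would replace `sinusoidal_positions` with a trainable matrix of the same shape.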

Chapters (15)

0:00 Transformers explained
0:47 Text inputs
2:29 Image inputs
3:57 Next word prediction / Classification
6:08 The transformer layer: 1. MLP sublayer
6:47 2. Attention explained
7:57 Attention vs. self-attention
8:35 Queries, Keys, Values
9:19 Order of multiplication should be the opposite: x1(vector) * Wq(matrix) = q1(vector).
11:26 Multi-head attention
13:04 Attention scales quadratically
13:53 Positional embeddings
15:11 Residual connections and Normalization Layers
17:09 Masked Language Modelling
17:59 Difference to RNNs