Transformer Neural Networks - EXPLAINED! (Attention is all you need)

CodeEmporium · Advanced ·📐 ML Fundamentals ·6y ago

Key Takeaways

This video explains the Transformer neural network architecture and how its components work, including positional encoders, multi-headed attention layers, and self-attention. It also mentions BERT and the TensorFlow Transformer tutorial.

Full Transcript

Recurrent neural nets: they are feed-forward neural networks rolled out over time. As such, they deal with sequence data, where the input has some defined ordering. This gives rise to several types of architectures. The first is vector-to-sequence models. These neural nets take in a fixed-size vector as input and output a sequence of any length. In image captioning, for example, the input can be a vector representation of an image, and the output sequence is a sentence that describes the image. The second type is a sequence-to-vector model. These neural networks take in a sequence as input and spit out a fixed-length vector. In sentiment analysis, the movie review is the input, and a fixed-size vector is the output, indicating how good or bad this person thought the movie was. Sequence-to-sequence models are the more popular variant; these neural networks take in a sequence as input and output another sequence. So, for example, in language translation, the input could be a sentence in Spanish and the output is the translation in English.

Do you have some time series data to model? Well, RNNs would be the go-to. However, RNNs have some problems. RNNs are slow, so slow that we use a truncated version of backpropagation to train them, and even that's too hardware intensive. Also, they can't deal with long sequences very well: we get gradients that vanish and explode if the network is too long.

In come LSTM networks in 1997, which introduced a long short-term memory cell in place of dumb neurons. This cell has a branch that allows past information to skip a lot of the processing of the current cell and move on to the next. This allows the memory to be retained for longer sequences. Now, to that second point, we seem to be able to deal with longer sequences well. Or are we? Well, kind of: probably on the order of hundreds of words instead of a thousand words. However, to the first point, normal RNNs are slow, but LSTMs are even slower; they are more complex.

For these RNN and LSTM networks, input data needs to
be passed sequentially, or serially, one after the other. We need inputs of the previous state to make any operations on the current state. Such sequential flow does not make good use of today's GPUs, which are designed for parallel computation. So, question: how can we use parallelization for sequential data?

In 2017, the Transformer neural network architecture was introduced. The network employs an encoder-decoder architecture, much like recurrent neural nets. The difference is that the input sequence can be passed in parallel. Consider translating a sentence from English to French; I'll use this as a running example throughout the video. With an RNN encoder, we pass an input English sentence one word after the other. The current word's hidden state has dependencies on the previous word's hidden state. The word embeddings are generated one time step at a time. With a Transformer encoder, on the other hand, there is no concept of time step for the input. We pass in all the words of the sentence simultaneously and determine the word embeddings simultaneously.

So how is it doing this? Let's pick apart the Transformer architecture. I'll make multiple passes on the explanation: the first pass will be like a high-level overview, and in the next rounds we'll get into more details.

Let's start with input embeddings. Computers don't understand words. They get numbers; they get vectors and matrices. The idea is to map every word to a point in space where words similar in meaning are physically closer to each other. The space in which they are present is called an embedding space. We could pre-train this embedding space to save time, or even just use an already pre-trained embedding space. This embedding space maps a word to a vector, but the same word in different sentences may have different meanings. This is where positional encoders come in. It's a vector that has information on distances between words in the sentence. The original paper uses sine and cosine functions to generate this vector, but it could be
any reasonable function. After passing the English sentence through the input embedding and applying the positional encoding, we get word vectors that have positional information, that is, context. Nice.

We pass this into the encoder block, where it goes through a multi-headed attention layer and a feed-forward layer. OK, one at a time. Attention: it involves answering, what part of the input should I focus on? If we are translating from English to French and we are doing self-attention, that is, attention with respect to oneself, the question we want to answer is: how relevant is the i-th word in the English sentence to the other words in the same English sentence? This is represented in the i-th attention vector, and it is computed in the attention block. For every word, we can have an attention vector generated, which captures contextual relationships between words in the sentence. So that's great.

The other important unit is a feed-forward net. This is just a simple feed-forward neural network that is applied to every one of the attention vectors. These feed-forward nets are used in practice to transform the attention vectors into a form that is digestible by the next encoder block or decoder block.

Now that's the high-level overview of the encoder components. Let's talk about the decoder now. During the training phase for English to French, we feed the output French sentence to the decoder. But remember, computers don't get language. They get numbers, vectors, and matrices. So we process it using the input embedding to get the vector form of the word, and then we add a positional vector to get the notion of context of the word in a sentence. We pass this vector finally into a decoder block that has three main components, two of which are similar to the encoder block. The self-attention block generates attention vectors for every word in the French sentence to represent how much each word is related to every word in the same sentence. These attention vectors and vectors from the encoder are
passed into another attention block. Let's call this the encoder-decoder attention block. Since we have one vector for every word in the English and French sentences, this attention block will determine how related each word vector is with respect to each other, and this is where the main English-to-French word mapping happens. The output of this block is attention vectors for every word in the English and the French sentence, each vector representing the relationships with other words in both languages.

Next, we pass each attention vector to a feed-forward unit. This makes the output vector more digestible by the next decoder block or a linear layer. Now the linear layer is, surprise surprise, another feed-forward connected layer. It's used to expand the dimensions into the number of words in the French language. The softmax layer transforms it into a probability distribution, which is now human interpretable, and the final word is the word corresponding to the highest probability. Overall, this decoder predicts the next word, and we execute this over multiple time steps until the end-of-sentence token is generated.

That's our first pass over the explanation of the entire network architecture for Transformers. But let's go over it again, this time introducing even more details.

Going deeper: an input English sentence is converted into an embedding to represent meaning. We add a positional vector to get the context of the word in the sentence. Our attention block computes the attention vectors for each word. The only problem here is that the attention vector may not be too strong: for every word, the attention vector may weight its relation with itself much higher. It's true, but it's useless. We are more interested in interactions with different words. And so we determine, like, eight such attention vectors per word and take a weighted average to compute the final attention vector for every word. Since we use multiple attention vectors, we call it the multi-head attention block. The attention vectors are
passed in through a feed-forward net, one vector at a time. The cool thing is that each of the attention nets is independent of the others, so we can use some beautiful parallelization here. Because of this, we can pass all our words at the same time into the encoder block, and the output will be a set of encoded vectors for every word.

Now the decoder. We first obtain the embedding of French words to encode meaning, then add the positional value to retain context. They are then passed to the first attention block. The paper calls this the masked attention block. Why is this the case, though? It's because while generating the next French word, we can use all the words from the English sentence but only the previous words of the French sentence. If we were to use all the words in the French sentence, then there would be no learning; it would just spit out the next word. So while performing parallelization with matrix operations, we make sure that the matrix will mask the words appearing later by transforming them into zeros so the attention network can't use them. The next attention block, which is the encoder-decoder attention block, generates similar attention vectors for every English and French word. These are passed into the feed-forward layer, linear layer, and the softmax layer to predict the next word.

That's pass two over the architecture explained. I hope you're understanding more and more details here. Now for the next pass, where we go even deeper. How exactly do these multi-head attention networks look? Now, the single-headed attention looks like this: Q, K, and V are abstract vectors that extract different components of an input word. We have Q, K, and V vectors for every single word. We use these to compute the attention vectors for every word using this kind of formula. For multi-headed attention, we have multiple weight matrices WQ, WK, and WV, so we will have multiple attention vectors Z for every word. However, our neural net is only expecting one attention vector per word, so we use
another weight matrix WZ to make sure that the output is still one attention vector per word.

Additionally, after every layer we apply some form of normalization. Typically, we would apply batch normalization. This smooths out the loss surface, making it easier to optimize while using larger learning rates. This is the TL;DR, but that's what it does. But we can actually use something called layer normalization, making the normalization across the features of each sample instead of across the samples in a batch. It's better for stabilization.

If you are interested in dabbling in Transformer code, TensorFlow has a step-by-step tutorial that can get you up to speed. Transformer neural nets have largely replaced LSTM nets for sequence-to-vector, sequence-to-sequence, and vector-to-sequence problems. Google, for example, created BERT, which uses Transformers to pre-train models for common NLP tasks. Read that blog; it's good. However, there was another paper called Pervasive Attention that could be even better than Transformers for sequence-to-sequence models. Although Transformers can be better suited for a wider variety of problems, it's still a very interesting read. I'll link it in the description below with other resources, so check that out.

Hope this helped get you up to speed with Transformer neural nets. If you liked the video, hit that like button, subscribe to stay up to date with some deep learning and machine learning knowledge, and I will see you guys in the next one. Bye bye.
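The Q, K, V attention, the masking of later words, and the multi-head combination described in the transcript can be sketched in NumPy roughly as follows. This is a minimal illustration, not the video's or the paper's code: the function names, shapes, and random weights are assumptions made for the example. The "kind of formula" is taken here to be scaled dot-product attention, softmax(QKᵀ/√d_k)V, from the paper linked in the description. The mask sets the scores of later words to negative infinity before the softmax, which is what makes their attention weights come out as exactly zero. The heads are concatenated and projected by one output matrix (`W_o` here, playing the role of the transcript's WZ).

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V, causal=False):
    """Scaled dot-product attention over the last two axes."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)
    if causal:
        # True above the diagonal marks words that appear later.
        n_q, n_k = scores.shape[-2:]
        future = np.triu(np.ones((n_q, n_k), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    weights = softmax(scores)      # each row sums to 1
    return weights @ V, weights

def multi_head_self_attention(X, W_q, W_k, W_v, W_o, causal=False):
    """X: (n_words, d_model); W_q, W_k, W_v: (heads, d_model, d_head);
    W_o: (heads * d_head, d_model). All shapes are illustrative assumptions."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v          # each (heads, n_words, d_head)
    Z, weights = attention(Q, K, V, causal)      # one Z per head
    concat = np.concatenate(list(Z), axis=-1)    # (n_words, heads * d_head)
    return concat @ W_o, weights                 # one vector per word again

# Toy usage with random weights (untrained, so the values are meaningless).
rng = np.random.default_rng(0)
n_words, d_model, heads, d_head = 5, 16, 4, 4
X = rng.normal(size=(n_words, d_model))
W_q, W_k, W_v = (rng.normal(size=(heads, d_model, d_head)) for _ in range(3))
W_o = rng.normal(size=(heads * d_head, d_model))
out, weights = multi_head_self_attention(X, W_q, W_k, W_v, W_o, causal=True)
print(out.shape)  # (5, 16)
```

Because every word's row is computed by the same matrix products, all words go through the block at once, which is the parallelism the transcript contrasts with RNNs.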

Original Description

Please subscribe to keep me alive: https://www.youtube.com/c/CodeEmporium?sub_confirmation=1

BLOG: https://medium.com/@dataemporium

PLAYLISTS FROM MY CHANNEL
⭕ Reinforcement Learning: https://youtube.com/playlist?list=PLTl9hO2Oobd9kS--NgVz0EPNyEmygV1Ha&si=AuThDZJwG19cgTA8
Natural Language Processing: https://youtube.com/playlist?list=PLTl9hO2Oobd_bzXUpzKMKA3liq2kj6LfE&si=LsVy8RDPu8jeO-cc
⭕ Transformers from Scratch: https://youtube.com/playlist?list=PLTl9hO2Oobd_bzXUpzKMKA3liq2kj6LfE
⭕ ChatGPT Playlist: https://youtube.com/playlist?list=PLTl9hO2Oobd9coYT6XsTraTBo4pL1j4HJ
⭕ Convolutional Neural Networks: https://youtube.com/playlist?list=PLTl9hO2Oobd9U0XHz62Lw6EgIMkQpfz74
⭕ The Math You Should Know : https://youtube.com/playlist?list=PLTl9hO2Oobd-_5sGLnbgE8Poer1Xjzz4h
⭕ Probability Theory for Machine Learning: https://youtube.com/playlist?list=PLTl9hO2Oobd9bPcq0fj91Jgk_-h1H_W3V
⭕ Coding Machine Learning: https://youtube.com/playlist?list=PLTl9hO2Oobd82vcsOnvCNzxrZOlrz3RiD

MATH COURSES (7 day free trial)
📕 Mathematics for Machine Learning: https://imp.i384100.net/MathML
📕 Calculus: https://imp.i384100.net/Calculus
📕 Statistics for Data Science: https://imp.i384100.net/AdvancedStatistics
📕 Bayesian Statistics: https://imp.i384100.net/BayesianStatistics
📕 Linear Algebra: https://imp.i384100.net/LinearAlgebra
📕 Probability: https://imp.i384100.net/Probability

OTHER RELATED COURSES (7 day free trial)
📕 ⭐ Deep Learning Specialization: https://imp.i384100.net/Deep-Learning
📕 Python for Everybody: https://imp.i384100.net/python
📕 MLOps Course: https://imp.i384100.net/MLOps
📕 Natural Language Processing (NLP): https://imp.i384100.net/NLP
📕 Machine Learning in Production: https://imp.i384100.net/MLProduction
📕 Data Science Specialization: https://imp.i384100.net/DataScience
📕 Tensorflow: https://imp.i384100.net/Tensorflow

REFERENCES
[1] The main Paper: https://arxiv.org/abs/1706.03762
[2] Tensor2Tensor has some code with a tutorial: https://www.tensorflow

This video teaches the basics of Transformer neural networks, including their architecture, components, and applications, with a focus on sequence to sequence models and attention mechanisms.

Key Takeaways

Pass input English sentence one word after the other with an RNN encoder
Pass all words of the sentence simultaneously with a Transformer encoder
Apply positional encoding to capture context and distance between words
Compute attention vectors for each word in the attention block
Pass the attention vectors into a feed-forward unit
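The positional encoding step in the takeaways above can be sketched as follows. The transcript only says the original paper uses sine and cosine functions; the exact form used here, sin(pos / 10000^(2i/d_model)) on even dimensions and the matching cosine on odd dimensions, is the one from the paper linked in the description. The function name, the toy sizes, and the assumption that `d_model` is even are choices made for this illustration.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal position vectors, one row per word position.

    Assumes d_model is even (an assumption of this sketch).
    """
    positions = np.arange(seq_len)[:, None]        # (seq_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]  # the values 2i
    angles = positions / np.power(10000.0, even_dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions
    pe[:, 1::2] = np.cos(angles)   # odd dimensions
    return pe

# Toy usage: add position information to (random, untrained) word embeddings.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(6, 8))               # 6 words, d_model = 8
encoder_input = embeddings + positional_encoding(6, 8)
print(encoder_input.shape)  # (6, 8)
```

The position vectors are added to the embeddings rather than concatenated, so the input to the encoder keeps the same width as the embedding.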

💡 The Transformer neural network architecture is well-suited for sequence to sequence models and can be pre-trained for common NLP tasks, making it a powerful tool for a wide variety of problems.
