Attention in Neural Networks

CodeEmporium · Beginner · 🧬 Deep Learning · 8y ago

Key Takeaways

The video discusses attention mechanisms in neural networks, covering soft and hard attention, and their applications in various tasks such as visual attention, neural machine translation, and generative adversarial networks.

Full Transcript

[Music]

Say I give you the Deep Learning book along with the question, "How is convolution equivariant with respect to translation?" What would you do to answer this question? Well, one way you can do this is to read the entire book and, assuming you remember everything you've read, try to answer the question. But there's a better way. Since it's a question on convolution, I flip to the chapter on convolution neural networks, then I find equivariance as one of the properties and read out that page, or at least that part of the page.

Which do you think is the faster method? If we read the entire text, like in the first method, answering the question may take us a few weeks, but in the second method the same can be done within a few minutes. That's a very big difference. Furthermore, our answer while reading the entire book may be more vague, as it's based on too much information.

What did we do differently here? In the former case, we didn't focus on any part of the book specifically, whereas in the latter case we focused our attention on the chapter on convolution neural networks and then further focused our attention on the part where the concept of equivariance is explained. This second approach would be the exact thought process many of us humans would take. It's quite intuitive.

Given this example scenario, we can now better define attention. Attention mechanisms found in neural networks are somewhat similar to those found in humans: they focus in high resolution on certain parts of the input while the rest of the input is in low resolution or blurred. In this video, I'm going to talk about the attention mechanism applied on image inputs.

Let's take a look at visual attention at a higher level. Consider the problem of determining appropriate captions for an input image, based on the paper "Show, Attend and Tell". This normally consists of two steps: first is to encode the image in an internal vector representation h using a convolution neural network, and then we decode h into word vectors signifying the captions using a recurrent neural network. The problem with this method is that when generating a single word of the caption, the LSTM looks at the entire image representation h every time. This is not very efficient, as usually we generate different words of a caption looking at different and specific parts of an image.

To solve this problem, we create n different non-overlapping sub-regions. Hence h_i would be the internal feature representation used to generate the i-th word. It is not necessarily the representation of the i-th region of the original image; I'll explain this in a bit. For now, the figure on screen is a high-level diagram of attention. When the decoder decides on a caption, for every word it only looks at specific regions of the image, leading to a more accurate description.

Now that's good, but how exactly does it decide the region or regions to consider? This is the crux of the attention mechanism. An attention unit considers all sub-regions and the context as its input, and it outputs the weighted arithmetic mean of these regions. The weighted arithmetic mean is the inner product of the actual values and their probabilities. How are these probabilities, or weights, determined? They are determined using the context C. Context represents everything that the recurrent neural network has output until now.

Let's take a closer look at what happens. We have input regions Y from the convolution neural net and the context C from the RNN. These inputs are applied to weights, which constitute the learnable parameters of the attention unit. This means the weight vectors update as we get more training data. We apply a tanh activation so that very high values tend to have very small differences and be close to 1, and very low values also have very small differences and are close to -1. This leads to a much smoother choice of regions of interest within each sub-region; it is more fine-grained, so to speak.

Note we don't necessarily have to apply a tanh function. We only need to ensure the regions that we output are relevant to the context. In the simplest form, this similarity can be determined with a simple dot product between the regions Y and the context C. The more similar they are, the higher the product. Hence the output is guaranteed to weight the more relevant regions y_i higher. The difference between using the simple inner product and the tanh function would be the granularity of the output regions of interest: tanh is more fine-grained, with less choppy and smoother parts of sub-regions chosen.

Regardless of how they are calculated, these m's are then passed through a softmax function, which outputs them as probabilities s. Finally, we take the inner product of this probability vector s and the sub-regions Y to get the final output Z of relevant regions of the entire image. Understand that the probabilities s correspond to the relevance of the sub-regions Y given the context C.

Now, there are two types of attention mechanisms. The first is soft attention, and then we have hard attention. The main difference here is that in soft attention, the main relevant region Z consists of different parts of different sub-regions Y. In hard attention, the main relevant region Z consists of only one of the regions Y. I'll explain them both in detail.

The entire mechanism of attention that I described until now is all soft attention: Z has relevant parts of different regions. Soft attention is deterministic. So, deterministic, what's that? A system is said to be deterministic if the application of an action a on a state s always leads to the same state s'. A dumb example would be: you're at a corner of your room at coordinates (0, 0) and you're facing forward. Consider an action a, which is moving 5 feet forward. The system is now at a new state with the coordinates (5, 0) and still facing forward. No matter how many times you stand at the corner of your room, forward facing, and walk 5 feet forward, you will always end up 5 feet from the door and facing forward. Try it. Trust me, it works. Hence the system is deterministic.

Let us apply the same concept to soft attention. Initially we have an image split into a number of regions Y, with an input context C. This is our initial state. On the application of soft attention, we end up with a localized image representing the new state s'. These regions of interest are determined from Z. The ROIs will always be the same regardless of how many times we execute soft attention with these same inputs. This is because we consider all the regions Y anyway to determine Z.

Now consider hard attention. Looking at the architecture, hard attention is very similar to soft attention. However, instead of taking the weighted arithmetic mean of all regions, hard attention only considers one region, chosen randomly. So hard attention is a stochastic process. Now, stochastic: when you hear the word stochastic, think about randomness. In such a stochastic process, performing an action a on a state s may lead to different states every time. A typical example is a board game with dice, like Snakes and Ladders. The initial state is the position of the players, the action is rolling a die, and depending on the roll there are multiple possibilities for the next board state.

What makes hard attention stochastic is that a region y_i is chosen randomly with the probability s_i. This means that the more relevant a region y_i as a whole is to the context, the greater the chance it is chosen for determining the next word of the caption. Using the words of the caption output until now by the RNN, that is h, along with the current regions of interest in an image determined by the attention mechanism, the RNN now tries to predict the next word in the caption.

As far as performance is concerned, in the paper "Show, Attend and Tell", released by the University of Toronto and University of Montreal in 2016, results vary with the data set. Soft and hard attention both perform decently well, with hard attention performing slightly better. This is pretty cool, right?

So where else can we use attention? Attention is not only used for image inputs. For example, neural machine translation (NMT) systems are used to translate one language to another. Words are fed in a sequence to an encoder, one after another, and the sentence is terminated by a specific input word or symbol. Once complete, the special signal initiates the decoder phase, where the translated words are generated.

Another cool application would be Microsoft's attention generative adversarial networks, or Microsoft's attention GAN, which can create images from text through natural language processing. It can perform fine-grained tasks like generating parts of an image from a single word in the description.

Another application would be in the paper "Teaching Machines to Read and Comprehend". The authors do the same thing I talked about in the beginning of the video: a recurrent neural network takes some text and a question as input, and it is made to output an answer.

Here are some things to remember. Attention involves focus in high resolution on certain parts of an input while the rest of the input is in low resolution or is blurred. Two types of attention are soft attention and hard attention. Soft attention is deterministic, while hard attention is stochastic. Attention can be used for non-image inputs, like neural machine translation, attention GANs, and answering questions from text.

And that's all I have for you now. Hope you guys got some newfound understanding of attention and its applications in this video. I have left a link to the main paper, "Show, Attend and Tell", with other papers and blog posts in the description down below. Don't forget to give the video a thumbs up and subscribe for more awesome content. Please subscribe. Please? You did it, right, guys?
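
The soft and hard attention steps described in the transcript can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the implementation from "Show, Attend and Tell": it assumes the simplest scoring the video mentions (a dot product between each region y_i and the context C, with no learned weights and no tanh), and the function names, the toy vectors `Y` and `C`, and the fixed random seed are all made up for the example.

```python
import math
import random


def softmax(scores):
    """Turn raw relevance scores m into probabilities s that sum to 1."""
    top = max(scores)  # subtract the max so exp() cannot overflow
    exps = [math.exp(m - top) for m in scores]
    total = sum(exps)
    return [e / total for e in exps]


def relevance_scores(regions, context):
    """Assumed scoring: dot product of each region y_i with the context C."""
    return [sum(a * b for a, b in zip(y, context)) for y in regions]


def soft_attention(regions, context):
    """Deterministic: z is the probability-weighted mean of ALL regions."""
    s = softmax(relevance_scores(regions, context))
    dim = len(regions[0])
    z = [sum(s_i * y[d] for s_i, y in zip(s, regions)) for d in range(dim)]
    return z, s


def hard_attention(regions, context, rng):
    """Stochastic: pick exactly ONE region y_i, with probability s_i."""
    s = softmax(relevance_scores(regions, context))
    index = rng.choices(range(len(regions)), weights=s, k=1)[0]
    return regions[index], index


# Toy inputs (hypothetical): three region feature vectors and one context.
Y = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
C = [2.0, 0.0]

z, s = soft_attention(Y, C)                          # same result on every call
picked, i = hard_attention(Y, C, random.Random(0))   # depends on the random draw
```

Calling `soft_attention(Y, C)` repeatedly always returns the same `z`, which is the deterministic behaviour described above, while `hard_attention` can return a different region on each call unless the random generator is seeded.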

Original Description

In this video, we discuss Attention in neural networks. We go through Soft and hard attention, discuss the architecture with examples.

SUBSCRIBE to the channel for more awesome content!

My video on Generative Adversarial Networks: https://www.youtube.com/watch?v=O8LAi6ksC80
My video on Convolution Neural Networks: https://www.youtube.com/watch?v=m8pOnJxOcqY

INVESTING
[1] Webull (You can get 3 free stocks setting up a webull account today): https://a.webull.com/8XVa1znjYxio6ESdff

REFERENCES
Show attend and tell (Image Captioning): https://arxiv.org/pdf/1502.03044.pdf
What is attention: https://blog.heuritech.com/2016/01/20/attention-mechanism/
Attention is all you need: https://arxiv.org/pdf/1706.03762v5.pdf
Nice blog on Attention: http://www.wildml.com/2016/01/attention-and-memory-in-deep-learning-and-nlp/
Feed Forward + Attention can solve problems: https://arxiv.org/pdf/1512.08756.pdf
Teaching Machines to Read and Comprehend: https://arxiv.org/pdf/1506.03340.pdf


This video teaches the basics of attention mechanisms in neural networks and their applications in various tasks, including visual attention, neural machine translation, and generative adversarial networks. The viewer will learn how attention mechanisms work and how they can be used to improve the performance of neural networks. By the end of the video, the viewer will be able to understand and apply attention mechanisms to various tasks.

Key Takeaways
  1. Understand the basics of attention mechanisms
  2. Learn about soft and hard attention
  3. Apply attention mechanisms to visual attention tasks
  4. Apply attention mechanisms to neural machine translation tasks
  5. Apply attention mechanisms to generative adversarial networks
💡 Attention mechanisms can be used to focus on high-resolution parts of the input while the rest of the input is in low resolution or blurred, which can improve the performance of neural networks.
