Show, Attend and Tell

Connor Shorten · Intermediate ·📄 Research Papers Explained ·6y ago

Key Takeaways

The video explains the Show, Attend and Tell neural image captioning model, which integrates visual attention into the process of generating word descriptions from visual features using a combination of CNN features and LSTM language decoders with an attention layer. The model is used for image captioning and has been adopted into various tasks such as object detection and scene graph generation.

Full Transcript

[Music] this video will present a neural image captioning model named show attended tell this paper was published in 2016 and co-authored by a famous professor yoshua bengio this model is unlike other image captioning models because it integrates visual attention into the process of generating the word descriptions from visual features so image captioning is this idea of taking an image and then producing a sentence that describes the image such as in this example a stop sign is on a road with a mountain in the background this is really interesting because it goes past things like object detection where you would say there's a stop sign and it's exactly in this box or pixel maps even that would I close the stop sign in red the sky in blue and the mountains is like green or something with you know labeling each pixel in the image this goes beyond that this this requires not just understanding the objects but also understanding the relationships between the objects and then the ability to compress complex visual information into a concise descriptive language so image captioning is really interesting but it's far from a solved problem in deep learning this is one of the examples they present in their paper where the model describes this guy with the violin on his chin as a man wearing a hat and a hat on a skateboard he's where it's who has so one approaches presented recently as seen graphs and this is something that I think is pretty interesting where they take these scenes in a and they explicitly model the relationships and images with these graphs where there are like noun objects connected by these verbs I think it's an interesting idea but it definitely hasn't in seen as much adoption as this show at N and tell model that's gonna be described in this video so another way of doing this that this is a new capital of competition the Google open images 2019 visual relationship challenge where they put bounding boxes on things like man and then guitar and then a third bounding box which describes the relationship between objects so you see this guy also a guy guitar and then the relationship bounding box so show tenant L is going to be fully end and depending on the well it is fully at end end but it depends on the whether you want to use a pre-trained CNN feature extractor which is going to be used in this paper and then the hard or soft attention mechanism which will be described later in the presentation so attention is really interesting a new layer and deep learning it first came out in a paper attention is all you need showing how effective it could be for neuro machine translation and it's been adopted into all sorts of things like even rez nets self attention Gann and in this paper using it for image captioning so the first part of the model is the CNN encoder so the paper is going to use the vgg architecture which was you know on top of the state-of-the-art in 2016 so it's gonna take 512 with these 14 by 14 feature maps and this is different from the image capturing models that became for this paper that use these vectors that were at the end of fully connected layers of CNN's so the visual features there are 14 by 14 of these maps they're gonna be flattened out into each one is going to be 196 14 x 14 so each feature map is going to make up a vector and it's going to be flattened out into this matrix so in this matrix each this would be like a 0 a 1 a 2 so a I in Rd this isn't really the right R term that they use but it means that each of these vectors has the dimensionality of 196 so the attention is going to dynamically look at this feature map and it's going to like extract what's relevant for the LS TM decoder at each step of generating the description of the image so as a reminder of the architecture overview this section right here is the convolutional feature extractor and so these feature inputs are feature maps are going to be input to the RNN with attention as it produces the image caption that describes the image so one other interesting thing following reading the paper I looked into like an implementation of this in PI torch so what this shows is the author of the repository aaron wong he shows how a ResNet 152 outperforms bgg 19 if you use it for extracting the features on the show attend and tell captioning model so next part of this model is how it generates the words the lsdm with attention this is probably the most interesting and important detail to take away from the paper so Ellis teams have these gates they have an input modulator an input gate they have a hidden cell state a forget gate and an output gate so each of these gates are parameterize these represent like neural networks that take the input and then modify them in some way before they reach the gate so Z prime T is the context vector the attention mechanism gives the LS TM this component this is the only component that the attention modifies of the LCM so then HT minus 1 represents the last word predicted by the LS TM so let's say it said the stop and then stop is the last word so it's going to take in stop and then use it to maybe predict sign and then so the YT minus 1 is the is like the encoding so the Y is a word token which is usually one hot encoded so it would be like a vector where it's like 0 0 0 1 and then ton more zeroes so E is an embedding matrix so he has dimensionality M by K and then it's multiplied by this K by 1 1 hot vector so that you end up with a why that is dent it's not like 0 0 0 1 0 0 0 it's something like 0.18 0.5 something like that which is easier for these models to work with so basically all that happens is that it takes in the previous input the previous the previous output and then the previous Y output the context vector and then it modifies the hidden state of the LS TM and then it outputs the next token so this is how the context vector is produced so what happens is the attention model takes in that a feature map that we described here this is the a the attention is going to look at this and as well as the word that it was just predicted so it looked at the feature look it's a stop and then it would produce this this alpha mask over the feature map that is then going to be used to like parse the convolutional neural network features so then the predict in the next word in the caption is basically a combination of these terms the embedded matrix of the y term the output and in the context vector times these parameters so another interesting thing with attention is the hardware soffit ention so hard attention is where you define a multinomial multi moly distribution over the A's so with hard attention you don't try you don't train it in the same way as you do back propagation you have to like use the reinforce algorithm and use like a Monte Carlo sampling because what you're doing is you're sampling from a probability distribution you're not just deterministically predicting an output so the hard attention would present like a distribution over the over the feature maps in a and the feature maps in a they don't correspond to the rows and columns in the original image they correspond to like features learn through like convolutional kernels the series's of convolution of kernels optimized for the downstream and probably image in that classification tasks that the VDD network was originally designed for so what you so what it would do is it would sample probabilistically from one of the a vectors and use that as the attention context vector but the soft attention you can train this and end with back propagation because soft attention it just produces a differentiable mask over the a the A's from the convolutional Network it doesn't sample one discrete like feature vector from the ASA so one other clever training trick is that they describe that when you're training these image captioning systems you have like your mini batch first stochastic gradient descent and if one caption is much longer than others then the short captions gonna have to wait for it to be finished so they do is they construct a dictionary and then they patch together the captioned images that have the same length to speed up training so before presenting the results just a quick note on the blue score and what this is it's a technique developed in 2002 to evaluate a neural machine translation systems so it's basically like an Engram Engram evaluation system with some reference sentences so in the results each of the datasets Flickr 8k 30k and m/s cocoa they all have five human annotated captions for the images and then so the blue score is basically a way of saying how well the image caption aligns with the reference captions so these are the blue scores from the both self attention and heart attention are this show attendant tell model and then these are some of the previously used techniques for image captioning so across all of the datasets Flickr 8k Flickr 30k and cocoa the the show ten and tell model performance pretty significantly better maybe not so much over the log by the linear model in Cocoa but in Flickr 30k I mean generally though it's pretty inconsistent but it's still probably the easiest model to implement and then the model that has like the most hyper parameters hooning that would maybe push this even farther ahead so thanks for watching this model on show attendant l a really interesting model for image captioning please subscribe to Henry AI labs for more deep learning videos

Original Description

This video explains an amazing image captioning model that builds on using a combination of visual CNN features + LSTM language decoders by adding an attention layer to the LSTM decoder. Thanks for watching! Please Subscribe! Paper Link: https://arxiv.org/pdf/1502.03044.pdf
Watch on YouTube ↗ (saves to browser)
Sign in to unlock AI tutor explanation · ⚡30

Playlist

Uploads from Connor Shorten · Connor Shorten · 48 of 60

1 DenseNets
DenseNets
Connor Shorten
2 DeepWalk Explained
DeepWalk Explained
Connor Shorten
3 Inception Network Explained
Inception Network Explained
Connor Shorten
4 StackGAN
StackGAN
Connor Shorten
5 StyleGAN
StyleGAN
Connor Shorten
6 Progressive Growing of GANs Explained
Progressive Growing of GANs Explained
Connor Shorten
7 Improved Techniques for Training GANs
Improved Techniques for Training GANs
Connor Shorten
8 Word2Vec Explained
Word2Vec Explained
Connor Shorten
9 Must Read Papers on GANs
Must Read Papers on GANs
Connor Shorten
10 Unsupervised Feature Learning
Unsupervised Feature Learning
Connor Shorten
11 Self-Supervised GANs
Self-Supervised GANs
Connor Shorten
12 Embedding Graphs with Deep Learning
Embedding Graphs with Deep Learning
Connor Shorten
13 Transfer Learning in GANs
Transfer Learning in GANs
Connor Shorten
14 ReLU Activation Function
ReLU Activation Function
Connor Shorten
15 AC-GAN Explained
AC-GAN Explained
Connor Shorten
16 SimGAN Explained
SimGAN Explained
Connor Shorten
17 DC-GAN Explained!
DC-GAN Explained!
Connor Shorten
18 ResNet Explained!
ResNet Explained!
Connor Shorten
19 Graph Convolutional Networks
Graph Convolutional Networks
Connor Shorten
20 Neural Architecture Search
Neural Architecture Search
Connor Shorten
21 Henry AI Labs
Henry AI Labs
Connor Shorten
22 Video Classification with Deep Learning
Video Classification with Deep Learning
Connor Shorten
23 BigGANs in Data Augmentation
BigGANs in Data Augmentation
Connor Shorten
24 Introduction to Deep Learning
Introduction to Deep Learning
Connor Shorten
25 EfficientNet Explained!
EfficientNet Explained!
Connor Shorten
26 Self-Attention GAN
Self-Attention GAN
Connor Shorten
27 Curriculum Learning in Deep Neural Networks
Curriculum Learning in Deep Neural Networks
Connor Shorten
28 Deep Learning Podcast #1 | Edward Dixon | Stochastic Weight Averaging
Deep Learning Podcast #1 | Edward Dixon | Stochastic Weight Averaging
Connor Shorten
29 Deep Compression
Deep Compression
Connor Shorten
30 Skin Cancer Classification with Deep Learning
Skin Cancer Classification with Deep Learning
Connor Shorten
31 Deep Learning Podcast #2 | Edward Peake | Deep Learning in Medical Imaging
Deep Learning Podcast #2 | Edward Peake | Deep Learning in Medical Imaging
Connor Shorten
32 The Lottery Ticket Hypothesis Explained!
The Lottery Ticket Hypothesis Explained!
Connor Shorten
33 SqueezeNet
SqueezeNet
Connor Shorten
34 GauGAN Explained!
GauGAN Explained!
Connor Shorten
35 AutoML with Hyperband
AutoML with Hyperband
Connor Shorten
36 DL Podcast #3 | Yannic Kilcher | Population-Based Search
DL Podcast #3 | Yannic Kilcher | Population-Based Search
Connor Shorten
37 Weakly Supervised Pretraining
Weakly Supervised Pretraining
Connor Shorten
38 Image Data Augmentation for Deep Learning
Image Data Augmentation for Deep Learning
Connor Shorten
39 Unsupervised Data Augmentation
Unsupervised Data Augmentation
Connor Shorten
40 Wide ResNet Explained!
Wide ResNet Explained!
Connor Shorten
41 RevNet: Backpropagation without Storing Activations
RevNet: Backpropagation without Storing Activations
Connor Shorten
42 GANs with Fewer Labels
GANs with Fewer Labels
Connor Shorten
43 BigBiGAN Unsupervised Learning!
BigBiGAN Unsupervised Learning!
Connor Shorten
44 Self-Supervised Learning
Self-Supervised Learning
Connor Shorten
45 Multi-Task Self-Supervised Learning
Multi-Task Self-Supervised Learning
Connor Shorten
46 Self-Supervised GANs
Self-Supervised GANs
Connor Shorten
47 Population Based Training
Population Based Training
Connor Shorten
Show, Attend and Tell
Show, Attend and Tell
Connor Shorten
49 Siamese Neural Networks
Siamese Neural Networks
Connor Shorten
50 WaveGAN Explained!
WaveGAN Explained!
Connor Shorten
51 VAE-GAN Explained!
VAE-GAN Explained!
Connor Shorten
52 Evolution in Neural Architecture Search!
Evolution in Neural Architecture Search!
Connor Shorten
53 AI Research Weekly Update August 18th, 2019
AI Research Weekly Update August 18th, 2019
Connor Shorten
54 Weight Agnostic Neural Networks Explained!
Weight Agnostic Neural Networks Explained!
Connor Shorten
55 AI Research Weekly Update August 25th, 2019
AI Research Weekly Update August 25th, 2019
Connor Shorten
56 Neuroevolution of Augmenting Topologies (NEAT)
Neuroevolution of Augmenting Topologies (NEAT)
Connor Shorten
57 CoDeepNEAT
CoDeepNEAT
Connor Shorten
58 AI Research Weekly Update September 1st, 2019
AI Research Weekly Update September 1st, 2019
Connor Shorten
59 Randomly Wired Neural Networks
Randomly Wired Neural Networks
Connor Shorten
60 Genetic CNN
Genetic CNN
Connor Shorten

The video explains the Show, Attend and Tell model, which uses visual attention to generate image captions. The model combines CNN features and LSTM language decoders with an attention layer to produce accurate captions. The video also discusses the use of soft attention, hard attention, and the BLUE score for evaluation.

Key Takeaways
  1. Read the paper on Show, Attend and Tell
  2. Implement the model using a CNN encoder and LSTM decoder
  3. Add an attention layer to the LSTM decoder
  4. Use soft attention or hard attention
  5. Evaluate the model using the BLUE score
  6. Fine-tune the model for better performance
  7. Apply the model to various tasks such as object detection and scene graph generation
💡 The use of visual attention in the Show, Attend and Tell model allows it to focus on specific parts of the image when generating captions, leading to more accurate results.

Related AI Lessons

I Spent Weeks Looking for a Research Gap Before I Realized I Was Searching the Wrong Way
Learn how to effectively find research gaps by changing your approach, a crucial skill for AI researchers and academics
Medium · AI
ICMI 2026 Reviews [D]
Learn how to interpret ICMI 2026 reviews and improve your paper's acceptance chances
Reddit r/MachineLearning
Workshop submission for main conference paper under review [D]
Learn how to navigate submitting a paper to a non-archival workshop before the final decision of a main conference like ECCV
Reddit r/MachineLearning
Kept context-switching between arxiv, OpenReview, GitHub, and HuggingFace for every paper, so I built this. Chrome extension + website with everything inline, plus citation graph + SPECTER2 neighbors. 3M papers, free, feedback welcome [P]
Streamline your research with a new Chrome extension and website that integrates 3M papers from arxiv, OpenReview, GitHub, and HuggingFace, including citation graphs and SPECTER2 neighbors, and provide feedback to improve it
Reddit r/MachineLearning
Up next
1942: Hitler's Gamble for Victory by Richard Hargreaves · Audiobook preview
Google Play Books
Watch →