Show, Attend and Tell

Connor Shorten · Intermediate ·📄 Research Papers Explained ·6y ago

Skills: Reading ML Papers90%Research Methods80%LLM Foundations70%RAG Basics60%

Key Takeaways

The video explains the Show, Attend and Tell neural image captioning model, which integrates visual attention into the process of generating word descriptions from visual features using a combination of CNN features and LSTM language decoders with an attention layer. The model is used for image captioning and has been adopted into various tasks such as object detection and scene graph generation.

Full Transcript

[Music] this video will present a neural image captioning model named show attended tell this paper was published in 2016 and co-authored by a famous professor yoshua bengio this model is unlike other image captioning models because it integrates visual attention into the process of generating the word descriptions from visual features so image captioning is this idea of taking an image and then producing a sentence that describes the image such as in this example a stop sign is on a road with a mountain in the background this is really interesting because it goes past things like object detection where you would say there's a stop sign and it's exactly in this box or pixel maps even that would I close the stop sign in red the sky in blue and the mountains is like green or something with you know labeling each pixel in the image this goes beyond that this this requires not just understanding the objects but also understanding the relationships between the objects and then the ability to compress complex visual information into a concise descriptive language so image captioning is really interesting but it's far from a solved problem in deep learning this is one of the examples they present in their paper where the model describes this guy with the violin on his chin as a man wearing a hat and a hat on a skateboard he's where it's who has so one approaches presented recently as seen graphs and this is something that I think is pretty interesting where they take these scenes in a and they explicitly model the relationships and images with these graphs where there are like noun objects connected by these verbs I think it's an interesting idea but it definitely hasn't in seen as much adoption as this show at N and tell model that's gonna be described in this video so another way of doing this that this is a new capital of competition the Google open images 2019 visual relationship challenge where they put bounding boxes on things like man and then guitar and then a third bounding box which describes the relationship between objects so you see this guy also a guy guitar and then the relationship bounding box so show tenant L is going to be fully end and depending on the well it is fully at end end but it depends on the whether you want to use a pre-trained CNN feature extractor which is going to be used in this paper and then the hard or soft attention mechanism which will be described later in the presentation so attention is really interesting a new layer and deep learning it first came out in a paper attention is all you need showing how effective it could be for neuro machine translation and it's been adopted into all sorts of things like even rez nets self attention Gann and in this paper using it for image captioning so the first part of the model is the CNN encoder so the paper is going to use the vgg architecture which was you know on top of the state-of-the-art in 2016 so it's gonna take 512 with these 14 by 14 feature maps and this is different from the image capturing models that became for this paper that use these vectors that were at the end of fully connected layers of CNN's so the visual features there are 14 by 14 of these maps they're gonna be flattened out into each one is going to be 196 14 x 14 so each feature map is going to make up a vector and it's going to be flattened out into this matrix so in this matrix each this would be like a 0 a 1 a 2 so a I in Rd this isn't really the right R term that they use but it means that each of these vectors has the dimensionality of 196 so the attention is going to dynamically look at this feature map and it's going to like extract what's relevant for the LS TM decoder at each step of generating the description of the image so as a reminder of the architecture overview this section right here is the convolutional feature extractor and so these feature inputs are feature maps are going to be input to the RNN with attention as it produces the image caption that describes the image so one other interesting thing following reading the paper I looked into like an implementation of this in PI torch so what this shows is the author of the repository aaron wong he shows how a ResNet 152 outperforms bgg 19 if you use it for extracting the features on the show attend and tell captioning model so next part of this model is how it generates the words the lsdm with attention this is probably the most interesting and important detail to take away from the paper so Ellis teams have these gates they have an input modulator an input gate they have a hidden cell state a forget gate and an output gate so each of these gates are parameterize these represent like neural networks that take the input and then modify them in some way before they reach the gate so Z prime T is the context vector the attention mechanism gives the LS TM this component this is the only component that the attention modifies of the LCM so then HT minus 1 represents the last word predicted by the LS TM so let's say it said the stop and then stop is the last word so it's going to take in stop and then use it to maybe predict sign and then so the YT minus 1 is the is like the encoding so the Y is a word token which is usually one hot encoded so it would be like a vector where it's like 0 0 0 1 and then ton more zeroes so E is an embedding matrix so he has dimensionality M by K and then it's multiplied by this K by 1 1 hot vector so that you end up with a why that is dent it's not like 0 0 0 1 0 0 0 it's something like 0.18 0.5 something like that which is easier for these models to work with so basically all that happens is that it takes in the previous input the previous the previous output and then the previous Y output the context vector and then it modifies the hidden state of the LS TM and then it outputs the next token so this is how the context vector is produced so what happens is the attention model takes in that a feature map that we described here this is the a the attention is going to look at this and as well as the word that it was just predicted so it looked at the feature look it's a stop and then it would produce this this alpha mask over the feature map that is then going to be used to like parse the convolutional neural network features so then the predict in the next word in the caption is basically a combination of these terms the embedded matrix of the y term the output and in the context vector times these parameters so another interesting thing with attention is the hardware soffit ention so hard attention is where you define a multinomial multi moly distribution over the A's so with hard attention you don't try you don't train it in the same way as you do back propagation you have to like use the reinforce algorithm and use like a Monte Carlo sampling because what you're doing is you're sampling from a probability distribution you're not just deterministically predicting an output so the hard attention would present like a distribution over the over the feature maps in a and the feature maps in a they don't correspond to the rows and columns in the original image they correspond to like features learn through like convolutional kernels the series's of convolution of kernels optimized for the downstream and probably image in that classification tasks that the VDD network was originally designed for so what you so what it would do is it would sample probabilistically from one of the a vectors and use that as the attention context vector but the soft attention you can train this and end with back propagation because soft attention it just produces a differentiable mask over the a the A's from the convolutional Network it doesn't sample one discrete like feature vector from the ASA so one other clever training trick is that they describe that when you're training these image captioning systems you have like your mini batch first stochastic gradient descent and if one caption is much longer than others then the short captions gonna have to wait for it to be finished so they do is they construct a dictionary and then they patch together the captioned images that have the same length to speed up training so before presenting the results just a quick note on the blue score and what this is it's a technique developed in 2002 to evaluate a neural machine translation systems so it's basically like an Engram Engram evaluation system with some reference sentences so in the results each of the datasets Flickr 8k 30k and m/s cocoa they all have five human annotated captions for the images and then so the blue score is basically a way of saying how well the image caption aligns with the reference captions so these are the blue scores from the both self attention and heart attention are this show attendant tell model and then these are some of the previously used techniques for image captioning so across all of the datasets Flickr 8k Flickr 30k and cocoa the the show ten and tell model performance pretty significantly better maybe not so much over the log by the linear model in Cocoa but in Flickr 30k I mean generally though it's pretty inconsistent but it's still probably the easiest model to implement and then the model that has like the most hyper parameters hooning that would maybe push this even farther ahead so thanks for watching this model on show attendant l a really interesting model for image captioning please subscribe to Henry AI labs for more deep learning videos

Original Description

This video explains an amazing image captioning model that builds on using a combination of visual CNN features + LSTM language decoders by adding an attention layer to the LSTM decoder. Thanks for watching! Please Subscribe! Paper Link: https://arxiv.org/pdf/1502.03044.pdf

Watch on YouTube ↗ (saves to browser)

Sign in to unlock AI tutor explanation · ⚡30

Playlist

Uploads from Connor Shorten · Connor Shorten · 48 of 60

← Previous Next →

DeepWalk Explained

DeepWalk Explained

Inception Network Explained

Inception Network Explained

Progressive Growing of GANs Explained

Progressive Growing of GANs Explained

Improved Techniques for Training GANs

Improved Techniques for Training GANs

Word2Vec Explained

Word2Vec Explained

Must Read Papers on GANs

Must Read Papers on GANs

Unsupervised Feature Learning

Unsupervised Feature Learning

Self-Supervised GANs

Self-Supervised GANs

Embedding Graphs with Deep Learning

Embedding Graphs with Deep Learning

Transfer Learning in GANs

Transfer Learning in GANs

ReLU Activation Function

ReLU Activation Function

AC-GAN Explained

AC-GAN Explained

SimGAN Explained

SimGAN Explained

DC-GAN Explained!

DC-GAN Explained!

ResNet Explained!

ResNet Explained!

Graph Convolutional Networks

Graph Convolutional Networks

Neural Architecture Search

Neural Architecture Search

Video Classification with Deep Learning

Video Classification with Deep Learning

BigGANs in Data Augmentation

BigGANs in Data Augmentation

Introduction to Deep Learning

Introduction to Deep Learning

EfficientNet Explained!

EfficientNet Explained!

Self-Attention GAN

Self-Attention GAN

Curriculum Learning in Deep Neural Networks

Curriculum Learning in Deep Neural Networks

Deep Learning Podcast #1 | Edward Dixon | Stochastic Weight Averaging

Deep Learning Podcast #1 | Edward Dixon | Stochastic Weight Averaging

Deep Compression

Deep Compression

Skin Cancer Classification with Deep Learning

Skin Cancer Classification with Deep Learning

Deep Learning Podcast #2 | Edward Peake | Deep Learning in Medical Imaging

Deep Learning Podcast #2 | Edward Peake | Deep Learning in Medical Imaging

The Lottery Ticket Hypothesis Explained!

The Lottery Ticket Hypothesis Explained!

GauGAN Explained!

GauGAN Explained!

AutoML with Hyperband

AutoML with Hyperband

DL Podcast #3 | Yannic Kilcher | Population-Based Search

DL Podcast #3 | Yannic Kilcher | Population-Based Search

Weakly Supervised Pretraining

Weakly Supervised Pretraining

Image Data Augmentation for Deep Learning

Image Data Augmentation for Deep Learning

Unsupervised Data Augmentation

Unsupervised Data Augmentation

Wide ResNet Explained!

Wide ResNet Explained!

RevNet: Backpropagation without Storing Activations

RevNet: Backpropagation without Storing Activations

GANs with Fewer Labels

GANs with Fewer Labels

BigBiGAN Unsupervised Learning!

BigBiGAN Unsupervised Learning!

Self-Supervised Learning

Self-Supervised Learning

Multi-Task Self-Supervised Learning

Multi-Task Self-Supervised Learning

Self-Supervised GANs

Self-Supervised GANs

Population Based Training

Population Based Training

Show, Attend and Tell

Show, Attend and Tell

Siamese Neural Networks

Siamese Neural Networks

WaveGAN Explained!

WaveGAN Explained!

VAE-GAN Explained!

VAE-GAN Explained!

Evolution in Neural Architecture Search!

Evolution in Neural Architecture Search!

AI Research Weekly Update August 18th, 2019

AI Research Weekly Update August 18th, 2019

Weight Agnostic Neural Networks Explained!

Weight Agnostic Neural Networks Explained!

AI Research Weekly Update August 25th, 2019

AI Research Weekly Update August 25th, 2019

Neuroevolution of Augmenting Topologies (NEAT)

Neuroevolution of Augmenting Topologies (NEAT)

AI Research Weekly Update September 1st, 2019

AI Research Weekly Update September 1st, 2019

Randomly Wired Neural Networks

Randomly Wired Neural Networks

The video explains the Show, Attend and Tell model, which uses visual attention to generate image captions. The model combines CNN features and LSTM language decoders with an attention layer to produce accurate captions. The video also discusses the use of soft attention, hard attention, and the BLUE score for evaluation.

Key Takeaways

Read the paper on Show, Attend and Tell
Implement the model using a CNN encoder and LSTM decoder
Add an attention layer to the LSTM decoder
Use soft attention or hard attention
Evaluate the model using the BLUE score
Fine-tune the model for better performance
Apply the model to various tasks such as object detection and scene graph generation

💡 The use of visual attention in the Show, Attend and Tell model allows it to focus on specific parts of the image when generating captions, leading to more accurate results.

🔒 Pro feature: Ask AI to explain this lesson →

More on: Reading ML Papers

View skill →

Automatic Literature Review with GPT-3 - I embedded and indexed all of arXiv into a search engine!

Automatic Literature Review with GPT-3 - I embedded and indexed all of arXiv into a search engine!

Marcos Lopez Caniego - ESASky's JupyterLab widget| JupyterCon 2020

Marcos Lopez Caniego - ESASky's JupyterLab widget| JupyterCon 2020

Obsidian Zotero Integration Plugin | Streamline Your Research Paper Workflow 📝️

Obsidian Zotero Integration Plugin | Streamline Your Research Paper Workflow 📝️

This FULLY FREE Research Agent can BUILD Reports in Minutes!!!

This FULLY FREE Research Agent can BUILD Reports in Minutes!!!

Claude 3.7 Sonnet API | Build a Research Assistant

Claude 3.7 Sonnet API | Build a Research Assistant

I Built An Obsidian AI Research Assistant with Oz...

I Built An Obsidian AI Research Assistant with Oz...

Related AI Lessons

I Spent Weeks Looking for a Research Gap Before I Realized I Was Searching the Wrong Way

Learn how to effectively find research gaps by changing your approach, a crucial skill for AI researchers and academics

ICMI 2026 Reviews [D]

Learn how to interpret ICMI 2026 reviews and improve your paper's acceptance chances

Reddit r/MachineLearning

Workshop submission for main conference paper under review [D]

Learn how to navigate submitting a paper to a non-archival workshop before the final decision of a main conference like ECCV

Reddit r/MachineLearning

Kept context-switching between arxiv, OpenReview, GitHub, and HuggingFace for every paper, so I built this. Chrome extension + website with everything inline, plus citation graph + SPECTER2 neighbors. 3M papers, free, feedback welcome [P]

Streamline your research with a new Chrome extension and website that integrates 3M papers from arxiv, OpenReview, GitHub, and HuggingFace, including citation graphs and SPECTER2 neighbors, and provide feedback to improve it

Reddit r/MachineLearning

1942: Hitler's Gamble for Victory by Richard Hargreaves · Audiobook preview

Google Play Books