The relationship between convolution & self-attention
Key Takeaways
The clip presents self-attention as a soft retrieval over all pixels of an image and contrasts its global reach with the strictly local receptive field of convolution, as background for why Transformer-based models have replaced Convolutional Neural Networks (CNNs) in computer vision.
Full Transcript
Just like convolution, you can think of self-attention as a transformation of an image that operates in a conceptual space. So say we're currently processing this pixel. According to established terminology, this is the query pixel. To process it, we'll take all the image pixels into account, and we'll call them key pixels. And yes, the query pixel itself is also part of the keys. It plays a double role.
If you're struggling with this terminology, remember that the Transformer comes from Google, which is a search company. The image is like a database of pixels, and the query pixel is basically trying to retrieve relevant pixels from this imaginary database. When retrieving from a regular database, we would pick the top k most relevant keys. That's called hard retrieval. But self-attention does soft retrieval instead. It associates a weight, or attention score, with each key pixel, reflecting its relevance to the query pixel. So the output pixel is a weighted sum of the key pixels, with the constraint that attention scores should be positive and sum up to one. The amount of attention that a query pixel should attribute to a key pixel is determined by a similarity measure in vector space, like a dot product.
What we have here is a complete definition for the output pixel, but it's fixed. It's not learned, just like the blurring kernel. In the context of a neural network, we'll generalize this operation by passing all pixel vectors through linear transformations. So we'll multiply them by learned matrices. The query pixel gets a query matrix. The key pixels share the same key matrix. And the output pixel gets a final value matrix. And that is self-attention, minus some constants that I'm omitting for brevity.
When we compare convolution and self-attention side by side, one difference jumps out immediately. A convolutional layer has a strictly local receptive field. Each pixel can only interact with its neighbors. Self-attention has no such constraint. A pixel can look at any other pixel in the image in a single step.
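The operation described in the transcript can be sketched in a few lines of NumPy. This is a minimal illustration, not the video's own code: the function names, the image size, the random pixel vectors and the random "learned" matrices are all assumptions made for the example. A softmax is used here as one common way to make the dot-product scores positive and sum to one, and, like the transcript, the sketch leaves out any scaling constants.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability; the result is unchanged.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(pixels, W_q, W_k, W_v):
    """Soft retrieval over pixels.

    pixels: (N, d) array, one row per pixel vector.
    W_q, W_k, W_v: (d, d) query, key and value matrices (hypothetical names).
    Returns the (N, d) output pixels and the (N, N) attention weights.
    """
    queries = pixels @ W_q          # every pixel acts as a query...
    keys = pixels @ W_k             # ...and also as a key
    scores = queries @ keys.T       # dot-product similarity, query i vs key j
    weights = softmax(scores, axis=-1)  # each row is positive and sums to one
    # Weighted sum of the pixels, followed by the value matrix.
    # By linearity this equals weights @ (pixels @ W_v).
    outputs = (weights @ pixels) @ W_v
    return outputs, weights

# Toy "image": 4 x 4 pixels, each an 8-dimensional vector (assumed sizes).
rng = np.random.default_rng(0)
H, W, d = 4, 4, 8
image = rng.normal(size=(H, W, d))
pixels = image.reshape(H * W, d)    # flatten: position does not restrict who attends to whom

# Random stand-ins for matrices that would normally be learned.
W_q, W_k, W_v = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))

out, weights = self_attention(pixels, W_q, W_k, W_v)
out_image = out.reshape(H, W, d)

# With identity matrices the operation reduces to the fixed, non-learned version.
I = np.eye(d)
fixed_out, fixed_weights = self_attention(pixels, I, I, I)
```

Each row of `weights` has one entry per pixel in the image, which is the global receptive field the transcript contrasts with convolution: a 3 x 3 convolution would only ever combine a pixel with its eight neighbors.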
Original Description
Full video: https://youtu.be/KnCRTP11p5U?si=SP2WfoTYZQlTKzRN
This is a clip from a full deep-dive that explains why Transformer-based models have replaced Convolutional Neural Networks (CNN) in computer vision.