Why are Transformers replacing CNNs?
Why does a Transformer classify this cat as a cat… while a ResNet calls it a macaw?
In this video we break down one of the biggest shifts in computer vision: why Transformers are replacing Convolutional Neural Networks (CNNs), even though CNNs were designed for images and Transformers for language.
We’ll compare convolution with self-attention, explore CNNs’ inductive biases (locality, translation invariance, hierarchical features), and see why self-attention is strictly more expressive than convolution. You’ll also learn how attention can exactly implement convolutional kernels using relative positional encodings.
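The core trick behind "attention can implement convolution" can be sketched in a few lines of NumPy (this is an illustration of the idea, not code from the video): if each attention head puts all of its weight on one fixed relative offset, then a head per kernel offset, each followed by that offset's kernel weights, adds up to exactly a convolution.

```python
import numpy as np

# Hypothetical minimal sketch: attention heads with hard one-hot
# relative-position weights reproduce a 1-D convolution (kernel size 3).
rng = np.random.default_rng(0)
T, D = 8, 4                      # sequence length, channel dimension
x = rng.standard_normal((T, D))  # input sequence (one "image row")
w = rng.standard_normal((3, D))  # conv kernel: one D-vector per offset

# Reference: 1-D convolution over valid positions, stride 1
conv = np.array([sum(x[t + k - 1] @ w[k] for k in range(3))
                 for t in range(1, T - 1)])

# Attention view: head k attends with a one-hot on relative offset k-1,
# copying x[t + offset]; projecting with w[k] and summing heads
# reproduces the convolution output.
attn = np.zeros(T - 2)
for k, offset in enumerate((-1, 0, 1)):
    A = np.zeros((T, T))
    for t in range(T):
        if 0 <= t + offset < T:
            A[t, t + offset] = 1.0   # hard relative-position attention
    head_out = A @ x                 # position t now holds x[t + offset]
    attn += (head_out @ w[k])[1:T - 1]

print(np.allclose(conv, attn))       # the two outputs match
```

In a real Transformer the attention weights are learned and soft, but this degenerate case is what makes self-attention strictly more expressive: convolution is one point in the space of functions attention can represent.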
Watch on YouTube ↗
Chapters (8)
- Intro (1:30)
- The convolution operation (3:34)
- Convolutional Neural Networks (CNNs) (5:51)
- The inductive bias in CNNs (7:22)
- Self-attention (10:39)
- Self-attention can implement convolutions (14:17)
- Computational power & multi-modality (16:03)
- ChatGPT can be funny
DeepCamp AI