Why are Transformers replacing CNNs?
Why does a Transformer classify this cat as a cat… while a ResNet calls it a macaw?
In this video we break down one of the biggest shifts in computer vision: why Transformers are replacing Convolutional Neural Networks (CNNs), even though CNNs were designed for images and Transformers for language.
We’ll compare convolution with self-attention, explore CNNs’ inductive biases (locality, translation invariance, hierarchical features), and see why self-attention is strictly more expressive than convolution. You’ll also learn how attention can exactly implement convolutional kernels using relative positional encodings.
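The core trick behind "attention can implement convolution" can be sketched in a few lines of NumPy (this is an illustration of the idea, not code from the video): if each attention head puts all of its weight on one fixed relative offset, then a head per kernel offset, each followed by that offset's kernel weights, adds up to exactly a convolution.

```python
import numpy as np

# Hypothetical minimal sketch: attention heads with hard one-hot
# relative-position weights reproduce a 1-D convolution (kernel size 3).
rng = np.random.default_rng(0)
T, D = 8, 4                      # sequence length, channel dimension
x = rng.standard_normal((T, D))  # input sequence (one "image row")
w = rng.standard_normal((3, D))  # conv kernel: one D-vector per offset

# Reference: 1-D convolution over valid positions, stride 1
conv = np.array([sum(x[t + k - 1] @ w[k] for k in range(3))
                 for t in range(1, T - 1)])

# Attention view: head k attends with a one-hot on relative offset k-1,
# copying x[t + offset]; projecting with w[k] and summing heads
# reproduces the convolution output.
attn = np.zeros(T - 2)
for k, offset in enumerate((-1, 0, 1)):
    A = np.zeros((T, T))
    for t in range(T):
        if 0 <= t + offset < T:
            A[t, t + offset] = 1.0   # hard relative-position attention
    head_out = A @ x                 # position t now holds x[t + offset]
    attn += (head_out @ w[k])[1:T - 1]

print(np.allclose(conv, attn))       # the two outputs match
```

In a real Transformer the attention weights are learned and soft, but this degenerate case is what makes self-attention strictly more expressive: convolution is one point in the space of functions attention can represent.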
Watch on YouTube ↗
Chapters (8)
- Intro (1:30)
- The convolution operation (3:34)
- Convolutional Neural Networks (CNNs) (5:51)
- The inductive bias in CNNs (7:22)
- Self-attention (10:39)
- Self-attention can implement convolutions (14:17)
- Computational power & multi-modality (16:03)
- ChatGPT can be funny
DeepCamp AI