Transformers Visually Explained
About this lesson
In this video, we build transformers from scratch and understand how they actually work. We start from embeddings and self-attention, then move on to multi-head attention, positional encoding, and the full encoder-decoder architecture, along with masking, cross-attention, and the difference between training and inference. By the end, you get a complete and intuitive understanding of how modern LLMs like GPT are built.

Attention Is All You Need paper:- https://arxiv.org/abs/1706.03762

These are very important to know before you understand transformers:-
Neural Networks:- https://youtu.be/sE6OaMndGZg
Backpropagation:- https://youtu.be/nAMkcgxKwfA
Normalization:- https://youtu.be/W2vqsTg-rDU
BatchNorm:- https://youtu.be/PaIKIXb3v9Q
RNNs:- https://youtu.be/eCwTQYcNG3o
Residual Connections:- https://youtu.be/M108HPERPc8

Link to the animation code:- https://github.com/ByteQuest0/Animation_codes/tree/main/2026/Transfomers

00:00 Introduction – Why Transformers?
02:44 Tokenization and One-Hot Encoding
04:59 Word Embeddings Explained
08:37 Static Embeddings Problem (Bank Example)
15:12 Self-Attention
18:00 Why Scaling by √dk?
20:42 Self-Attention Recap
21:33 Multi-Head Self-Attention
25:29 Positional Encoding Intuition
30:45 Transformer Architecture Overview
31:36 Residual Connections + LayerNorm
33:00 Feed Forward Network Explained
33:25 Transformer Architecture Overview
34:00 Masked Multi-Head Attention
37:00 Cross-Attention Explained
38:28 Transformer Architecture Overview
39:12 Stacked Layers (Nx)
39:37 Training vs Inference
42:43 Transformers Advantage

🎥 Animations created using Manim:
Manim is an open-source Python library for creating mathematical animations.
Learn more or try it yourself: 🔗 https://www.manim.community

Let's Connect:-
GitHub:- https://github.com/ByteQuest0
Reddit:- https://www.reddit.com/r/ByteQuest/
DeepCamp AI