Coding a Multimodal (Vision) Language Model from scratch in PyTorch with full explanation

Umar Jamil · Beginner · 📐 ML Fundamentals · 1y ago
Full coding of a Multimodal (Vision) Language Model from scratch using only Python and PyTorch. We will code the PaliGemma Vision Language Model from scratch while explaining all the concepts behind it:

- Transformer model (Embeddings, Positional Encoding, Multi-Head Attention, Feed Forward Layer, Logits, Softmax)
- Vision Transformer model
- Contrastive learning (CLIP, SigLip)
- Numerical stability of the Softmax and the Cross Entropy Loss
- Rotary Positional Embedding
- Multi-Head Attention
- Grouped Query Attention
- Normalization layers (Batch, Layer and RMS)
- KV-Cache (prefilling …
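As a small taste of one of the topics covered (numerical stability of the Softmax), here is a minimal sketch in plain Python of the standard max-subtraction trick; the function name and example values are illustrative, not taken from the video's code:

```python
import math

def stable_softmax(logits):
    # Softmax is invariant to adding a constant to every logit,
    # so subtracting the max before exponentiating keeps exp()
    # from overflowing while leaving the result unchanged.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Without the max-subtraction trick, math.exp(1000.0) would overflow.
probs = stable_softmax([1000.0, 1000.0, 999.0])
```

The same idea underlies numerically stable cross-entropy, also discussed in the video.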
Watch on YouTube ↗

Chapters (14)

0:00 Introduction
5:52 Contrastive Learning and CLIP
16:50 Numerical stability of the Softmax
23:00 SigLip
26:30 Why a Contrastive Vision Encoder?
29:13 Vision Transformer
35:38 Coding SigLip
54:25 Batch Normalization, Layer Normalization
1:05:28 Coding SigLip (Encoder)
1:16:12 Coding SigLip (FFN)
1:20:45 Multi-Head Attention (Coding + Explanation)
2:15:40 Coding SigLip
2:18:30 PaliGemma Architecture review
2:21:19 PaliGemma input processor