Coding a Multimodal (Vision) Language Model from scratch in PyTorch with full explanation
We will code the PaliGemma Vision Language Model from scratch, using only Python and PyTorch, while explaining all the concepts behind it:
- Transformer model (Embeddings, Positional Encoding, Multi-Head Attention, Feed Forward Layer, Logits, Softmax)
- Vision Transformer model
- Contrastive learning (CLIP, SigLip)
- Numerical stability of the Softmax and the Cross Entropy Loss
- Rotary Positional Embedding
- Multi-Head Attention
- Grouped Query Attention
- Normalization layers (Batch, Layer and RMS)
- KV-Cache (prefilling …)
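As a small taste of the "numerical stability of the Softmax" topic listed above, here is a minimal sketch of the standard max-subtraction trick. The helper name `stable_softmax` is ours for illustration, not an identifier from the video:

```python
import torch

def stable_softmax(x: torch.Tensor, dim: int = -1) -> torch.Tensor:
    # Softmax is shift-invariant: subtracting the per-row max does not
    # change the result, but it keeps exp() from overflowing to inf.
    x_max = x.max(dim=dim, keepdim=True).values
    exps = torch.exp(x - x_max)
    return exps / exps.sum(dim=dim, keepdim=True)

# Logits this large would overflow a naive exp(x) / exp(x).sum()
logits = torch.tensor([[1000.0, 1001.0, 1002.0]])
print(stable_softmax(logits))  # finite, valid probabilities
```

PyTorch's built-in `torch.softmax` applies the same stabilization internally, so the two agree on any input.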
Watch on YouTube
Chapters (14)
Introduction (5:52)
Contrastive Learning and CLIP (16:50)
Numerical stability of the Softmax (23:00)
SigLip (26:30)
Why a Contrastive Vision Encoder? (29:13)
Vision Transformer (35:38)
Coding SigLip (54:25)
Batch Normalization, Layer Normalization (1:05:28)
Coding SigLip (Encoder) (1:16:12)
Coding SigLip (FFN) (1:20:45)
Multi-Head Attention (Coding + Explanation) (2:15:40)
Coding SigLip (2:18:30)
PaliGemma Architecture review (2:21:19)
PaliGemma input processor
DeepCamp AI