Coding a Multimodal (Vision) Language Model from scratch in PyTorch with full explanation
We will code the PaliGemma Vision Language Model from scratch, using only Python and PyTorch, while explaining all the concepts behind it:
- Transformer model (Embeddings, Positional Encoding, Multi-Head Attention, Feed Forward Layer, Logits, Softmax)
- Vision Transformer model
- Contrastive learning (CLIP, SigLip)
- Numerical stability of the Softmax and the Cross Entropy Loss
- Rotary Positional Embedding
- Multi-Head Attention
- Grouped Query Attention
- Normalization layers (Batch, Layer and RMS)
- KV-Cache (prefilling …)
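As a small taste of the "numerical stability of the Softmax" topic listed above, here is a minimal sketch of the standard max-subtraction trick. The helper name `stable_softmax` is ours for illustration, not an identifier from the video:

```python
import torch

def stable_softmax(x: torch.Tensor, dim: int = -1) -> torch.Tensor:
    # Softmax is shift-invariant: subtracting the per-row max does not
    # change the result, but it keeps exp() from overflowing to inf.
    x_max = x.max(dim=dim, keepdim=True).values
    exps = torch.exp(x - x_max)
    return exps / exps.sum(dim=dim, keepdim=True)

# Logits this large would overflow a naive exp(x) / exp(x).sum()
logits = torch.tensor([[1000.0, 1001.0, 1002.0]])
print(stable_softmax(logits))  # finite, valid probabilities
```

PyTorch's built-in `torch.softmax` applies the same stabilization internally, so the two agree on any input.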
Watch on YouTube
Chapters (14)
Introduction (5:52)
Contrastive Learning and CLIP (16:50)
Numerical stability of the Softmax (23:00)
SigLip (26:30)
Why a Contrastive Vision Encoder? (29:13)
Vision Transformer (35:38)
Coding SigLip (54:25)
Batch Normalization, Layer Normalization (1:05:28)
Coding SigLip (Encoder) (1:16:12)
Coding SigLip (FFN) (1:20:45)
Multi-Head Attention (Coding + Explanation) (2:15:40)
Coding SigLip (2:18:30)
PaliGemma Architecture review (2:21:19)
PaliGemma input processor
DeepCamp AI