LLM Quantization Explained: GPTQ, AWQ, QLoRA, GGUF and More

Tales Of Tensors · Beginner · 🧠 Large Language Models · 2w ago
What does it take to run a massive large language model on consumer hardware? In this video, we break down LLM quantization from first principles and show how techniques like INT8, INT4, GPTQ, AWQ, NF4, QLoRA, GGUF, SmoothQuan…
Watch on YouTube ↗

Chapters (10)

0:00 Introduction to LLM Quantization
2:15 What is Quantization?
4:45 Post-Training Quantization (PTQ) vs. QAT
7:30 GPTQ (Post-Training Quantization for GPT)
11:12 AWQ (Activation-aware Weight Quantization)
14:20 QLoRA (Quantized Low-Rank Adaptation)
18:05 GGUF and llama.cpp
22:30 EXL2 (ExLlamaV2)
25:50 How to Choose the Right Format
28:40 The Future of Quantization
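The description's core idea (mapping high-precision float weights to small integers like INT8) can be illustrated with a minimal sketch of symmetric per-tensor quantization. This is a generic illustration of the concept, not the GPTQ/AWQ/GGUF implementations the video covers; all function names here are hypothetical.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Map float weights to int8 using one shared scale factor."""
    scale = np.abs(weights).max() / 127.0  # largest magnitude maps to 127
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from the int8 values."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=1000).astype(np.float32)

q, scale = quantize_int8(w)
w_hat = dequantize_int8(q, scale)

# Rounding error per weight is bounded by half a quantization step.
max_err = np.abs(w - w_hat).max()
assert max_err <= scale / 2 + 1e-6
```

The weights now occupy one byte each instead of four, at the cost of a small, bounded rounding error; techniques like GPTQ and AWQ refine this basic recipe to push precision down to 4 bits with minimal quality loss.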