GRPO 2.0? DAPO LLM Reinforcement Learning Explained
In this video, we break down "DAPO: An Open-Source LLM Reinforcement Learning System at Scale", a new research paper from ByteDance that introduces DAPO (Decoupled Clip and Dynamic sAmpling Policy Optimization), a reinforcement learning (RL) algorithm built on GRPO (Group Relative Policy Optimization).
DAPO tackles key challenges in training large language models (LLMs) with RL, especially issues encountered when trying to reproduce DeepSeek-R1’s results. The researchers trained Qwen2.5-32B with DAPO, achieving 50 points on the challenging AIME 2024 benchmark — outperforming DeepSee…
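For readers who want a concrete picture of the ideas named in the acronym and the chapter titles, here is a minimal NumPy sketch of three of them: group-relative advantages (inherited from GRPO), the dynamic sampling filter, and a token-level loss with decoupled clip ranges. It is an illustration under stated assumptions, not the paper's code: the function names are hypothetical, the default clip values `eps_low=0.2` and `eps_high=0.28` are recalled from the paper and should be checked against it, and rewards are assumed to be simple per-response scalars.

```python
import numpy as np

def group_advantages(rewards):
    """Standardize each response's reward within its sampled group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

def keep_group(rewards):
    """Dynamic sampling filter (sketch): drop a prompt whose sampled
    responses all got the same reward, since its advantages are all zero."""
    r = np.asarray(rewards, dtype=float)
    return bool(r.max() > r.min())

def clipped_token_loss(ratio, adv, mask, eps_low=0.2, eps_high=0.28):
    """Clipped surrogate loss averaged over all tokens in the batch.

    ratio: new/old policy probability ratio per token, shape (responses, max_len)
    adv:   advantage per token (each response's advantage repeated over its tokens)
    mask:  1 for real tokens, 0 for padding
    eps_low / eps_high: separate lower and upper clip ranges (assumed defaults)
    """
    ratio = np.asarray(ratio, dtype=float)
    adv = np.asarray(adv, dtype=float)
    mask = np.asarray(mask, dtype=float)
    clipped = np.clip(ratio, 1.0 - eps_low, 1.0 + eps_high)
    surrogate = np.minimum(ratio * adv, clipped * adv)
    # Token-level normalization: divide by the total token count,
    # so long responses are not down-weighted relative to short ones.
    return -(surrogate * mask).sum() / mask.sum()
```

Setting `eps_high` larger than `eps_low` gives low-probability tokens more room to grow before the clip takes effect, which is the intuition behind the "Clip-Higher" chapter title.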
Chapters (8)
0:00 Introduction
2:30 Introducing DAPO
5:05 Clip-Higher
7:45 Dynamic Sampling
9:35 Token-Level Loss
11:13 Overlong Responses
12:23 Ablation Study
12:57 KL Divergence Removal
DeepCamp AI