GRPO 2.0? DAPO LLM Reinforcement Learning Explained

AI Papers Academy · Beginner · 📄 Research Papers Explained · 1y ago
In this video, we break down "DAPO: An Open-Source LLM Reinforcement Learning System at Scale", a research paper from ByteDance that introduces DAPO (Decoupled Clip and Dynamic sAmpling Policy Optimization), a reinforcement learning (RL) algorithm built on GRPO (Group Relative Policy Optimization). DAPO tackles key challenges in training large language models (LLMs) with RL, especially issues encountered when trying to reproduce DeepSeek-R1's results. The researchers trained Qwen2.5-32B with DAPO to 50 points on the challenging AIME 2024 benchmark, outperforming DeepSeek-R1's 47 points while using only 50% of the training steps.

Written review: https://aipapersacademy.com/dapo/
Paper: https://arxiv.org/abs/2503.14476
Code & dataset: https://github.com/BytedTsinghua-SIA/DAPO
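To make the name concrete, below is a minimal sketch of the ideas covered in the chapters, written in PyTorch from the paper's description rather than from the released code. The function names, tensor shapes, and the keep_prompt helper are assumptions for illustration; the clip values 0.2 and 0.28 are the ones the paper reports.

```python
# Illustrative sketch only: reconstructed from the paper's description of DAPO,
# not ByteDance's released code. Names and shapes are assumptions.
import torch

def dapo_token_level_loss(logprobs, old_logprobs, advantages, mask,
                          eps_low=0.2, eps_high=0.28):
    """Decoupled-clip ("Clip-Higher") PPO-style loss, averaged over tokens.

    logprobs, old_logprobs: (batch, seq) per-token log-probs, new vs. old policy
    advantages:             (batch, 1) group-normalized reward, broadcast per token
    mask:                   (batch, seq) 1.0 for response tokens, 0.0 for padding
    """
    ratio = torch.exp(logprobs - old_logprobs)  # importance ratio r_t
    # Clip-Higher: a larger upper bound (eps_high > eps_low) lets low-probability
    # tokens gain mass, which the paper argues prevents entropy collapse.
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high)
    per_token = torch.min(ratio * advantages, clipped * advantages)
    # Token-level loss: average over ALL tokens in the batch, so long chains of
    # thought are not down-weighted as they would be by per-sample averaging.
    return -(per_token * mask).sum() / mask.sum()

def keep_prompt(rewards):
    """Dynamic Sampling filter (assumed helper): drop a prompt whose sampled
    group is all-correct or all-wrong, since its group-relative advantages are
    all zero and it contributes no gradient; resample until the batch is full."""
    return len(set(rewards)) > 1
```

Note that the loss has no KL penalty term: DAPO removes the KL divergence entirely, since a long-chain-of-thought policy is expected to drift far from the initial model anyway.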

Chapters (8)

0:00 Introduction
2:30 Introducing DAPO
5:05 Clip-Higher
7:45 Dynamic Sampling
9:35 Token-Level Loss
11:13 Overlong Responses
12:23 Ablation Study
12:57 KL Divergence Removal