GRPO Reinforcement Learning Explained (DeepSeekMath Paper)
In this video, we dive deep into the paper "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models", which introduces GRPO (Group Relative Policy Optimization), a reinforcement learning (RL) algorithm that was later used to train DeepSeek-R1.
DeepSeekMath is a model by DeepSeek designed specifically to excel at mathematical reasoning. We walk through its full training process, which closely mirrors how general-purpose large language models (LLMs) are trained. One of the key stages in this pipeline is reinforcement learning using GRPO.
Since GRPO builds upon PPO (Proximal Policy Optimization), we first provide a high-level overview of PPO before diving into GRPO’s innovations and how it removes the need for a value model.
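To make the "no value model" point concrete, here is a minimal sketch of the group-relative advantage used in the paper's outcome-supervision setting: several answers are sampled for the same question, each gets a scalar reward, and every reward is normalized by the mean and standard deviation of its own group. The function name, the example rewards, the small eps term, and the choice of population standard deviation are assumptions made for illustration, not details taken from the video.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own group.

    The group statistics play the role of the baseline that a learned
    value model would provide in PPO.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std (an assumption)
    # eps avoids division by zero when every reward in the group is equal
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four answers sampled for the same question
# (1.0 = correct final answer, 0.0 = incorrect).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Answers that score above the group average get a positive advantage and those below get a negative one, so the policy is pushed toward the better samples without training a separate critic.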
Paper - https://arxiv.org/abs/2402.03300
Written Review - https://aipapersacademy.com/deepseekmath-grpo/
___________________
🔔 Subscribe for more AI paper reviews!
📩 Join the newsletter → https://aipapersacademy.com/newsletter/
Become a patron - https://www.patreon.com/aipapersacademy
The video was edited using VideoScribe - https://tidd.ly/44TZEiX
___________________
Chapters:
0:00 Introduction
1:35 Math Pre-Training
4:55 Instruction-Tuning
5:45 PPO
7:45 GRPO
9:35 GRPO Objective