GRPO Reinforcement Learning Explained (DeepSeekMath Paper)
In this video, we dive deep into the paper "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models", which introduces GRPO (Group Relative Policy Optimization), a reinforcement learning (RL) algorithm that was later used to train DeepSeek-R1.
DeepSeekMath is a model by DeepSeek designed specifically to excel at mathematical reasoning. We walk through its full training process, which closely mirrors how general-purpose large language models (LLMs) are trained. One of the key stages in this pipeline is reinforcement learning using GRPO.
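To make the "group relative" part of the name concrete, here is a minimal sketch of how a per-answer advantage might be computed, assuming one scalar reward per sampled answer and a group of answers drawn for the same question. The function name `group_relative_advantages`, the `eps` term, and the choice of population standard deviation are illustrative assumptions, not taken from the video or the paper's code.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each reward relative to its own group (hypothetical helper).

    Each reward is shifted by the group mean and scaled by the group
    standard deviation, so no separate value model is needed as a baseline.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; eps avoids division by zero
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one question: two correct (1.0), two wrong (0.0).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Correct answers end up with positive advantages and wrong ones with negative advantages; when every answer in the group gets the same reward, all advantages are zero and the group contributes no learning signal.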
Since GRPO builds upon PPO (Proximal Po…
Chapters (6)
Introduction
1:35 Math Pre-Training
4:55 Instruction-Tuning
5:45 PPO
7:45 GRPO
9:35 GRPO Objective
DeepCamp AI