GRPO Reinforcement Learning Explained (DeepSeekMath Paper)

AI Papers Academy · Beginner · 📄 Research Papers Explained · 11mo ago
In this video, we dive deep into the paper "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models", which introduces GRPO (Group Relative Policy Optimization), a novel reinforcement learning (RL) algorithm later used to train DeepSeek-R1. DeepSeekMath is a model by DeepSeek designed specifically to excel at mathematical reasoning. We walk through its full training process, which closely mirrors how general-purpose large language models (LLMs) are trained. One of the key stages in this pipeline is reinforcement learning using GRPO. Since GRPO builds upon PPO (Proximal Po…

Chapters (6)

0:00 Introduction
1:35 Math Pre-Training
4:55 Instruction-Tuning
5:45 PPO
7:45 GRPO
9:35 GRPO Objective
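
As a rough illustration of the "group relative" idea behind the GRPO chapters, the sketch below scores several sampled answers to one prompt and normalizes each reward against the mean and standard deviation of its own group, so no separate value model is needed to estimate a baseline. This follows the outcome-supervision advantage described in the DeepSeekMath paper, not anything shown in this page's text. The function name, the `eps` stabilizer, and the choice of population standard deviation are assumptions made for the example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group's mean and std.

    `rewards` holds one scalar reward per sampled answer to the same prompt.
    `eps` is a hypothetical stabilizer that avoids division by zero when
    every answer in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one math problem: 1.0 = correct, 0.0 = incorrect.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# A group where every answer gets the same reward carries no learning signal.
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

Answers that beat their group's average get a positive advantage and the rest get a negative one. In the full objective these advantages weight a PPO-style clipped probability ratio alongside a KL penalty toward a reference model, which this sketch leaves out.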