GRPO Reinforcement Learning Explained (DeepSeekMath Paper)

AI Papers Academy · Beginner · 📄 Research Papers Explained · 1y ago


In this video, we dive deep into the paper "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models", which introduces GRPO (Group Relative Policy Optimization), a reinforcement learning (RL) algorithm that was later used to train DeepSeek-R1. DeepSeekMath is a model by DeepSeek designed specifically for mathematical reasoning.

We walk through its full training process, which closely mirrors how general-purpose large language models (LLMs) are trained. One of the key stages in this pipeline is reinforcement learning using GRPO. Since GRPO builds on PPO (Proximal Policy Optimization), we first give a high-level overview of PPO, then cover what GRPO changes and how it removes the need for a value model.

Paper - https://arxiv.org/abs/2402.03300
Written Review - https://aipapersacademy.com/deepseekmath-grpo/

Chapters:
0:00 Introduction
1:35 Math Pre-Training
4:55 Instruction-Tuning
5:45 PPO
7:45 GRPO
9:35 GRPO Objective
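To make the "no value model" idea concrete, here is a minimal sketch of the group-relative advantage that GRPO uses in place of a learned value baseline: several answers are sampled for the same question, and each answer's reward is normalized by the mean and standard deviation of its own group. The function name, the `eps` term, the choice of population standard deviation, and the example rewards are all illustrative assumptions, not code from the paper or the video.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the other samples in its group.

    rewards: one scalar reward per sampled answer to the same question.
    eps: small constant (an assumption here) that avoids division by zero
         when every answer in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled answers to one math question
# (1.0 = correct final answer, 0.0 = incorrect).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Answers that score above the group average get a positive advantage and those below get a negative one, so the baseline comes from the group itself instead of a separately trained value network as in PPO.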



