GDPO Explained: NVIDIA Fixes GRPO for LLM Reinforcement Learning
NVIDIA recently introduced GDPO in a paper titled "GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization".
GDPO is a new reinforcement learning algorithm designed to fix GRPO’s limitations in multi-reward LLM training.
In this video, we explain how GDPO works, why standard GRPO fails with multiple rewards, and how reward-decoupled normalization improves advantage estimation and model performance.
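To make the reward-collapse idea concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the code from the NVlabs/GDPO repository: the function names are hypothetical, the toy rewards are made up, and the decoupled variant only shows the core idea of normalizing each reward separately within the group before summing, leaving out any further steps the paper may apply.

```python
from statistics import mean, pstdev


def normalize(values, eps=1e-6):
    """Subtract the group mean and divide by the group standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def summed_then_normalized(rewards):
    """GRPO-style (hypothetical helper): sum each rollout's rewards, then normalize the sums."""
    totals = [sum(r) for r in rewards]
    return normalize(totals)


def normalized_then_summed(rewards):
    """Decoupled (hypothetical helper): normalize each reward within the group, then sum."""
    columns = list(zip(*rewards))                 # one column per reward type
    per_reward = [normalize(col) for col in columns]
    return [sum(vals) for vals in zip(*per_reward)]


# Toy data (made up): two groups of two rollouts, each scored by two binary rewards.
group_a = [(0, 0), (1, 0)]  # second rollout satisfies one reward
group_b = [(0, 0), (1, 1)]  # second rollout satisfies both rewards

print("summed first:   ", summed_then_normalized(group_a), summed_then_normalized(group_b))
print("decoupled first:", normalized_then_summed(group_a), normalized_then_summed(group_b))
```

With summation first, both groups end up with nearly identical advantages (about -1 and +1), so the rollout that satisfies both rewards gets no stronger signal than the one that satisfies a single reward. With per-reward normalization first, the second group's advantages are roughly twice as large, which keeps the two cases distinguishable.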
Written Review - https://aipapersacademy.com/gdpo/
Paper - https://arxiv.org/abs/2601.05242
Code - https://github.com/NVlabs/GDPO
GRPO Deep Dive - https://ai…
Chapters (6)
Introduction
1:51 GRPO Recap
3:30 Multi-Reward GRPO
4:30 GRPO Reward Collapse
6:00 GDPO's Fix
7:26 GDPO Results