Tracing GRPO's Biased Objective Back to DeepSeek Math
Zichen Liu, author of Dr. GRPO, traces the length-normalization term in the standard GRPO objective back to its origin: the equation in the DeepSeek-Math paper and the common implementation choice of averaging the loss over the token axis instead of summing it.
This biased formulation then propagated through follow-up papers and major open-source libraries such as TRL, OpenRLHF, and verl.
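The normalization issue described above can be sketched numerically. The snippet below is an illustrative example, not code from the talk or from any of the named libraries: it contrasts averaging per-token losses over the token axis (the common implementation) with summing them and dividing by a constant budget (the Dr. GRPO-style fix). The advantage value, log-probabilities, and `BUDGET` constant are all hypothetical.

```python
import numpy as np

# Illustrative sketch with hypothetical numbers: how averaging the loss
# over the token axis biases GRPO-style objectives, vs. summing with a
# constant normalizer.
adv = 1.0                        # group-normalized advantage, shared by all tokens
short = np.full(5, -2.0)         # per-token log-probs of a 5-token response
long_ = np.full(50, -2.0)        # per-token log-probs of a 50-token response

# Biased: mean over tokens gives each token of a response weight 1/length,
# so tokens in longer responses receive smaller per-token gradients.
loss_mean_short = -(adv * short).mean()   # 2.0
loss_mean_long = -(adv * long_).mean()    # 2.0: same loss despite 10x the tokens

# Constant normalizer: sum over tokens, divide by a fixed budget, so every
# token carries the same weight regardless of response length.
BUDGET = 100                               # hypothetical fixed generation budget
loss_sum_short = -(adv * short).sum() / BUDGET   # 0.1
loss_sum_long = -(adv * long_).sum() / BUDGET    # 1.0: scales with token count

print(loss_mean_short, loss_mean_long, loss_sum_short, loss_sum_long)
```

Under the mean variant, both responses produce the same loss even though the longer one contains ten times as many tokens, which dilutes each of its tokens' gradient contributions; the constant-normalizer variant weights every token equally.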
Watch on YouTube ↗