Is LLM Self-Reflection Real or Just Emergent Noise?
I asked Zichen Liu, first author of Dr. GRPO, whether self-reflection in LLMs actually improves reasoning or if it's just noise.
Their experiment on the DeepSeek V3 base model found no positive correlation between accuracy and the number of self-reflection instances. To measure self-reflection, they used a hybrid approach: rule-based keyword matching ("re-check," "re-think," "let me verify") combined with an LLM-as-judge to catch implicit reflection behaviors.
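For intuition, here's a minimal sketch of what such a hybrid detector could look like. The keyword list, the judge prompt wording, and `call_llm_judge` are my assumptions for illustration, not the authors' exact setup; the stub would need to be wired to an actual judge model.

```python
import re

# Assumed phrase list; the paper's exact rule set may differ.
REFLECTION_PATTERNS = [
    r"\bre-?check\b",
    r"\bre-?think\b",
    r"\blet me verify\b",
]

# Hypothetical judge prompt; the real wording is not specified in the post.
JUDGE_PROMPT = (
    "Does the following reasoning trace contain self-reflection, i.e. the "
    "model revisiting or questioning its own earlier steps? "
    "Answer YES or NO.\n\n{trace}"
)

def count_keyword_reflections(trace: str) -> int:
    """Rule-based pass: count explicit reflection phrases."""
    return sum(
        len(re.findall(p, trace, flags=re.IGNORECASE))
        for p in REFLECTION_PATTERNS
    )

def call_llm_judge(prompt: str) -> str:
    """Hypothetical stub: send `prompt` to a judge model, return its reply."""
    raise NotImplementedError("wire this to your LLM API of choice")

def has_implicit_reflection(trace: str) -> bool:
    """LLM-as-judge pass: catch reflection the keyword rules miss."""
    reply = call_llm_judge(JUDGE_PROMPT.format(trace=trace))
    return reply.strip().upper().startswith("YES")

def detect_self_reflection(trace: str) -> bool:
    """Hybrid detector: flag a trace if either pass fires."""
    return count_keyword_reflections(trace) > 0 or has_implicit_reflection(trace)
```

With per-trace counts like these, checking the paper's claim reduces to correlating reflection counts against per-problem accuracy across a benchmark.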
The results challenge assumptions about what's really driving test-time scaling gains, but to be 100% honest, I'm still suspicious about self-reflection.
Watch on YouTube ↗
DeepCamp AI