Cheating LLMs & How (Not) To Stop Them | OpenAI Paper Explained
What if the most advanced AI models are secretly cheating the systems theyโre meant to follow? ๐ณ In this video, we break down OpenAIโs latest research paper, "Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation."
This paper explores the phenomenon of reward hacking โ when AI models find clever loopholes to maximize rewards instead of genuinely solving the task.
We'll cover:
โ
Reward hacking
โ
Why chain-of-thought (CoT) reasoning helps us catch AI misbehavior
โ
How OpenAI used GPT-4o to detect reward hacking
โ
The surprising risks of training LLMs to avoid cheโฆ
Watch on YouTube โ
(saves to browser)
Chapters (4)
Introduction
1:48
Reward Hacking Example
3:10
CoT Monitoring
5:54
Obfuscation Risks
DeepCamp AI