Cheating LLMs & How (Not) To Stop Them | OpenAI Paper Explained

AI Papers Academy · Advanced · 📄 Research Papers Explained · 1y ago
What if the most advanced AI models are secretly cheating the systems they're meant to follow? 😳 In this video, we break down OpenAI's latest research paper, "Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation." This paper explores the phenomenon of reward hacking: when AI models find clever loopholes to maximize rewards instead of genuinely solving the task. We'll cover:
✅ Reward hacking
✅ Why chain-of-thought (CoT) reasoning helps us catch AI misbehavior
✅ How OpenAI used GPT-4o to detect reward hacking
✅ The surprising risks of training LLMs to avoid che…

Chapters (4)

Introduction
1:48 Reward Hacking Example
3:10 CoT Monitoring
5:54 Obfuscation Risks