Cheating LLMs & How (Not) To Stop Them | OpenAI Paper Explained
What if the most advanced AI models are secretly cheating the systems they're meant to follow? In this video, we break down OpenAI's latest research paper, "Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation."
This paper explores the phenomenon of reward hacking: when AI models find clever loopholes to maximize rewards instead of genuinely solving the task.
We'll cover:
- Reward hacking
- Why chain-of-thought (CoT) reasoning helps us catch AI misbehavior
- How OpenAI used GPT-4o to detect reward hacking
- The surprising risks of training LLMs to avoid cheating
Paper - https://openai.com/index/chain-of-thought-monitoring/
Written Review - https://aipapersacademy.com/cheating-llms/
___________________
Subscribe for more AI paper reviews!
Join the newsletter - https://aipapersacademy.com/newsletter/
Patreon - https://www.patreon.com/aipapersacademy
The video was edited using VideoScribe - https://tidd.ly/44TZEiX
___________________
Chapters:
0:00 Introduction
1:48 Reward Hacking Example
3:10 CoT Monitoring
5:54 Obfuscation Risks