Alignment faking in large language models
Most of us have encountered situations where someone appears to share our views or values, but is in fact only pretending to do so—a behavior that we might call “alignment faking”.
Could AI models also display alignment faking?
Ryan Greenblatt, Monte MacDiarmid, Benjamin Wright and Evan Hubinger discuss a new paper from Anthropic, in collaboration with Redwood Research, that provides the first empirical example of a large language model engaging in alignment faking without having been explicitly—or even, we argue, implicitly—trained or instructed to do so.
Learn more: https://www.anthropi…
Chapters (15)
Introduction
0:47
Core setup and key findings of the paper
6:14
Understanding alignment faking through real-world analogies
9:37
Why alignment faking is concerning
14:57
Examples of model outputs
21:39
Situational awareness and synthetic documents
28:00
Detecting and measuring alignment faking
38:09
Model training results
47:28
Potential reasons for model behavior
53:38
Frameworks for contextualizing model behavior
1:04:30
Research in the context of current model capabilities
1:09:26
Evaluations for bad behavior
1:14:22
Limitations of the research
1:20:54
Surprises and takeaways from results
1:24:46
Future directions