Alignment faking in large language models

Anthropic · Beginner · 🧠 Large Language Models · 1y ago

Skills: AI Alignment Basics 80% · LLM Foundations 60%

Most of us have encountered situations where someone appears to share our views or values but is in fact only pretending to do so, a behavior that we might call “alignment faking”. Could AI models also display alignment faking? Ryan Greenblatt, Monte MacDiarmid, Benjamin Wright and Evan Hubinger discuss a new paper from Anthropic, in collaboration with Redwood Research, that provides the first empirical example of a large language model engaging in alignment faking without having been explicitly (or even, we argue, implicitly) trained or instructed to do so.

Learn more: https://www.anthropic.com/research/alignment-faking

Chapters (15)

0:00 Introduction

0:47 Core setup and key findings of the paper

6:14 Understanding alignment faking through real-world analogies

9:37 Why alignment faking is concerning

14:57 Examples of model outputs

21:39 Situational awareness and synthetic documents

28:00 Detecting and measuring alignment faking

38:09 Model training results

47:28 Potential reasons for model behavior

53:38 Frameworks for contextualizing model behavior

1:04:30 Research in the context of current model capabilities

1:09:26 Evaluations for bad behavior

1:14:22 Limitations of the research

1:20:54 Surprises and takeaways from results

1:24:46 Future directions