Defending against AI jailbreaks
Anthropic researchers Mrinank Sharma, Jerry Wei, Ethan Perez and Meg Tong discuss a system based on Constitutional Classifiers that guards models against jailbreaks.
Read more: https://www.anthropic.com/news/constitutional-classifiers
Chapters (17)
0:00 Introduction
0:39 Defining jailbreaks and their importance
3:35 Universal jailbreaks
10:24 The Swiss cheese model for safety
11:25 Explaining Constitutional Classifiers
14:11 Ensuring model helpfulness
17:30 Understanding the constitution and synthetic data
19:00 Flexibility of the constitutional approach
24:15 Origins of the constitutional classifiers approach
32:24 Progress on robustness
38:47 The public demo: Purpose, setup
47:42 Understanding whether the approach is safe in practice
54:05 The public demo: Approaches people tried to bypass classifiers
56:14 Benefits of the classifier approach for Claude users
1:00:18 Memorable moments from the project
1:08:20 Differences in approach between this project and other research
1:11:11 The evolution of AI safety research
Playlist UUrDwWp7EBBv4NwvScIpBDOA · Anthropic · 49 of 60
Quick tips for Claude: Long context file uploads
Inside our first Anthropic Hackathon, San Francisco
Long inputs, multi-step output with Claude
Coding with Claude
Behind the prompt: Prompting tips for Claude.ai
Robin AI, powered by Claude
Claude 3 Opus as an economic analyst
Claude 3 Sonnet as a language learning partner
Claude 3 Haiku turns thousands of physical documents into structured data
Claude 3 Haiku for instant customer service
Claude 3 Haiku for fast document analysis
Tool use with the Claude 3 model family
Coming soon to the Team plan on Claude.ai
Introducing the Claude iOS app
Claude is now available in Europe
What is interpretability?
What should an AI's personality be?
Scaling interpretability
Claude 3.5 Sonnet for sparking creativity
Claude 3.5 Sonnet for vision
Claude 3.5 Sonnet as a writing partner
Claude 3.5 Sonnet for agentic coding
Shareable Projects in Claude
Evaluate prompts in the Anthropic Console
Shareable Artifacts in Claude
How we built Artifacts with Claude
Wedia advances digital asset management with Claude
AI prompt engineering: A deep dive
AI Prompt Engineering 101: Explained
Ancient Wisdom, Modern AI?
AI's Greatest Challenge: You?
AI Prompts That Drive Growth
Tips For Better Results With AI
AI, policy, and the weird sci-fi future with Anthropic’s Jack Clark
European Parliament expands access to their archives with Claude in Amazon Bedrock
Claude | Computer use for automating operations
Claude | Computer use for orchestrating tasks
Claude | Computer use for coding
Asana supercharges work management with Claude
What do people use AI models for?
Alignment faking in large language models
Building Anthropic | A conversation with our co-founders
How difficult is AI alignment? | Anthropic Research Salon
Tips for building AI agents
Claude 3.7 Sonnet with extended thinking
Introducing Claude Code
Advice For Building AI Agents
The Two Most Useful Applications of AI Agents
Defending against AI jailbreaks
The Most Common Mistake People Make When Building AI Agents
Controlling powerful AI
How Intercom is redefining customer support with Claude
Tracing the thoughts of a large language model
Introducing Claude for Education
Could AI models be conscious?
Lessons on AI agents from Claude Plays Pokemon
The Societal Impacts of AI
What Does AI Mean for the Future of Work?
Understanding AI Agents...Through Pokémon
What Pokémon Teaches Us About Building With AI