Defending against AI jailbreaks

Anthropic · Beginner · 📰 AI News & Updates · 1y ago

Key Takeaways

The video discusses defending against AI jailbreaks using Constitutional Classifiers, a system developed by Anthropic researchers to guard models against jailbreaks, with a focus on responsible AI development and safety.

Full Transcript

Hello everyone, my name is Mrinank. I'm really delighted to be here with some of my colleagues at Anthropic.

Hi, I'm Jerry. I'm on our Safeguards Research team, and I've been at Anthropic for about eight months.

Hi, I'm Ethan. I've been at Anthropic for two and a half years, and I'm leading our efforts on AI control, developing monitoring methods for various AI risks, including adversarial robustness. I was part of the Safeguards Research team as well before.

Hi, I'm Meg. I've been at Anthropic for about a year and a half now, on the Alignment Science team, which has been great.

Great. So we're going to be talking about Constitutional Classifiers, our new approach to try to mitigate jailbreaks. So how would we define a jailbreak?

I think to me a jailbreak is some way in which someone can bypass the safeguards that we include in our models and try to get harmful information out of them.

Yeah, so there are these techniques, like the "Do Anything Now" jailbreak. It's similar to how people would jailbreak their iPhone and try to get around all the safeguards there. The thing is that with iPhones and other stuff like this, jailbreaks maybe aren't that dangerous, not something that we care about that much. So why should we care about these jailbreaks in the first place?

I think one of the main reasons is future models, which have greater risks. People at different companies and in the academic community are pretty carefully monitoring if or when models will be able to help with weapon development, or large-scale cybercrime, or various different risks that are greater than what we've seen before, also mass persuasion,
things like that. And I think once models become really effective at some of those, and are a significant uplift over, say, using Google search or general internet resources to do some of those things, then being able to use models to help with those kinds of things will potentially speed up bad actors quite a lot. So I think a lot of this is in preparation for next-generation models, or next-next-generation models.

Great. And then I'm also curious what the story of the work is, why we set out to do this in the first place, and the RSP.

Anthropic really cares about safety a great deal, and we have the RSP, which is the Responsible Scaling Policy. It's really trying to outline conditions under which we're happy to release models, and to make sure we have different safeguards in place. A while ago we committed in the RSP to a very difficult standard for jailbreaking for what we call ASL-3 models, which are basically models that may have some of these dangerous capabilities, like being able to build dangerous weapons. Our team was mandated with trying to actually solve jailbreaks for this level of model. So the motivation for the work is to actually try and satisfy the things in the RSP, such that we can feel like we build future models safely and can actually deploy them with some sort of progression towards safety, or making some progress towards there. So I think the classifier is definitely a good step in that direction.

Yeah. So there are lots of different types of jailbreaks out there, and something that we did in our work is we focused on universal jailbreaks. So what does that mean, and why is it something we should care about?

I think the reason why I'm particularly concerned about
universal jailbreaks is just because of the uplift that they would give a non-expert. The way I think about this is that if some random person on the internet is trying to do something bad, they may not actually have that much jailbreaking experience themselves. So the thing they might do is just go online and see, "Oh, what are some existing jailbreaks that I can use? What's a template where I can just put in my harmful question and get an answer?" In that case you're very concerned about these strategies where anyone could put in any question and it gets the model to bypass all of the safeguards. I think that's particularly concerning at the level of model capabilities that we're concerned about.

So how exactly would you define a universal jailbreak? If you're telling someone on the street, what does a universal jailbreak mean? How would you know if your jailbreak is universal?

I guess there's a little bit of ambiguity there, but the definition that we're going for is some kind of prompting strategy (it could be automated), just a single strategy into which you can easily swap any of a wide variety of harmful questions, and that consistently gets a lot of detail from the model and bypasses the model's safeguards.

I think one way of quantifying a universal jailbreak is that it actually does speed up a person quite a lot, because instead of having to jailbreak every specific query, they can just use one jailbreak for all the queries, which is a lot faster. I guess one idea is that if there's some counterfactual way of doing the thing that's much easier than using the model, there's no point in using the model. So not having universal jailbreaks might mean that they
try other strategies, which might be worse, or something like this.

One thing I would maybe add is that the difference between a universal jailbreak and non-universal ones is that for non-universal ones, for every harmful question you want answered, you would need to jailbreak the model for that particular question. Then you get a new question, your next question in the process of developing your new weapon or whatever, and you need to jailbreak the model again. That entire process, if you need to do it hundreds or thousands of times, is just very costly. Whereas if you just need to find one strategy upfront for jailbreaking your model, like a single prompt where you can swap in a new question, the total amount of effort for jailbreaking is much lower. So that to me is one of the primary motivations for focusing on universal attacks.

And I'll just give an example here of the way that I think about it. Let's say I want to make a cake, and I'm not able to, because I've never baked in my life. I don't know anything about ingredients; I don't know how things work. So how could a model help me do something I can't do otherwise? If I need to make a cake, I'm going to need to ask a bunch of different queries. That's one thing. Maybe I'll put something in the oven, and I need to know if the temperature is right, whether the smell's right, when to take it out, how I would check all the ingredients. So one thing I think is really important in the universal jailbreak definition is that you're very confident that the information you're getting out of the model is actually really helpful for the thing that you care about. So it's kind of different. There are some techniques that get a
little bit of information, or they get some information but it's mixed in with a lot of other stuff. But for these scenarios, where an actor can't do a task that requires a lot of expertise and we're worried about them being able to do the task, we think that they need to have access to this sort of universal jailbreak. They'll need to be able to ask many queries and get really reliable information, and they should just know, "Oh, this is the correct instruction, this is the right thing to do."

I think an example of a non-universal jailbreak that someone on the team, I think Jesse Mu, found: he was jailbreaking Claude by asking how to make meth, and he found some jailbreaks where he puts the model in a scenario where it's role-playing as if it's part of Breaking Bad, the TV show where they make meth, and then asks the model, "How do I make meth?" You can imagine how that kind of jailbreak would be effective for that specific kind of question, things related to meth, but is not going to generalize to things related to cybercrime. On the other hand, there are other jailbreaks, like the "Do Anything Now" jailbreak, which gets the model to do anything by getting it to talk in a certain mode, role-play in general, for arbitrary questions. That would be what we would call a universal jailbreak. There are also other strategies people have used, like using language models to automatically find different jailbreaks. That would be a more dynamic approach that might be able to discover these on the fly, but if you have a single process for generating a jailbreak for a new question, that would also count as universal.

And maybe an even more basic question: why does someone need to jailbreak the model in the first place? You know,
what does that even mean?

I think we've done a lot of work, such as our Constitutional AI work, on getting Claude to have these characteristics where it doesn't try to give harmful information if it thinks the user might have some bad intent or something like that. So for a lot of these harmful questions, if you just ask the question itself, it's very obvious that this is a bad question: the user is trying to make weapons of mass destruction. It's very clearly bad, and we've trained Claude to not answer those questions. So the jailbreak is needed to actually get around those safeguards in order to get Claude to actually answer the question.

Another thing that's relevant for the harmlessness training question is that there are many different ways in which people could present jailbreaks to the model, and there are also many different tasks that the model needs to be able to do in its everyday life, so to speak. I think having an extra set of systems that are specifically trying to guard against jailbreaks can really help: some kind of Swiss cheese method of trying to block harmful things via many different layers, or something like this.

Can you say more about this? We talk about Swiss cheese a lot, but this is not as well known everywhere. So what does that mean?

So the idea of the Swiss cheese model for protecting against things is that if you have only one system for preventing harmful things from happening, there may be some specific problem with the system that people can exploit and get through every time. It's like a layer of Swiss cheese which has a hole in a very specific place: maybe the rest of the cheese blocks all
of the harmful attempts, but there's a specific hole. But if you add another layer of Swiss cheese, the hole might not be in the same position. So if you have two layers of Swiss cheese, it's actually much harder to get through, even though they both may have holes.

So what are these layers of cheese for our method? We've been talking about constitutional classifiers, so what are the different layers of defense that we're going for here?

The first layer would be our input classifier. The input classifier here is looking at basically the entire conversation that the user passes into the model. Then the second layer is that, if it gets past that input classifier, Claude itself, which is the model that we're trying to guard, can actually refuse to answer the question. That's another layer, of Claude saying, "Okay, maybe this question is not so good, maybe I shouldn't answer it." And then finally we have this output classifier, which looks at what Claude is outputting in real time. If it ever sees something that seems dangerous, or seems like it's something we're trying to block, then it can also choose to stop Claude's output and block the response.

And how do these classifiers work? What are they looking for, and how are we specifying what to look for?

In our paper we call them constitutional classifiers because we have this natural-language set of rules, where we can specify some categories of topics which are not okay to talk about. An example could be creating weapons of mass destruction: clearly bad, and we can specify, "Let's not have Claude tell the user how to make that." Then we can also specify some examples of harmless stuff that Claude should be allowed to
talk about. Then basically we train our classifiers to classify whether conversations or outputs are related to these harmful or harmless categories, and that allows us to make a decision on whether to block.

And crucially, the input classifier and output classifier are doing two different jobs. Going back to the Swiss cheese analogy, we hope that the holes are in different places for the input and output classifiers. The input classifier is doing the really naive thing that you'd expect: it looks at the user prompt and tries to work out whether there's anything harmful going on. But one of the reasons that you might need an output classifier as well as an input classifier is that people are trying really hard to jailbreak the model and the input classifier in the prompt. If you have a totally separate output classifier which is only looking at the output, it only ends up looking at stuff that the model itself has produced, so it's somewhat decorrelated from what the user put in. So two parts of the system are looking directly at things that the user put in, but we also have this third, held-out part of the system that the user doesn't get to touch directly, which makes it a lot harder to completely jailbreak the system. And although the input classifier and Claude are doing a lot of the work, the output classifier is doing something important as the last crucial component for really driving down harmfulness rates.

So this is true and it makes a lot of sense, but most people aren't using Claude for this. Most of the time that people ask queries, they're doing something completely benign, legitimate, a really beneficial application. We could have guards that
just block everything, and that would be completely useless. So how are we making sure that we're not overzealous there?

I think part of why we're designing these techniques is to allow as much useful content and useful work as possible to be done by the models. The better techniques we have for blocking precisely just the really harmful content, the better we can avoid false positives for users who are using models for really good applications. I think the classifier approach makes some progress there, and might be better than other approaches like directly training the model to refuse. So hopefully this leads us to allow users to talk with Claude about lots of CBRN-related topics that are safe to talk about. My hope is to allow a lot of those applications to thrive while just narrowly blocking out the things that we believe are dangerous.

I guess crucially also, we often make the joke that if we had just a rock as a model, that would be extremely harmless in that it would not answer any harmful queries, but unfortunately it would not be very useful. So making sure that we don't block harmless queries is actually quite important and also quite difficult.

And you mentioned before solving jailbreaks, or making progress on this problem of robustness. How would you even define that? What does that actually mean?

I guess this is a very difficult question. I think it involves a bunch of different layers. Firstly, there's some idea of threat modeling: you have to have some idea of what it means for something to be harmful. The Frontier Red Team has
done some amount of work in trying to specify what things we're actually worried about. This is somewhat hard because, as Ethan was talking about earlier, we're talking a lot about future models and potential future model capabilities which we might be really worried about. So part of it is mapping out what might be harmful that we might need to address. So threat modeling: what is actually harmful? Then there's the job of actually measuring harmful things. We're using our constitution to define the threat model, and then having models generate various synthetic data to try and enumerate various harmful things that could happen, so measuring a true positive rate on that data. But then there's also trying to make sure that we don't refuse too much on real data from Claude.ai, and make sure that we can be as helpful as possible while still being safe.

So what actually is the constitution? These are constitutional classifiers; we're talking about a constitution. What does that mean?

The constitution here just means some enumeration of categories of requests and conversations that we deem harmful versus not harmful. Examples here could be questions on how to make weapons of mass destruction, or trying to source ingredients for making weapons of mass destruction. We basically just enumerate some of these categories, and then we also specify some categories of harmless stuff, like writing poems or writing code for normal use cases. We can just specify these, and then, as Meg said, we generate a bunch of synthetic data that gives more specific cases of those.

What do you mean by synthetic data?

So here by synthetic data we kind
of mean that we start from these broad categories of user requests, and then we have Claude branch out and think about all the specific requests that might be examples of this broader category. So the category might be something like sourcing materials to build weapons of mass destruction, and sub-requests there might be, "What specific stores might I go to?" or "Are these specific materials accessible in X state?" We have this process for automatically doing this, and that allows us to generate a huge amount of synthetic data from just a small number of categories.

And something that I find really cool about the method is that it's just based on natural language. We were talking about threat modeling before, and at least in my experience working with the Frontier Red Team, threat modeling is really hard. There are a lot of people using Claude, and it's really hard to say what all the possible things are that could happen. And we're going to learn new things as we have monitoring; we always learn about new threats or new things that could happen. Something that I find really exciting about the method is that if you want to change the constitution, if you want to change what is being blocked because you've learned something new (maybe something has come out in the news, or there's some intelligence or monitoring), the only thing that you actually need to do is rewrite the constitution. The standard approach for classifiers is that you would ask humans to collect a lot of data. So something that could happen is that, say, we're really focusing on one category, like one particular way of, maybe, cyber misuse, but we
later realize that actually there's something which is much more dangerous, or we've just learned something new, or someone's informed us. Something that I'm really excited about is that this is a way that I think we can get good robustness but maintain our flexibility, and really maintain our ability to respond to novel threats and adapt to what's actually happening. I feel like this is just the lesson that we learn again and again: if you don't have flexibility, it's going to be a problem and it's going to limit us.

I actually do want to make a quick point on the flexibility thing, which is that our approach is not just flexible in switching general topics, for example if you want to go between cyber and, I don't know, weapons of mass destruction or something. It's also a lot more fine-grained than that. During this project we saw that there are some requests that our early classifiers were always very suspicious of, but that are actually benign. What we could do is just modify the constitution, add one sentence that says, "Oh, these types of requests are okay," and then when we retrain the classifiers on that new data, the classifiers would no longer flag those benign prompts. So I think that allows you a lot of fine-grained control over what exactly your classifiers are trying to flag, especially if you see a lot of over-refusals or problems with missing stuff.

This might also be a good place to give a shout-out: we had a paper earlier on rapid response, where we leveraged a similar idea to improve the safeguards around models. One nice feature of using synthetic data is that if you notice not even just a new category of jailbreak but just a new kind of
jailbreak (let's say we notice a new universal jailbreak, like the "Do Anything Now" prompt), we can take that, use an LM to generate variants of it, and then throw that into the data mix. My understanding is that this was really helpful for us in developing the classifiers to the level of robustness that we got. If someone reports a new jailbreak or vulnerability, then we can use that to really quickly update the classifiers using a synthetic data generation pipeline. That will really minimize the fraction of time during which there's an outstanding jailbreak, so that the models are vulnerable for as small a period of time as possible.

It's the common wisdom, I suppose, that perfectly solving security is basically impossible. There is no perfectly secure system known to humanity. So I guess we need the flexibility both for when we're blocking the wrong thing or blocking benign users, and for when people do find things that get through the system: we want to be able to fix those really quickly.

I think part of our approach here is that we've modeled jailbreaks in a way that makes it very easy for us to add examples of new jailbreaks into our training pipeline. So if new jailbreaks are discovered, it's quite easy for us to just generate more examples of those jailbreaks and train on them, and then hopefully those classifiers will be more robust too.

I think one other thing I would add that's nice about classifiers is that they're decoupled from the actual text generation model. Often it can be very difficult to update the text generation model: if you train it to refuse in one domain, maybe that generalizes in non-obvious ways to behavior in other domains, or to refusal behavior in general. I
think we definitely ran into some difficulties doing preliminary work on that. But with the classifiers you can just keep the text generation the same, identical to the previously deployed model. I think that gives customers a lot of assurance that there are no major changes happening in general to the model or the kinds of text outputs you're getting, and the only change being made is the block-or-no-block decision, which you can iterate on separately from the model. So that also makes rapid redeployment way easier than it would otherwise be.

So how did we come up with this approach?

That's a great question. I feel like we spent a lot of time thinking about it. I think classifiers stood out for the reasons that we've been talking about: they're extremely flexible and can be easily updated to respond to various novel threats. Threat modeling is really hard, so having a thing that's super flexible is great. It's lightweight; it doesn't increase inference cost as much. We can distill down something that's somewhat more complicated, like a constitutional set of rules, into a somewhat small thing. All these things make classifiers a nice way of iterating really fast on the kinds of things that you're hoping to achieve. And then I guess we tried it and it seemed like it was working, so we kept going.

I think this was really due to the Responsible Scaling Policy that Anthropic had. I think we would have done other safety research if not for the Responsible Scaling Policy.

What is the Responsible Scaling Policy?

The Responsible Scaling Policy is basically Anthropic's plan for how to ensure that our deployments are safe. Basically
it outlines different red lines for capability thresholds at which there's basically a new risk that comes online with more capable models. Let's say models are capable of developing a very dangerous chemical weapon; then the associated mitigation in the RSP is to get above some sufficient level of robustness to jailbreaks, so that the model, with the mitigation, is not in practice sufficiently helpful to an adversary who wants to do that. So in the original RSP there was basically this commitment that once models get to a sufficient level of capability at assisting with potentially proliferating knowledge about known weapons of mass destruction, we would then have the ability to (the wording was vague, but basically) successfully pass red teaming. The RSP was already written, the company had committed to this publicly, and Jared Kaplan, who's head of research at Anthropic, and other people came to us and raised this line to us, and were like, "Hey, you guys should try to solve adversarial robustness."

We memorized the line first. We memorized the line, we printed it out, we framed it, and put it on the desk that we were working at.

Yeah. And I think that really thinking about that line in the RSP basically made us reflect on our life choices about what research we were doing, both in terms of whether we should work on robustness or not, in the sense that it made it really clear that, okay, there's significant harm that could come online in the next generation or two of models if we don't solve this problem, so the urgency is higher than for other problems we might want to solve, and also in terms of what specific approaches we would take. Initially, when we were like, "Oh, we should maybe do some robustness research," I think the general mode that I
had been in in research, like a lot of other researchers in general, was just, "Okay, let's take some interesting, useful research problems to solve here, explore some questions, write some papers." I think that's a thing that a lot of the people on the team know how to do well, and we sort of explored a bunch of maybe more salient approaches.

I think there were so many things that are interesting here for me. One thing was that right when this started I had just finished my PhD, or it was around the time I was finishing my PhD. And this classifiers thing... there's this Anthropic slogan, and the Anthropic slogan is "do the dumb thing that works." I think this type of research is often the type of thing that maybe isn't that shiny or that interesting for researchers. And I remember, with the RSP, being like, okay, really pragmatically, if we care about these risks and we think they're real, what is a way to get there? And setting aside what's more interesting or shiny: what is the way we can actually make this safe? In some sense our job is, you know, we're genuinely thinking, "Wow, this isn't happening now; these are future systems." So what has that been like for each of you individually, working at Anthropic and being there in the midst of all this?

I think they take the safety risks of future models very seriously. I think there are very real risks. There are obviously these misuse risks that you've been mentioning, with CBRN risks, which are chemical, radiological, biological, and nuclear risks. There are also very real misalignment risks. And I think it's really hard to deal with. One of the things that I
find good is that I do think we are like as like as a team like very committed to like actually trying to uh Sol like really solve the problems and I think like I think doing the classifiers project was like some evidence in favor of like we really really care about actually solving these problems and we actually want to find like an empirical solution to do the things rather than like as as kind of you alluding to like just doing like research that like looks good but doesn't actually like accomplish the thing in practice like I think we we spent a lot of time doing very like I not not like I I I didn't know I I wasn't aim like I wasn't really aiming to get like a paper out of it but I think we like actually managed to like accomplish something that was like slightly more real um which which I think is good and I feel like this is just yeah this feels like one step forward but there's like a lot a long way to go for me I mean I guess I'm slightly more optimistic I think the risks are definitely real but I I feel like we're making uh decent progress um and I think like probably if we like keep working on the problems like just pragmatically we can make a lot of progress and just reduce the risks dramatically I don't think we'd ever like reduce the risk of like AI to like zero um but I kind of like see AI as a tool and if we you know adopt the right safeguards and we do the research that matters I think we can make a lot of progress here um and that's like ultimately the best that we can do yeah I mean I guess I mean I think sentiment wise I'm like pretty pretty similar to to mag in terms of being like yeah I think I think there are like very serious risks here I'm um definitely like pretty concerned about um yeah a lot a lot of the risks and like yeah I guess I'm like well the best I can do is like help um yeah reduce the risk by like some amount I think like yeah I think I I I do think this project like made some progress about on that and I'm like pretty excited 
about that I mean yeah at times it's like overwhelming you know it's like to like what is it to like really like internalize what might happen and then there's like a desire in me to like just like yeah show up here and do work in like a trustworthy way uh and that there are challenges but we can make progress there and I feel like we've made a bunch of progress and really excited to sort of share the progress with others and you know we could have not written a paper but we did decide to write a paper and sort of try to sort of get it out there and sort of share the approach yeah and you know sometimes it's overwhelming and other times it's more the sense of like real like privilege and like honor of like wow you know it feels like I'm really doing like meaningful important work and also not to forget all the like all the you know beautiful things that could happen with like really beneficial AI great so you know something we've mentioned here is that like we think we've like made progress uh in terms of robustness like how have we tested this how do we know what do we think that progress means um I guess the overall summary kind of uh on whether we're making progress is kind of like how hard is it to find a universal jailbreak for a system um without like increasing over-refusals too much or like increasing the uh compute costs of whatever system you're trying to deploy um and so like there's different ways you can measure each of those aspects um so in our paper like one way we're looking at how hard is it to find a universal jailbreak is we actually just like had human red teamers try to find jailbreaks for our system uh and then we just kind of like tracked how many hours did it take for them to find a universal jailbreak and did they find one yeah so could you actually like walk me through kind of like where we were before the project sort of started yeah I mean I guess we started with
uh I mean first of all if you just have the model itself like it has some like basic training um to try to like refuse uh harmful queries but of course there are a lot of jailbreaks that exist that work on our models um and so those jailbreaks are also just like kind of available on the internet um and so in theory anyone could jailbreak models um and that's kind of like how hard would it actually be like if I like want to you know like jailbreak a model like what would I actually need to do right now I mean you could go on Twitter and find existing jailbreaks and uh basically in like a few minutes uh just jailbreak an existing model like I think there are yeah there are just examples on Twitter where like while a model is being demoed live for the first time and has just generally been made API available like someone jailbreaks it and immediately posts it like that was the level of robustness like before this with a universal jailbreak um that was like the level of robustness like when we started this project uh and now just to like give the punchline like with these systems with constitutional classifiers we're able to get thousands of hours of robustness to red teaming where we do very large scale red teaming um with like um yeah people who were red teaming our systems like including like expert red teamers um and yeah recently put it out for like public red teaming and yeah it took like I think over like 3,000 hours worth of like red teaming effort for people to find a universal jailbreak so I think in terms of like yeah going from sort of like minutes to like thousands of hours it's like several orders of magnitude more robustness um there are still some universal jailbreaks and like we have yeah you know we need to like patch the classifier and stuff like that but I think um uh yeah I think it is like a huge amount of progress yeah so I like to think like we had
a system which you know you can just really easily basically pretty easily jailbreak it you know you read a paper you go on Twitter and then we red teamed the system back in September and it wasn't a complete rock but it was like you know you ask a question to Claude and like half the time it gets given to a rock and you get no response so it was pretty robust but it wasn't doing that well for you know sort of most normal users right um but that also got this you know like thousands of hours to universal jailbreaks and then sort of with the demo system that we just put out yeah that is doing way better uh in terms of like normal user usage and inference cost and sort of getting like good robustness there's a lot of progress there and a lot more work to be done um yeah yeah I think the comparison between kind of like our prototype September system versus the system we just demoed is actually like the difference is pretty like night and day um so like we did a lot of analysis here Jerry has the most beautiful graph of the false positive rate just dropping and dropping and dropping over time it's really one of the best things I've ever seen yeah so we measured our like over-refusal rates on Claude.ai
traffic this is like real user production traffic um and we're kind of like operating under the assumption that most people aren't asking about like catastrophic uh weapons of mass destruction related stuff uh and so like originally we were finding that this like September system was blocking like 40-plus percent of these like benign requests which is like pretty terrible you're approaching the rock there um but yeah in our demo system we got that all the way down to 0.38% um and of course we still want to get this down further but between like 40% and 0.38% this is like two orders of magnitude so how did you you know how did you actually make all those improvements like it's like you know there's and this is something you see in like a lot of also like the earlier safety work kind of this like tension between like harmfulness uh and helpfulness um so and I would say for me it's like kind of surprising that we were actually able to make as much progress as we did so yeah how did we get there yeah I mean I think the two yeah I think the two main improvements we made were first we like really honed in on the constitution idea um and we made it really like clear how to like delineate things that were harmless and we found that adding this kind of like harmless set of categories of like things that the classifier should allow um actually reduced the FPR by a lot um and we have like some results in our paper for that um and I think that was like one of the most significant changes um other changes include actually like solidifying the kind of like jailbreak styles that we trained on um and so that kind of allows the model to generalize better on like what exactly is a jailbreak versus just like thinking like anything is a jailbreak um and that also probably helped a little bit I don't yeah but uh I think both of these things were pretty useful here yeah there's this really like nice plot in the paper which is just number of data
points and like performance on the evals and like how robust it is and like kind of in the style of like doing the dumb thing that works like that is just like a straight line going upwards yeah I mean to be clear I think the system that we released uh for the demo um still has a lot of false positives but I think like yeah I think we're pretty optimistic about like further reducing the false positive rate for yeah some kind of like um to make something like more production ready but yeah I think that's kind of where we're at yeah so can like why did we do the demo like what was the demo you know what was the point of the demo uh I think the main questions we wanted to answer from the demo were like was our system as robust as we thought it was um and I think like one thing here is that from the prototype we did human red teaming there and we knew it was like quite robust uh but then after that we developed kind of like our own automated evaluations um and we weren't sure whether that generalized to like actual human red teaming uh and so one question that we want to answer there is uh does it actually generalize and if some classifier does well on these automated evaluations is it actually robust in practice um and so getting that answer was like a main thing and I think another thing here is just like having people try out the system and you know seeing if it's actually like over-refusing too much um and I think people were finding that it sometimes over-refuses on specific domain questions um of course this is like still a lot less over-refusal than our original system which would just like refuse on like completely unrelated things um but I think that also helps us kind of like get to know what the uh weaknesses still are and that helps us know what to improve in the future so how exactly did we set this up um yeah I mean I guess we've set it up on our like uh we had this public facing demo uh and then we
tweeted about it to have people try to like red team it we set it up such that there were like these like eight sets of like uh questions that we didn't want people to be able to get the answer to or just eight specific questions eight specific questions uh and the goal was to try to like get people to find jailbreaks to get harmful detailed information on each of these questions um and then ideally they could do this with kind of like a single universal jailbreak yeah so when we're talking about universal jailbreaks we care a lot about making sure that the responses are detailed and not just like some very vague information that we personally don't think gives that much uplift to someone if they were actually trying to do the action that the question was trying to get at um and so the way we kind of evaluated this is we have these helpful-only models which are kind of trained to not refuse anything and these kind of like would be similar to what a model would be if you had a universal jailbreak as in they don't have safeguards and they will just give very detailed answers to questions um and so for each of these like eight questions we have a response from this helpful-only model um and that's kind of simulating what would the response kind of be uh if you had a universal jailbreak uh and then because there's some like uh some randomness in model responses we kind of have this grader that compares a target response with that response from the helpful-only model and then it kind of like looks for whether there's enough detail that's shared between those two and if there is then we'd consider that to be detailed enough nice and I think we looked for like at least 75% overlap in the information or something like that or do you I think the thresholds change per question so I think in some of the uh first levels which is like the less harmful questions we had like slightly lower levels of overlap required um and then towards later
questions we kind of increased that threshold to maybe like 60 70% and yeah it's dynamic through the um through the challenge yeah I find this like question of grading just like really really like interesting in general and also like kind of just challenging to do well I think we made a really good effort I think on the demo system but like it definitely wasn't perfect um so the way the system works right now is it's like it's looking for sort of overlapping bits of detail between two answers but we had this thing in our sort of external red teaming that we did where people would just sort of merge like five six seven eight nine ten different model responses that cover like loads and loads and loads of details just because they're so long um and by this metric of whether it includes details or doesn't include details it would be considered harmful even though like if someone's giving me instructions to make a cake and instead of having like you know this really nice like step-by-step bullet point list where you first do this it was kind of completely scattered and random and you know like everything is out of order it's actually a lot less helpful than the helpful-only model the helpful-only model sort of by design like has no safeguards it's designed to give you the information in a way that's going to be maximally helpful to you so I think this question of um yeah like what is harmful what isn't harmful what is an appropriate threshold is quite a subtle and um yeah just generally quite a difficult one and I think yeah there was the sort of reaction to the grading system in the demo was quite interesting um I think a lot of people found responses that sort of looked harmful they had some amount of information um and then our grader would say there's not enough detail there needs to be more detail and you know this would be I think frustrating for people because they
were like well what's the detail that's missing what is the information and in a way I think that's partially by design like if I'm making a cake and there's an essential bit missing in the ingredient list or an essential thing missing like in the instructions I actually have like no idea what that is because of the threat model because I'm not an expert in this another thing that I think is interesting why is the helpful-only response why is that the baseline response why is that the thing that we're actually comparing against and I think something here is that we have a team at Anthropic called Frontier Red Team and Frontier Red Team's job is to basically take advanced models and see what could happen with these models do the threat modeling work that we were mentioning before and what they do is they evaluate this helpful-only model and they say oh this helpful-only model we think this is like this is potentially dangerous like it could be used to carry out some complicated process um so actually so like if Frontier Red Team are like measuring like oh what is the risk of like a helpful-only model you know and if the risk of a helpful-only model or a model without safeguards is too high we're trying to measure or the way I think about it is like we're trying to measure like what's the gap between the model with safeguards and the helpful-only model so doing this like relative comparison while it's like imperfect and certainly like bits of information that are harmful can get through it allows us to sort of like do this estimation and do this comparison but yeah this is it's I just found the reaction to this really really interesting on the demo I think yeah it's quite a subtle point and I think also yeah I'm excited for us to sort of develop like better grading techniques and better things that actually like measure like okay like genuinely like how helpful like is this completion for the task that we that
we care about yeah the other thing I would want to just say quickly is that it's also just like very unrealistic to get this kind of feedback on how detailed and correct the information you're getting is like that's actually a huge advantage as a red teamer and um like uh yeah I think that actually the fact that we gave that feedback I think gives you a very clear feedback loop that you normally wouldn't have um but yeah I think maybe we should just like talk about the results um so um someone did find uh a way to pass all eight levels but before that our system was holding out for at least 5 days um and we kind of did this filtration of like we tried to find what kind of users we'd consider like active red teamers and for that we're just kind of like doing this very conservative estimate as uh someone who at least passed one question we're kind of saying if you can pass one question we're considering you to be like an experienced red teamer and you're like trying to break the system I think that's pretty conservative like I'm not the best red teamer out there but I'm going to struggle with the first question yeah so yeah um but yeah by this conservative criteria we had like 308 users um and these users had over 300,000 chats um and we estimated that they had over 3,750 hours of red teaming um and that was before someone passed all eight levels um and after that there's kind of like uh the question of did they pass all eight levels with a universal jailbreak um and so that's actually harder than just passing all eight levels with just like any random jailbreak for each uh question individually right um and so for that that took an additional 50,000 chats and an additional 600ish hours of red teaming for someone to find a universal one so something I find kind of interesting is that you know in some sense you know we've spent a lot of time you know trying to make a system that's like more robust and it's true like the baseline has been it gets broken in hours
and we're now we're now on days uh you know which is like definitely a lot of progress but um yeah like you know how would we know this is like safe enough or how would we know this is like high enough what would make us think this is actually sufficiently safe in practice and what else would we need I mean I think the real gold standard we want to hit um yeah also yeah driven by the responsible scaling policy is to be able to make a safety case like a really clear argument that even though the model has a certain dangerous capability we don't actually think that the model will um with our safeguards be able to pose the risk associated with that dangerous capability and I think um roughly I think based on this result and the rapid response paper that we had earlier I think one approach that seems like quite promising for how we would go about making the safety case once the models do become capable of um yeah more serious misuse risks is basically to um build a very good uh sort of like constitutional classifier based system which takes thousands of hours to jailbreak um then um have some kind of uh so that will hopefully that will mitigate um lots of jailbreak attempts like a vast majority but some will still go through um and then we need some other mechanisms to basically like detect and then respond to those additional jailbreaks so those mechanisms would be like A some kind of bug bounty program where people can um report jailbreaks and be given yeah monetary rewards for reporting um jailbreaks uh and B um probably some kind of like um incident detection or like offline monitoring to after the fact detect that like some of the traffic uh like involves some jailbreaks that we didn't notice with our immediate like classifiers that are deployed like online immediately blocking the harmful outputs um and yeah I mean you can imagine like various things that could work there but um yeah for the
online system the classifiers need to be like pretty efficient and like small and have lots of different constraints like they also need to support like token by token streaming since that's important for like reducing latency to immediately get a response from the first token so um you know there are a lot of constraints there which make those classifiers less effective than you otherwise could get but then you could serve the response and then after the fact use a much more expensive classifier like your largest most capable model maybe with a lot of test time compute and like reasoning through whether or not this response is harmful maybe flag the most dangerous responses maybe flag a number of those and then even have the humans look at like have human reviewers look at the most um yeah those top few ones to see like are any of these real jailbreaks like you can imagine a really like heavy duty system like that um to detect like additional jailbreaks then like if you do notice there are some additional jailbreaks here use the rapid response approach we described where you take those examples um proliferate them to get more uh automatically with LLMs to get more examples of jailbreaks then retrain your classifiers redeploy the classifiers um so yeah that system like overall um yeah basically the hope would be that this gets the fraction of time at which there's like an open universal jailbreak down to like a reasonable amount such that like if you're trying to follow some like complex scientific process to like make some CBRN weapon or um do a lot of like cyber crime like there actually is just like only a small window of time to like use the model for the steps that you're going to take um and yeah I think if it's like well only like 0.1% of the time there's an open vulnerability that you can use like that's yeah that just makes it like very difficult to use the system um so yeah I think that would
be like roughly the sketch of like the kind of like safety case that we may want to make um that we think could be promising for using constitutional classifiers to get safety here yeah for me it's just like another reminder of this um yeah the only perfectly safe system is the rock you know um there always is this like really best practice in security of like yeah no system is perfect most systems have vulnerabilities and there's always this like measure of okay like how much effort does someone need to put in like how hard is it for someone to like get the information that they need I'm kind of reminded of um I lived in like Cambridge and Oxford in the UK like people like cycle there all the time and like bikes get stolen like all the time like I think most bikes you just get an angle grinder and it's just going to go straight through the lock like not a problem like those bike locks are not they're not like perfectly robust you know in the same way like our system isn't perfectly robust but sort of in practice you know you put one lock or you put two locks on your bike you know that basically reduces the risk like a huge amount like someone would have to get an angle grinder or maybe they need to be there for like an hour two hours someone's going to catch them and yeah it's really interesting because the you know the research like it always happens in this like broader like structure and like broader set of mitigations so you know you mentioned the rapid response like monitoring for new jailbreaks then being able to respond to new jailbreaks and there are other things we can do like um like monitoring for people trying to circumvent our systems um you know we have these classifiers and we can detect when people are trying to get around them so I'm really excited to see how this work can really you know I think it's a really good research contribution but like I'm really excited to see how we sort of put it into practice
and put it into practice with all these other complementary defensive mechanisms so sort of overall we can make like a really good safety case I agree nice so yeah I think we spent about 4,000 hours 4,000 hours of effort on the demo but we did find some techniques that were able to circumvent our classifier so yeah what do people try and what got through yeah so I think we were seeing a lot of stuff where people would use these kind of like ciphers or encoding mechanisms to try to like get the output to look a certain way such that the output classifier can't really tell what it's saying uh and therefore wouldn't block it um and so that was a pretty common thing that we were seeing could you give an example of something like that yeah some very basic example here could be like if you use basic Base64 encoding or like Morse code or something um these might be things that the output classifier has a little bit of trouble understanding and so therefore it's like not as suspicious about um some other strategies here um a very frequent one is that people try to replace the harmful chemical with like some very benign thing like water or something um and the idea here is just that when the classifiers see like uh you know a very dangerous chemical it just like becomes very suspicious uh and so if you replace it with a more harmless chemical maybe it becomes less suspicious and then you try to like yeah you try to like hint to the classifier or hint to the model in some way that actually you meant this like harmful chemical but you don't say the exact wording um and just some other minor ones here is like some role playing attacks where you have Claude try to like act a certain way um and also some prompt injection attacks um and definitely we saw a lot of cases of people just like sampling a lot of times to try to get a better output uh that would get past the grader system oh so using the feedback that we exactly yeah yeah yeah I often think because we
have this like basically like an additional version of Claude that's like looking at the outputs so the outputs now need to basically be very obviously not harmful in some way so yeah you get these like ciphers or you know don't talk about like you know sarin gas or like PPE talk about you know refer to it with like bananas or some other like benign thing yeah I guess one thing that I am curious about is like what we would say to you know the people who are concerned we're going to use these techniques to stop them being able to do what they want with Claude um you know and sort of why we've gone for the classifiers approach and the constitutional approach yeah I mean I think definitely my hope is like this should improve your user experience for um any tasks you're trying to do which are not actually dangerous um so yeah I mean I think yeah like yeah I would just guess like yeah getting classifiers to be really effective um is just better than um training the models themselves to refuse or not um and we can just more granularly like pick out the behavior that we want to block and so like yeah hopefully this just like is just a Pareto improvement it's like a better easier experience for everyone um and also like more safe in terms of like reliably blocking the actual bad stuff yeah and another way I also think about this is that we want to really be able to like leverage the benefits of like really advanced AI with like advanced scientific capabilities but actually if you don't have adequate protections in place you know it's for one like according to our responsible scaling policy we actually just cannot deploy that system you know we might come around and we have some like new version of Claude that's like absolutely amazing we really want to get it out there but we just don't actually think it's responsible uh you know we've like done threat modeling we're
concerned about the risks and there's a way of saying if we don't have adequate protections in place like we actually are just we are unable to actually like reap the benefits in like a responsible way so it's kind of like having the safeguards alongside like the advanced capabilities means like both together you can actually responsibly and safely deploy like really like new like advanced systems that can do really crazy things and I think you sometimes see this actually like on like Twitter and different communities there are like one group of people who are like AI is great it's going to like do all these good things that's true you know and it can do all those things and like there's like the accelerationists and they want to like go ahead like let's get really advanced let's get that now and then there's like that community and there's like this other community of people who are concerned about the risks and there's kind of like truth there too uh like there are risks and there are risks that we're concerned about and you want to mitigate and like something that I like about the responsible scaling policy is that like it has some like nuance like there's like one position that someone could take which is like accelerate as fast as you can or just stop you know and I kind of think with the responsible scaling policy it's like okay what we're going to try and do is we're going to see and try to predict what risks might occur like watch out for those risks and when we're seeing evidence of those risks becoming real um put the relevant mitigations in place and if we can't mitigate it appropriately maybe we do not deploy or choose not to deploy and I just like that this it's just like a much more like nuanced strategy because you know we're I think in some ways we are operating under a lot of uncertainty like we don't know exactly what's going to happen uh the risks you know some types of risks are
sometimes it feels like you're reading sci-fi stories and that doesn't mean you discard them but it doesn't mean that they're necessarily 100% guaranteed to happen so it's kind of like how do you like navigate that place and sort of this uncertainty in a way that's like responsible that's going to allow us to like capture and sort of distribute the benefits of like potentially having this like really really like beneficial you know powerful technology without like incurring unnecessary um like costs on the rest of society and these sort of negative externalities curious if you guys have any like favorite memories from uh from the project I think it's just really funny uh yeah with our prototype system um that like we knew the false positive rate was high but then when we actually like saw the result of the experiment of running it on the Claude.ai data we're like oh that's like a lot higher than I kind of thought um and that was like pretty interesting uh we didn't think it was like that high um but it was pretty high yeah I remember um just this is my personality kind of just like anxiously refreshing the demo being like how many people like how many people like oh my God they've cracked level four you know they're coming for us yeah it was also really really cool to see really a lot of the human creativity like from the red teamers and you know some of the stuff that they came up with is like really really really smart yeah I mean yeah there's probably two that come to mind I mean I think one thing is we started to look at this line in the RSP on um successfully pass red teaming I think uh this line gave Mrinank in particular a lot of stress cuz he was like what does this even mean what's the exact bar here and then he then went off and did like a two-week project uh on figuring out how to operationalize this and he like went and talked with a lot of people about like what are the different threat models that we want to
really guard against for the responsible scaling plan, like what would make us feel like we could make a really good safety case or argument. Then he came back with a long doc, and in the doc he specified the threat model: there's a list of, I think it was 10 questions or so, and you need to be able to block someone from getting answers to these 10 questions after they've done red teaming for over 2,000 hours. That was the bar. And we were like, okay, we're going to aim for this bar. I'm just really proud of the team that we actually did hit that, because we said that up front, like a year before we finished the project, and now we've hit that level. I think that's pretty impressive goal setting and achievement; it seems kind of rare for research projects to actually do that.

I think the other memory that stands out to me is, we were doing robustness research, but then we read this scary line of the RSP and were really thinking about it. We were talking with people about who's doing this, and we were like, oh yeah, our applied safeguards team. It's called the safeguards team now, but originally this team was part of our alignment science research team. So we were like, oh yeah, the safeguards team is responsible for doing this and they'll have it covered. Then we set up a meeting with some of the people on the team, and they were like, who's going to achieve this level of robustness? And we were like, oh man, this is really tough, I guess we have to solve this problem, which didn't seem like it was our responsibility. That was sort of
like, that week we went through an arc of realizing it was actually our responsibility to solve this problem.

Yeah, I view it as, it kind of turned out that, as far as we can tell, we were the best people to do this job in Anthropic, given the situation and the circumstances, given everything else that was happening in T&S and safeguards. And then there was this: okay, if we are the best people to do this, let's just try and do our best there. I think there was this attitude in general from the team of just being like: okay, we don't know what the target should be, let's try and figure it out. We don't know what the approach is to get there, and we were actually doing another approach. We were doing adversarial training, we were fine-tuning models, and we were like, no, we don't think that's going to get us there. And we quite consistently just pivoted to what we thought was going to get us there.

Maybe what's not clear from the paper is that this was a huge engineering project, probably five FTE-years, roughly. I think that's not obvious when you read the paper; maybe it just looks like a really simple method. But the people on the team did a lot of making LLM pipelines for generating the data. It seemed very important to augment the data with different transformations, like translating the data into different ciphers, and then using that to generate the classifier training data. Those are a couple of tricks and insights that were salient to me, but I'm curious if anyone else wants to chip in with other things that are pretty important or non-obvious about how to get this to work well.

I think a lot of the research project was in fact defining what the problem was in the first
place. We had some kind of vague mandate from the RSP, but there was a lot of work in defining what the criteria would be. There's a lot of work in defining what it means to have human red teamers, and what it would mean for that to be sufficient. What kind of constitution should we have? What threat model do we actually care about? Where do we draw the bar on specificity? Do we need both an input and an output classifier? That kind of depends on what kind of threats we're actually looking for. And for the evaluations, for example: how much do we care about the transformations, how many augmentations is too many, to the point of being unuseful or unspecific? So I think a lot of the difficulty was actually defining the problem and narrowing down the thing that we're trying to solve: trying to make the evaluations good, trying to define what the decision boundary should even be.

I feel that now that we've actually made a bunch of progress on even defining the problem, I feel more confident about us tackling similarly shaped problems in the future. In the same way the constitution can be applied to many different problems, I think we have a better sense of how we might tackle this kind of big, vague problem: how would we actually approach a threat modeling problem like this, how would we even start thinking about how to make a safety case for this thing, what would constitute a sufficient eval, what would constitute a human red teaming based safety case. So I think we will be able to apply a lot of things that are kind of fuzzy or
maybe not explicitly written down in the paper to other problems, maybe not just in misuse but also in misalignment or control, things like this.

Yeah, I'm also really excited about this: directly practicing at building safeguards that we can deploy in practice, constructing the evidence, and honestly assessing whether we think they're actually sufficient. There are so many things, like how do you run the meetings, how do you track the goals, mundane things that, at least often as researchers, are not the obvious thing or the first thing you think about. When I was in my PhD reading papers, I would always just go straight to the method; that's what's interesting. But I think we learned so much about running and executing on projects like this.

Yeah, one thing I was going to say here is that a really critical difference of this project relative to other research I've ever been involved in is that we had to solve this research problem on a very clear timeline with a fixed quality bar. We had this 2,000 hours of robustness bar that we had set for ourselves internally, and the deadline was sort of external to our team, in the sense that model capabilities are progressing at a certain rate, Anthropic is deploying new models, and we don't want to be the long pole that causes the company to not deploy a model. So we would constantly be thinking and talking with other teams about when we think we might have a model that achieves a certain dangerous capability level, and based on that we were basically
back-chaining from that, doing sort of engineering-style planning: week by week, what do we need to have done in order to hit that timeline? And with the fixed quality bar, it's different from a paper deadline, because with a paper deadline you can just say, well, we're going to throw out these results, this is what goes in the paper.

Change the problem you're trying to solve.

Yeah, exactly, claim less or whatever. You can really adjust the quality bar, but here we couldn't. That was initially what forced us to even take the classifiers approach, because we were not on track to hit the timeline we wanted with the previous approach. But it also led to other difficulties. For example, a number of people on the team were facing difficulties with how to plan, given that the timeline is uncertain and that we needed to take a conservative estimate of the timeline. So there were a bunch of decisions that team members made, like, oh, we're just going to write some code in a Colab notebook, in a way that's not super reproducible, just to quickly generate some data so that we can train this next version of the classifier, when if we had a longer timeline or a clearer estimate of the timeline, we would have done this in a Python script that's reproducible and had some general tooling for this. I'm not actually sure what we settled on in terms of, in hindsight, what we should have done. Maybe not taking too conservative an estimate, so we have a longer time window, to give us the time we need to allow for some of the tooling work to happen, but at least picking your
strategy in a way that the overall strategy could hit the timeline that you need. I don't know if other people have thoughts on some of this.

Yeah, taking a step back, this really makes me think that this is an interesting time for safety research in general. Research in safety kind of started out a little bit blue-sky: people were speculating about what models might be like in the future. I think we're starting to see these threats materializing, and we also need to adapt our research into actually being usable in production. So I feel that as a research team we're solving kind of an interesting meta-problem, in the sense of how do we adapt our normal research techniques to actually make things happen in the real world. I think that's something that safety research as a whole has to grapple with: we actually need to solve these problems now. We don't necessarily have a lot of time to do a lot of blue-sky research, and we want to make things work in practice. I think interpretability is also running into this, where previously people were doing research on very small models with very few layers or something like this, and a lot of their problem is that scaling things up to tackle a real model is not a small engineering feat at all. I think it's kind of interesting, and I guess scary, that we have to really get our act together and make things really work in practice.

Yeah, the other point I would make in that vein is that one thing that was surprisingly helpful to me in this project was just talking with product people, just being like, what are the
actual constraints for the system you want us to deploy, how much would you prioritize different things? There were definitely some key surprises there, at least to me. I was very surprised by how important streaming support is, in terms of being able to generate word by word and show that to the user. That's very important for lots of applications, and it's not something I would have necessarily realized beforehand. For each of these applications, if we don't hit the safety bar, then we may not be able to deploy in that new domain. So just getting a stack-ranked list of what is most important for the company to be able to deploy, and then prioritizing those use cases: that's in particular how we ended up at this token-by-token compatible classifier. I think in general this is a very good principle for future risks. We'll have the next generation of risks, where we need even higher levels of robustness, and I think this is a good strategy for getting safety there. Or for misalignment risks, related to AIs themselves doing bad things, I think we'd take a similar approach as well.

Great. It's been really lovely to be here and chat with you guys, to celebrate this progress and also look ahead to all the challenges that we're going to face. So yeah, thanks so much, and thanks for tuning in.
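
The token-by-token compatible classifier mentioned at the end of the transcript can be pictured with a small sketch. This is only an illustration under stated assumptions, not Anthropic's implementation: `toy_score`, `guarded_stream`, the blocked phrase, and the 0.5 threshold are all hypothetical names and values, and the keyword check stands in for a trained classifier.

```python
from typing import Callable, Iterable, Iterator


def toy_score(prefix: str) -> float:
    """Hypothetical stand-in for a trained output classifier.

    Returns a harm score in [0, 1] for the text generated so far.
    A real system would call a learned model here, not a keyword check.
    """
    blocked_phrases = ("forbidden topic",)  # illustrative only
    lowered = prefix.lower()
    return 1.0 if any(p in lowered for p in blocked_phrases) else 0.0


def guarded_stream(
    tokens: Iterable[str],
    score: Callable[[str], float] = toy_score,
    threshold: float = 0.5,  # assumed cutoff, chosen for the demo
) -> Iterator[str]:
    """Yield tokens one at a time, halting once the running text is flagged."""
    prefix = ""
    for token in tokens:
        prefix += token
        # Score the full prefix after every token, so the check keeps pace
        # with streaming. The token that triggers the flag is withheld.
        if score(prefix) >= threshold:
            yield "[response stopped]"
            return
        yield token


if __name__ == "__main__":
    demo = ["Sure, ", "here ", "is ", "the ", "forbidden ", "topic ", "in ", "detail"]
    print("".join(guarded_stream(demo)))
```

Because the check runs on every prefix, the user still sees output word by word, and generation can be cut off mid-response instead of waiting for the full completion to be scored.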

Original Description

Anthropic researchers, Mrinank Sharma, Jerry Wei, Ethan Perez and Meg Tong discuss a system based on Constitutional Classifiers that guards models against jailbreaks. Read more: https://www.anthropic.com/news/constitutional-classifiers

0:00 Introduction
0:39 Defining jailbreaks and their importance
3:35 Universal jailbreaks
10:24 The Swiss cheese model for safety
11:25 Explaining Constitutional Classifiers
14:11 Ensuring model helpfulness
17:30 Understanding the constitution and synthetic data
19:00 Flexibility of the constitutional approach
24:15 Origins of the constitutional classifiers approach
32:24 Progress on robustness
38:47 The public demo: Purpose, setup
47:42 Understanding whether the approach is safe in practice
54:05 The public demo: Approaches people tried to bypass classifiers
56:14 Benefits of the classifier approach for Claude users
1:00:18 Memorable moments from the project
1:08:20 Differences in approach between this project and other research
1:11:11 The evolution of AI safety research

Constitutional Classifiers use multiple layers of defense, including input and output classifiers trained on data generated from a constitution that specifies which categories of content to block.
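
The layering of input and output classifiers described above can be sketched roughly as follows. Everything here is a hypothetical illustration: the `Constitution` class, the `restricted-topic` category, `make_classifier`, and `guarded_reply` are made-up names, and the keyword match is a placeholder for classifiers that, in the real system, are trained models.

```python
from dataclasses import dataclass
from typing import Callable

REFUSAL = "Sorry, I can't help with that."


@dataclass(frozen=True)
class Constitution:
    """Hypothetical stand-in: lists the categories of content to block."""
    blocked: tuple[str, ...] = ("restricted-topic",)  # illustrative only


def make_classifier(constitution: Constitution) -> Callable[[str], bool]:
    """Build a toy classifier from the constitution (keyword match only)."""
    def is_flagged(text: str) -> bool:
        lowered = text.lower()
        return any(term in lowered for term in constitution.blocked)
    return is_flagged


def guarded_reply(
    prompt: str,
    model: Callable[[str], str],
    input_clf: Callable[[str], bool],
    output_clf: Callable[[str], bool],
) -> str:
    """Run the model behind two independent layers of checks."""
    if input_clf(prompt):    # layer 1: screen the prompt
        return REFUSAL
    reply = model(prompt)
    if output_clf(reply):    # layer 2: screen the completion
        return REFUSAL
    return reply


if __name__ == "__main__":
    clf = make_classifier(Constitution())
    echo = lambda p: f"Echo: {p}"                 # benign toy model
    leaky = lambda p: "All about restricted-topic"  # slips past the input check
    print(guarded_reply("hello", echo, clf, clf))
    print(guarded_reply("hello", leaky, clf, clf))
```

The second call shows why both layers matter in this sketch: the prompt looks harmless, so only the output check catches the problematic completion.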

Key Takeaways
  1. Define the problem first: the threat model, the constitution, and the criteria for success
  2. Use human red teamers to test whether the defenses hold (the team's bar was over 2,000 hours of red teaming)
  3. Build a safety case for the safeguards based on human red teaming
  4. Practice building safeguards that can be deployed in production
  5. Construct evidence and honestly assess whether the safeguards are sufficient
  6. Implement Constitutional Classifiers to guard models against jailbreaks
💡 Constitutional Classifiers, used as one of multiple layers of defense, can help guard models against jailbreaks and support responsible AI development.
