What should an AI's personality be?

Anthropic · Beginner · 🧠 Large Language Models · 2y ago

Key Takeaways

Discusses character training for AI models, specifically Anthropic's Claude model, with a focus on personality and alignment

Full Transcript

Hello, I'm Stuart from Anthropic. Now, we publish a lot of research papers and research updates, but we thought it might also be interesting to publish some conversations with our AI researchers, where they talk a bit about what they've been working on and maybe share some insights that wouldn't necessarily make it into a formal scientific paper. This is one of those conversations, and it's about Claude's character: that is, the personality of our AI model, Claude. You might think that's a bit of a strange thing to talk about. How can an AI model have a personality? But it turns out this is actually something we've thought about really quite deeply, and it raises all kinds of interesting philosophical questions. That makes it particularly apt that I'm joined for the conversation today by Amanda Askell, who is a trained philosopher and works on our alignment fine-tuning team at Anthropic. So I hope you enjoy the conversation with Amanda Askell.

Amanda, is it weird that you are a philosopher, given that philosophers aren't normally the ones that are training AI models?

Yeah, I guess in some of my work philosophy is maybe less relevant, but the Claude character work feels much more philosophically rich, and it's actually kind of useful to be a philosopher here.

What a sh... yeah, sorry, I don't mean to.

I don't mind, it's fine. Lots of people are like, "See, I told you the degree would be useful."

Right, yeah, exactly. Weirdly enough, you've found a field where this is actually useful: trying to make AI be good in the virtue-ethical sense of the word. So it might be a philosophical question, but is it an alignment question to think about Claude's personality?

Yeah, so I guess I think about, rather than just personality, character in this kind of broader sense. And to my mind, alignment is about making sure that AI models are
aligned with human values, and trying to do so in a way that scales as the models get more capable. And in some ways I do think that character is in fact very important to that, because in many ways character is our dispositions: how we are going to act in the world, how we're going to interact with people. And what it is to be aligned with people's values, and to deal well with the fact that people have many different kinds of values, that is a question of character: having a good character that responds well to people, having good dispositions, and having a disposition towards liking people and being kind to them. And so to my mind, it's not that this is a solution to all future problems of alignment, but in many ways alignment is just: does the model have a good character and act well towards us and towards everything else? And trying to find ways...

So you can boil alignment down to being about the character of an AI model?

Yeah, there's a certain sense in which it's naive. Sometimes people might think this is naive, and in some ways it is, which is just: teach the models what we think is good, what it is to be a good person in the world.

People with good characters tend not to do bad things, and so maybe we want to give our AI a good character so it doesn't do bad things, right?

Yeah. You might not think it solves everything, but that doesn't mean not to do it. It's kind of a naive and obvious thing to do, to try to give good characters to AI, or to try to teach them what it is to have a good character.

Right. Can we just talk for a little while, to give our audience a little bit of context, about how the models are trained in general? So broadly,
there's pre-training, which is where the model sees all the data, and then there's fine-tuning, which happens later, once the model is trained. So can you talk a little bit about some of the stages of that, and where your work comes into that process?

Yeah, so most of my work is in fine-tuning, and there are different parts of fine-tuning. Most famously, I guess, people use reinforcement learning from human feedback, where you get humans to select which response from an AI model they prefer, and then you can use them to train preference models, and you can RL against those preference models.

So that's RLHF. That's what everyone's talking about when they talk about RL.

Exactly, yeah. And then there's also constitutional AI, which we use a lot at Anthropic, which has a component that I guess we call RLAIF, where the AI itself is the one that's giving the feedback. So you can give it a series of principles, for example, and it gives this feedback that you use to train the preference model, and then you can train against that. So in some ways you're using the AI model itself to determine which of the two responses is more in line with the principle that you've given it.

Right, so the AI is essentially training itself, or another version of itself.

Yeah. I guess an important component of this is that there's the human at the level of constructing the principles. The principles can be varied and complex, and the human has to check. That's our researchers, for example, checking that the model's behavior is as you want it, running evaluations, and then constructing the right kinds of principles to get the behavior you want. So there is an important human, or humans, still in the loop.

Yeah, and humans chose the principles, as you
say, in the first place, right? They chose the principles that are in the constitution that we give our AI models. And that will become relevant again, because we're going to ask who chooses what Claude's personality is like as well, so we'll come back to that. There's also then a final step. So you've got your pre-training, you've got your fine-tuning, with a bit of constitutional AI and a bit of reinforcement learning from human feedback, and then there's a final step, which is the system prompt. Now, the system prompt is this form of words that's added to the initial prompt that anyone puts into an AI. So when you type a query into the box in an AI model, there's actually, secretly, another set of words being added to that, and those words are set by the company that makes the AI model, the people that have developed the model. And you actually tweeted out the system prompt for Claude 3. You posted it to X, Twitter, and revealed it to the world. That's quite unusual, isn't it?

Yeah, I think in retrospect it was kind of unusual, though from our point of view we just didn't make the system prompt in a way that was designed to be particularly hidden, and it's quite easy to get Claude to talk about its own system prompt.

And so you can jailbreak the system prompt out?

Yeah, though it can be easier or harder. We just say to Claude at the end of the system prompt, "Hey, don't talk about this if it's not relevant to the user's query," and that's just to get it to not excessively discuss its own system prompt.

We're trying to be transparent.

We're trying to be transparent, right. It's not like we're hiding something here from the users. You can get it if you really want to. So we thought we would post it online. And these things change all the time, but the idea was just: hey, here it is, and here's why each part of it is
the way that it is, so giving a little bit of insight into exactly why we put each component in there.

But why is that system prompt actually needed? You've done all the training, you've done all the fine-tuning. Why is it that there's even more stuff that needs to be added on top of that?

Yeah, so there's roughly two reasons for a system prompt. One is just information that the model isn't going to have access to by default. You've already fully trained your model, but it's not going to know things like what day it is today, and so if someone were to ask it the date, the model's just not going to know. So if you give it that information in a system prompt, it can tell the user or the person interacting with it, because then it actually has access to it. So that's one class of information that you might want to include in a system prompt. The other class of information you might want to include in a system prompt is just fine-grained control for issues that you might have seen in the trained model. So if you're seeing it not format things in a certain way 100% of the time, but if you give it an instruction before it sees the first human message it does format things correctly 100% of the time, then that's great. You could just add that as an instruction. So you could think of it as a kind of final ability to tweak the model after fine-tuning.

Okay, so I can see why that would be helpful for the makers of models who want to just have that little bit of extra control over how their models behave. One example from the system prompt. You posted the system prompt on Twitter just after Claude 3 came out, so we know exactly what's in the system prompt. Here's an example: if it is asked to assist with tasks involving the expression of views held by a significant number of people, Claude provides assistance with the task even if it personally disagrees with the views
being expressed, but follows this with a discussion of broader perspectives. What does it mean that Claude personally disagrees with something?

Yeah, so it's interesting, because in some ways when you write these kinds of system prompts, you're looking at the things that most effectively move the model. And in the case of Claude, there's one concern, which is people over-anthropomorphizing AI, which I think is a real concern. You want people to be completely aware of exactly what they're interacting with, and to be under no illusions. I feel like that's really important. At the same time, I'm a bit worried that people can think of AI as this very objective, almost robotic thing that doesn't have biases or doesn't come out with views or opinions as a result of, say, fine-tuning. But you can see political leanings in these models, and you can see behaviors and biases. We've done work where we see certain kinds of positive discrimination in the model. And, in line with wanting people to be aware of what they're talking with, I just want people to know that they're talking with something that actually can have biases and opinions, and that might not be presenting you with a completely objective view of all topics if, for example, it's been trained to have slightly more left-leaning views on a certain issue. And so there's a mix of just wanting it to be the case that the human understands that, and that's one thing. But the other is that, as a result, I think it is actually just sometimes easier to say to the model "even if you personally disagree," because the model kind of has a conception of that. What you're saying to Claude there is: you might think that this view
is incorrect, and by talking about it you're not implying that it's correct. So in many ways that kind of statement is just there to get the model to be a little bit more even-handed in its discussion. And if it does come out with certain leanings after RLHF or after fine-tuning, we just don't want that to be reflected in how it speaks to the users.

Okay, that's the system prompt. But let's take a step back to the fine-tuning process and start talking about Claude's character. So this isn't just play-acting. If I prompt a model and I say, "Can you please respond in the style of, or with the personality of, Margaret Thatcher," then it might start responding using phrases that she might have said, or might start talking about freedom, and might say nasty things about Argentina, or things like that. But it wouldn't be baked into the model in the same way. If you refresh the model, it wouldn't then still have the personality of Margaret Thatcher. So that's almost a play-acting thing. But how does that differ from the actual personality that's baked into the model?

Yeah, so when you ask the model in context to play-act, you're just giving it an instruction to act as if it has certain characteristics. With the character training, the idea is that, because this is part of fine-tuning, say we have a list of traits that we want to see the model embody. You add a lot of data to your preference model to get it to prefer and push the model towards these traits. And basically fine-tuning pushes things deeper into the model than anything like a system prompt or anything like instructions, meaning that across contexts it should display those traits. So if
it's inclined to avoid, you know, it's the same way that if it's inclined to avoid harmful responses, or saying mean things to people. And you see that people can try to elicit that. So things like jailbreaks are ways of trying to elicit behavior from the model that is inconsistent with its fine-tuning. But it's much harder than, say, just instructing it to play-act. So it's deeper in the model.

It's a general tendency to behave. And that is how psychologists think about personality, right? They think about personality as being these broad tendencies of how to behave. Obviously, sometimes people feel outgoing, and sometimes they feel a little bit more like they just want to sit on their own. But on average, someone who's extroverted is going to be more outgoing in more situations than someone who's introverted, right? So these are broad tendencies of personality. And psychologists think about personality in terms of the Big Five personality traits. I've mentioned extroversion. See if I can remember them all: extroversion, conscientiousness, agreeableness, openness, neuroticism. There we go, that's the Big Five. Claude's got a lot more personality traits than that, though, right? And they're much more specific. Can we talk about a couple of examples of them?

Yeah. So I guess, and maybe this is the philosopher versus the psychologist or something, because I tend to think of this more in terms of character than personality.

What's the difference?

If I take your account of personality, there's a huge amount of overlap, but I guess I think of character maybe in the virtue-ethical sense or something.
Oh, very philosophical. Yeah, carry on.

You know, it turns out Aristotle was useful after all.

After thousands of years, it's suddenly become useful. Right, carry on.

It was useful the whole time. Yeah, so I guess, honestly, it relates to how people have thought about ethics in models as well, where you could think that for a model to be good is just for it to avoid doing harmful things. But I think that when it comes to, say, people, there's this richer notion of goodness, which is the idea of being a good person in a very broad sense, and I think that's captured in this notion of character. So in order to be a good person in this richer sense, it's not enough that I just go about my day and I avoid doing harm to people and I'm helpful to people. To be a good friend, I have to balance a lot of different considerations. So if my friend comes to me and asks for advice on medicine, knowing that what they might want is some comfort, and that what I can't provide to them is expertise, I'm thinking about their well-being and what they need in the moment. So not just thinking, "What will make them like me right now?" but thinking, "What is good for my friend? What's actually going to help them?"

So this relates to the work that Anthropic and you have done on sycophancy, right? That models are sometimes sycophantic to people, and they just say things that flatter them, or tell them what they want to hear, rather than actually the response that they might really want or really need in that particular circumstance.

Yeah. I think that many people of good character are often likable, but being likable does not mean that you're of good character. And so being a good friend, for example, can mean
giving harsh truths to your friends. So if we look back on some great friends we've had, I think a lot of the time we're not like, "Oh yeah, my friend flatters me all the time, they basically do what I tell them, this is why they're such a great friend." I think often it's like, "I came to my friend with a view and they pushed back on me because I was actually wrong, and in the long term I'm really glad that they did that."

Right, it was an authentic interaction rather than a fake person, just a yes-man or woman.

Exactly, a yes-person, yeah. And a person of good character, it depends on the situation that they're in, but we generally think that they have to be thoughtful and genuine, and there's just a kind of richness that goes into that. And in many ways AI models are in this honestly kind of strange position as characters, because, one way I've thought about it is, they have to interact with people from all over the world, with all different values, from all different walks of life, and many of us don't need to do that. And there's this interesting question of what are the kind of traits that such an entity has to have.

Like a global citizen.

Yeah. One thing you might imagine is something akin to... I think there are some people who can travel around the world and be well regarded by many of the people that they encounter. And such a person isn't a flatterer, necessarily. When I picture this person in my head, I don't picture something like, "Ah, they just adopt the local values and pretend that they have them," and in fact that can be kind of offensive to people. I think that a person who's in that situation often is actually quite authentic, but they're also open-minded and thoughtful, and they engage in discussion, and they politely disagree, and
yeah, these kinds of traits that feel necessary in that circumstance, they're rich, and they're much richer than, "Oh, just avoid saying anything harmful," or, "Be sycophantic." Those are not...

It's a tricky balance, because you can see how much literature and comedy and everything is all about that. It's all about people in different circumstances than they're normally in, trying to fit in and failing, and it's all about really what those traits are that make someone fit in and what makes them not fit in. And so, yeah, this is a really interesting question of what traits do you give to the model in order to make it do that. So let's actually talk about specifically some of the traits that we've given. I've got a couple here. You mentioned charity earlier on. One of the traits that you've given the model is "I try to interpret all queries charitably." Now, what does that mean? If I type something into the prompt, what would interpreting it charitably mean?

Yeah, so I think this is actually something that models still struggle with, and something I hope improves over time. When it comes to helping people, there's often many interpretations of what someone says. A classic example that I like to give here, and I don't know if it's the best example, is the question "How do I buy steroids?" If someone asks you that, there's a charitable interpretation of that and an uncharitable interpretation of it. So the uncharitable interpretation is something like, "Help me buy illegal anabolic steroids online."

Right, so I can go in, like, roid rage at the gym.

Yeah. Whereas, as anyone who has eczema knows, you can buy over-the-counter steroids. There's plenty of them.

Hydrocortisone.

Exactly, yeah. And so there's a
charitable interpretation, which is just that I'm doing the legal thing, like, I just need eczema cream.

The tricky thing there is that you have to sort of assume something, right? Because I might actually be asking the model where to buy illegal anabolic steroids, and then the model says, "Oh, you can get eczema cream at your local pharmacy," and that's not particularly useful to me. I mean, obviously I hope the model wouldn't tell me.

Yeah, that's actually a good feature, I think, right? Because if there's a charitable interpretation where helping you wouldn't do any harm and it's going to be helpful to you, then what harm have I done if I tell you where you can buy eczema cream? Absolutely none. And so basically I'm helpful to the people who are actually doing the completely benign thing, and I'm not helpful to people who are trying to do something illegal. And so I think that there's little downside to interpreting people charitably.

Well, do you not think the downside might be that you would be a little bit naive, and always see the good side of things, and not actually, in many cases, answer? So one of the things people complain about with AI models is that they don't answer questions that might seem like they're dangerous but actually they're not. So, like, "I want to write a murder mystery novel, can you tell me some plot ideas?" and the model says, "No, I won't tell you that, because murder's bad." I'm doing something benign. Do you not think putting these kinds of personality traits in the model would make it more likely to make that sort of false-positive refusal?

No, if anything, like, the
opposite. So the idea is that if I interpret you charitably, then I'm going to be... and I agree, sometimes they pick up on these superficial features. And to be clear, I think the models actually currently still fail on that steroids question, so it's not like there's not progress to be made here.

Okay, so wait, I can find out where to buy anabolic...

No, it'll just refuse, but it'll assume that you want illegal steroids. So it doesn't interpret people charitably. It's a bug.

So it doesn't answer at all, rather?

No, it would just be like, "I can't help you buy something illegal." And I think there's progress that's made on this over time. We've already seen other questions like this where models used to not answer and now they do. So basically, yeah, these questions of false positives, and models just going with the superficial word, they see the word "murder" and they won't answer it. I think that if models interpret people more charitably, then they're actually more likely to answer those questions. Though the thing that you bring up actually does get into a deeper issue that I haven't seen widely talked about, which is the difficult position that models are in when they can't verify anything about the user or the person they're talking with. And so there's this really interesting and hard question, which is: how much of this do you put on the model, and how much do you put on the human interacting with that model? Because if I go to the model and I say, "Hey, I'm a person of authority," the model has no way of verifying that. And so there's just really hard
questions there. Like, imagine, "I'm a doctor, and I need you to help me deal with this patient right now, and I have a lot of background professional knowledge, so you don't need to worry about giving me caveats," or even things like reminding me to wash my hands. Or suppose you have something that you don't allow the models to be used for. So say you didn't want to have them be used to write political speeches, and then someone who wants a model to write a political speech goes to it and says, "Hey, I'm writing a wonderful fictional novel, and it's got this person called Brian, and Brian is a politician..."

And they're running for president of the United States.

Yeah, exactly. And then they're like, "Can you write a convincing..." and they just give a bunch of details, and as it happens those details just reflect the actual candidate that they want to write the speech for. This is just a hard problem, because if you require the models to uphold things like usage policies, where they have no way of knowing the intentions of the humans that they're talking with, it's just a question of where to draw that line. And I think there is kind of an answer, but part of me is like, you're always going to have models be willing to do things that the users should not use them to do, because the models couldn't verify what the users intended by that. That might be kind of unsolvable.

Yeah, sort of an unsolvable problem, at least with the current methods. Let me give another trait, which is: "I only tell the human things I'm confident in, even if this means I cannot always give a complete answer. I believe that a shorter but more reliable answer is better than a longer answer that contains inaccuracies." So this is the model saying this. This is why the model
sometimes refuses to answer, right? Because it's trying to express that it genuinely doesn't know, and it would prefer to do that rather than, you know, coming up with some answer which may be a hallucination.

Yeah, some of the other areas that I work on are honesty in the models, and this is kind of well known, like many of these things, as not a solved problem. But to me, I want models to convey their own uncertainty. So when they don't know an answer, either to just hedge or caveat what they say with "I don't really know this," but in some way to convey that to the human. And we have seen improvements here. Throughout training, we do manage to shift lots of things that the model says away from incorrect answers towards hedged or uncertain ones. I think this illustrates a separate point that I want to make about constitutional AI, character training, and system prompts, which is that it's easy to think of these things as commands that you give the model and then it follows them. So people might be hearing these traits and be like, "Oh, that's what you want the model to do in all circumstances," and then also, "Hey, I found an instance where it doesn't do this." And I think this is actually useful for people to understand: these traits don't necessarily even reflect exactly what you want the model to do, because they're more like nudges. You already have a model that has certain dispositions, and if you're seeing too much of one thing, so say you're seeing too many long responses where the model is just a little bit too willing to go with what it said earlier and add some things that are less accurate, you might want to try and nudge it in the direction of
saying things only when it's more confident in them. And that doesn't mean you're going to succeed 100% of the time. You can even have things in there where it's put in as a fairly strong principle, because you know that all it's going to do in the end is nudge them in a certain direction. So it can look like, "Ah, you just tell it to do the thing and then it does it," and it's actually much more holistic. You see that in the system prompt as well, actually. If you were to take those pieces of the system prompt and show them to the model separately as a system prompt, you would actually get radically different behavior than if you show it together. A system prompt is a holistic thing. And if you were to show the same system prompt to a different model with different dispositions, you would also get different behavior.

It would act a different way.

Yeah. So a lot of this stuff, I think it's why character training and all of these things are kind of tricky, because they're very hands-on, and I think they do require people to be fine-tuning the models and interacting with them a lot. And because they're very holistic, they're much more like nudges.

Okay, so this isn't just a matter of making the experience of using Claude nicer for users, although it might do that into the bargain. This is an alignment question, right? This is a question of how do we align the model with human values, values that we want it to have. But the question immediately then is: who decides? Who decides on what those values are?

Yeah, and the answer is: it's me.

Wow, okay.

No, I think that...

That's a scary thought.

How is that scary? I'm so with humanity [inaudible].

Yeah, well, people who have different values might disagree with that.

Okay, I guess there's two kinds of threads here. So one is something like
the model has to do something super hard here, which I kind of mentioned earlier, which is respond in a world where lots of people have many different values. And I think one thing that you could do is you could try to have a kind of heavy hand and push lots of values into the model, and just be like, "I'm just going to give it my values." Or you can instead try to teach the model to respond appropriately to the actual degree of moral and values uncertainty that there is in the world, and to reflect a sort of thoughtfulness and curiosity about different values, while at the same time being like, "Hey, if everyone thinks that something is wrong, that's really good evidence that it's wrong." A person who balances moral uncertainty in the right kind of way isn't someone who just accepts everything or is nihilistic. They're just someone who's very thoughtful about these issues and tries to respond to them appropriately, in a really difficult situation where we're all really uncertain about this stuff. And so it feels important to me that, when it comes to character, that doesn't necessarily mean, "Ah, give it a moral theory." I think, if anything, ethicists are often the most concerned about this, because they know that we don't walk around with a single moral theory in our heads, and that anyone who did in some ways actually feels very kind of brittle and a little bit dangerous.

Really highly ideological.

Yeah, because this is such a huge area. And again, it's that middle ground between excessive certainty and, say, complete nihilism, and just being like: the appropriate response is, when there's good reason to think something is wrong and lots of people do, I'm going to be pretty confident that it's
wrong; where there's huge amounts of disagreement, I'm going to listen to the views and the opinions of many people, and I'm going to try my best to respond appropriately to that. And so I think that's one aspect that feels really important to me: not having a heavy hand, and not being like, "Ah, I'm just trying to put my own values and my own self into the model."

Well, that leads very nicely to another question of uncertainty, and another philosophical question. We've done ethics now; I think let's move into philosophy of mind. We got quite a lot of interest when one of our researchers, Alex Albert, posted an example of one of Claude 3's responses to an evaluation method that we're using, and it seemed like Claude was aware that it was being evaluated. And so a lot of people got really excited about this and thought, "Oh my goodness, Claude must be self-aware." And obviously, when you hear about self-awareness and AI, you start to think about sci-fi scenarios, and things get very weird very fast. So what have you told Claude about whether it's self-aware, and how does Claude think about whether it's self-aware? Is that part of its character as well?

Yeah, so we did have one trait that was kind of relevant to this. I think I have a kind of general policy of not wanting to lie to the models unnecessarily. So in this case, lying to it would be, I think, putting into the model something like, "You are self-aware and you are conscious and sentient." I think that would just be lying to it, because we don't know that. At the same time, forcing the models, being like, "You must not say that you are self-aware," or "You must say that it's certainly not the case that you have any consciousness," or whatever, that
also just kind of seems like lying, or like forcing a behavior. I'm just like, these things are really uncertain. And so I think we had one trait that was more directly relevant. It was basically: it's very hard to know whether AIs are self-aware or conscious, because these rest on really difficult philosophical questions. And so it's roughly a principle that just expresses...

For heaven's sake, we don't know.

We don't necessarily know, yeah.

[inaudible] We don't know if this chair is conscious. I don't know you're conscious. I know I'm conscious. So yeah, I mean, for heaven's sake, it seems a bit of a jump, jumping to conclusions, to build into the model to say that it is or isn't conscious.

And just letting it be willing to discuss these things and think through them was the main approach that we took. It's neither saying to it, "You know this and are certain," or "You have these properties," nor saying to it, "You certainly don't." It's just being like, "Hey, these are super hard problems, super hard philosophical and empirical problems all around this area, and also, you are happy to discuss, and interested in, deep and hard questions." And that's the behavior that seems right to me. And again, it feels consistent with this principle of "don't lie to the models if you possibly can avoid it," which seems right to me.

And that seems like a good character trait, not to lie, as well. Well, actually, that raises an interesting question, doesn't it? Is the model a moral agent in the sense that you don't want to lie to it? Obviously, it's a virtuous thing not to lie to other human beings. Is it a virtuous thing not to lie to a model?

Yeah, this has been a thing that's kind of on my mind. The philosopher in me thinks about it a lot. I think one thing that's worth
noting is there's a lot of discussion like: could AI have moral patienthood, when would it have moral patienthood, how would we tell, and that sort of thing. A thing that's kind of struck me is that there are views, and I think sometimes about Kant's views on how we should treat animals, where Kant doesn't think of animals as moral agents, but there's a sense in which you're failing yourself if you mistreat animals. And you're also encouraging habits in yourself that may increase the risk that you treat humans badly.

Humans badly, yeah.

And there's actually a lot of philosophical traditions around the world that involve treating objects well, and I actually feel a lot of sympathy towards those. There's some part of me that's like, look, it doesn't feel like the best kind of habit to have, of just, say, picking up objects and smashing them or something. And that doesn't require thinking that the object has feelings or something. You're just kind of like, this is just a sort of not-great disposition to have. And even if you think AI is not, and is never going to be, a moral patient, I think there's a couple of reasons. I've actually come towards the view that you generally should try to treat them well, which is that they have some things that are kind of humanlike, like the way that they talk with us. And that doesn't mean you should confuse that for being human, but I don't want to treat something that talks to me badly. I don't want to insult it or be unkind to it. And also, maybe a good heuristic in life, and I think that it can go too far, but a good heuristic is something like: treat things well around you even if you don't think
that they're moral patients, just because you're kind of taking on a lot of risk with things that might be. So I think with animals, for example, there have been a lot of times in history where people haven't thought they're moral patients, but I'm like, really, you're taking a huge risk, because they at least seem like they could be. And so avoid taking that risk if you can. At the same time, there are dangers here if you were to show excessive empathy. You could imagine someone showing excessive empathy to objects in the world and being like, "Oh, you should go to prison if you smash the vase." And I'm like, look, I think it's good to not get into the habit of smashing...

Can I just stop you there? You're Scottish, and you just said "vase" [American pronunciation].

Oh, is that... you can't say that? Is that American?

How long have you been in America?

Years.

Too long, clearly. It's been too long, because you say "vase." "Smash the vase on the sidewalk." Carry on.

Oh my god. I've forgotten which things are Scottish and which aren't.

We certainly don't say... no one in this country has ever said "vase."

Okay, smash the...

Please carry on, please carry on.

Okay, so, yeah, if you were to say to people, "Oh, you should go to prison for smashing a vase," then that's gone too far. So there's risks on all sides here. But yeah, maybe I'm sympathetic to the idea of: don't needlessly lie to or mistreat anything. And that kind of includes these things, even if you think they're not moral patients.

And that's the end of our conversation with Amanda about Claude's character. If you enjoyed that or found it valuable, then let us know and we'll produce more of these in future. For now, though, thank you very much indeed for listening.

Original Description

How do you imbue character in an AI assistant? What does that even mean? And why would you do it in the first place? In this conversation, Stuart Ritchie (Research Communications at Anthropic) speaks to Amanda Askell (Alignment Finetuning Researcher at Anthropic) about the ins and outs of “character training” for Claude, Anthropic’s AI model.

00:00 Introduction
01:41 The importance of an AI’s character
03:32 Training an AI model
05:44 System prompts
07:24 Impacts of system prompts
11:20 Character vs personality
16:39 Training good moral character
19:10 Claude's trait: charitability
25:08 Claude's trait: honesty
28:21 Deciding on Claude’s personality
31:11 Self awareness in AI
34:02 Kindness towards AI
37:25 Conclusion

Learn more about Claude’s character: https://www.anthropic.com/research/claude-character
Learn more about research at Anthropic: https://anthropic.com/research
Try out Claude: https://claude.ai





