Using Dangerous AI, But Safely?
Can we keep powerful AI under control, using AI?
How to Help: https://aisafety.info/questions/8TJV/...
https://www.aisafety.com/
Sources:
The Paper: https://arxiv.org/abs/2312.06942
Links:
AI Lab Watch: https://ailabwatch.org/
AI control: https://aisafety.info/questions/NM31/What-is-the-AI-control-research-agenda
The case for AI control: https://www.alignmentforum.org/posts/kcKrE9mzEHrdqtDpE/the-case-for-ensuring-that-powerful-ais-are-controlled
Thanks to my wonderful patrons:
https://www.patreon.com/robertskmiles
Steef
Tor Barstad
John Brewer
Timothy Lillicrap
Juan Benet
Jan van Dongen
Kieryn
Scott Worley
Clemens Arbesser
Francisco Tolmasky
David Reid
Cam MacFarlane
Olivier Coutu
CaptObvious
Andy Southgate
Ze Shen Chin
Raf Jakubanis
ikke89
Isaac
Jamie Kawabata
Matt Fallshaw
Victor Santoro
jason rathbun
Erik de Bruijn
Jeroen De Dauw
Ludwig Schubert
Eric James
Owen Campbell-Moore
Studio Esagames
Nathan Metzger
Kan Kireon
Mark Jocas
Laura Olds
Paul Hobbs
Bastiaan Cnossen
Eric Scammell
Alexare
Reslav Hollós
Jérôme Beaulieu
Nathan Fish
Taras Bobrovytsky
Jeremy
Vaskó Richárd
Andrew Harcourt
Tegaki
Andrew Blackledge
Forodriac Origamius
Chris Beacham
Zachary Gidwitz
Art Code Outdoors
Abigail Novick
Edmund Fokschaner
DragonSheep
Richard Newcombe
Joshua Michel
Richard
Scott Fenton
ttw
Sophia Michelle Andren
Alan J. Etchings
James Vera
Stumbleboots
Peter Lillian
Jacob Valero
Grimrukh
Colin Ricardo
DN
Dr Cats
Robert Paul Schwin
Roland G. McIntosh
Sarah Howell
Emiliano Hodges
Maxim Kuzmich
Joanny Raby
Tom Miller
Eran Glicksman
Stanley Sisson
CheeseBerry
Hoyskedotte
Alexey Malafeev
Jeff Starr
Justin
Liviu Macovei
Javier Soto
David Christal
Jam
Just Me
Sebastian Zimmer
Xan Atkinson
Fabian Rosdalen
Andy
Albert Higgins
Alexander230
Clay Upton
Nathan Rogowski
Ryan Lavery
Zev Isert
ab q
Poss
Jesse Hersch
Max Weis
Leonard Fink
Joeyjoejoe1981
Sarunas Budreckis
Andrii Zymohliad
Alexander Litzenberger
Anders Halsnes
Scott Lowe
Mitch Ronco
David Morgan
little Bang
Chad M J