OpenAI Spinning Up in Deep RL Workshop

OpenAI · Beginner · 🛠️ AI Tools & Apps · 7y ago

Skills: LLM Foundations 90% · LLM Engineering 80% · Agent Foundations 70%

Key Takeaways

The OpenAI Spinning Up in Deep RL Workshop covers the basics of reinforcement learning, deep learning, and their applications in areas like robotics and natural language processing. The workshop discusses various techniques, including Q-learning, policy gradients, and domain randomization, and highlights the importance of human feedback in shaping reward functions.

Full Transcript

Hello. Is your la... yeah, yeah, it's connected. Okay. Hey, good to see you, welcome. Yes, this will be fun. Yeah, I saw... I met the scho... yes. We had a system, and then the system went down for maintenance, so we no longer have a system. But yeah, if you stop by the front, then you'll be able to just get a name tag with your name.

Good morning everybody, please take your seats, we're about to get started.

Hello and good morning, everyone. I'm Josh Achiam. I'm a safety researcher here at OpenAI and I'm the main author of Spinning Up in Deep RL. Thank you all so much for being here today at OpenAI's first Spinning Up workshop. For people who are tuning in on the live stream, I'd like to let you know that there is a minor technical difficulty, so we will not be able to broadcast the slides directly from my computer into the live stream video. You'll be seeing the screen through the camera. In the event that that's not enough for you to see it clearly, I just open sourced the repo that has the PDFs for these slides, so please go to github.com/openai/spinningup-workshop and you'll find, in the rl_intro folder, rl_intro.pdf, which will be the presentation that I'm about to give. Hopefully that makes it easier for you to follow along.

Since this is kind of a new thing that we're doing, I'd like to start today by talking about what it is, why we're doing it, and what we hope you get out of being here. Education at OpenAI is this concept that, as part of our mission, we want to make sure that we provide for the public good and that we help foster a global community around AGI, which is the thing that at OpenAI we care the most about and are trying to figure out how to make sure happens in a way that's safe and beneficial for all of humanity. For those of you who aren't already familiar, AGI is artificial general intelligence. The idea is that this is going to be some very powerful AI technology that'll have the ability to change pretty much everything about how we do anything: something that could potentially do most economically valuable work, something that could solve tasks that currently only human intelligence is capable of solving. So we think it's really important that we help people become aware of what AGI is and what the technology that'll likely underlie it is, so that you can think critically about issues that might come up in the future and also, if you're interested, participate, because we really need people to step up and help make sure that this technology is safe, does what we want it to do, and doesn't cause anything harmful or detrimental to the world.

Spinning Up is the first thing that we're launching under this Education at OpenAI initiative, and the goal is to help people acquire technical skills in the research topics that we care about. Spinning Up in Deep RL is a resource that hopefully all of you have seen by now. It contains a number of different pieces, including a short intro to reinforcement learning (so, what is this thing that we're doing so much research about at OpenAI); an essay about how you would go about becoming
a researcher, if you're interested in joining; and a curated list of important papers in the field. This is particularly important because, since this is an emerging field, there isn't really a clear consensus on the best way to learn it, or a textbook that completely illuminates the way from start to finish, and a lot of the important knowledge right now is still in research papers. So if you want to find out the most about this, you have to go digging, and hopefully this helps you figure out where to look. There's also a code repo of key algorithms, because for any of you who have tried hacking in this field before, I'm sure you found that there were a lot of very confusing resources out there: really excellent ones, but nonetheless ones that made non-obvious choices and didn't clearly connect what they were doing to why they were doing it. We hope that the repo that we provide in Spinning Up in Deep RL is part of something to bridge the gap there. And of course, some exercises: if you want to actually try coding something up, there are a few ideas there for what to do to get you familiar with some of the key pieces of math or algorithms, or what kind of bugs you might expect.

So why are we having workshops? In addition to putting these resources online, we think it's going to really help people if we work with you one-on-one, if we can see you face to face and talk with you and have the kind of conversations and share the ideas that just don't come up in the sort of open-loop control thing that happens when we put information on the internet. Today we'd like to have you come away from this with a better sense of what the current capabilities and limitations are in deep RL, and tell you a little bit about what kind of research is out there, so that if you want to go and follow some line of thinking, you know what's been done and what hasn't. And we'd like you to actually try building and running algorithms for deep reinforcement learning for
possibly for the first time, and show you how to be confident in doing that, so that if you want to keep doing it afterwards, you're able to.

All right, so then what is deep reinforcement learning? Why do we need it, why do we care about it? Deep reinforcement learning is the combination of reinforcement learning with deep learning. RL, reinforcement learning, is about solving problems by trial and error, and deep learning is about using these very powerful function approximators called deep neural networks to solve problems. Deep reinforcement learning is just straightforwardly the combination, where we're going to have something that's learning by trial and error, and the thing that's getting learned is a deep neural network that's going to make some kind of decision or evaluate some situation, and use that ultimately to make decisions in some environment that lead to rewards, where reward is just some measure of how good or bad an outcome was.

So when would you want to use RL? RL is useful when, for one, there's a sequential decision-making problem. Two, you don't already know what the right thing to do in that situation is. If you have the optimal behavior, say from having watched human experts enough, and you have just a ton of data on exactly what to do in every situation, then you can use the standard tools of, say, supervised learning to get some machine learning system to duplicate that behavior exactly. But when you don't have access to that, or when you suspect that what appears to be expert human behavior is in fact suboptimal in that situation, you may want to try reinforcement learning instead, because it could discover things that wouldn't otherwise have been known. And you also have to be able to evaluate whether or not a behavior or an outcome was good or bad. This is pretty critical. So RL is good when it's easier to evaluate behaviors than to generate them or to exactly solve for them.

And when would you use deep learning? The typical paradigm
for deep learning is that you want to approximate some very complicated function, a function that usually requires some amount of intelligence. For instance, if a human looks at a picture of a bird and then knows what species of bird that is, that's a thing that you can't really write down a simple mathematical rule to do. If you want to get a machine to do that, you have to teach it from data. Other problems that you would want to do this for typically have inputs or outputs that are very high dimensional, because it's just quite hard to go from an image, or from a video stream, or from an audio stream, to a decision rule without some sort of learning in the middle. And you also typically want to have lots and lots of data, because getting machine learning systems to behave in any reasonable way requires that you give them sufficient examples. There are tons of problems where this is exactly what you have, and in those domains deep learning has been very successful at exceeding whatever was previously the state of the art from any other methods that existed before, and creating things that are now standard consumer products. Things that were magic 10 years ago are completely normal now: the idea that we have super excellent image recognition and facial classification, that you can talk to your phone and it's going to know what you said and it's not just going to come up with some completely random gobbledygook. This is getting better because we're able to leverage this very powerful technology that is deep learning for these problems.

And so deep RL is for when you have some very hard, high-dimensional problem where you can evaluate behaviors and you want to get a machine to learn how to do it, because you can't write down how it should in fact behave. Some very simple examples of this are, say, video games, where you want to go from a computer looking at an image of the screen, so just raw pixels, to a decision rule that scores the most possible
points in the game, or behaves in a way which is cool or interesting or exciting. Or perhaps a really sophisticated strategy game like Go, where really deep thinking, intuition, and creativity are necessary to make progress. You can't write down a simple rule for that, but you can learn it with reinforcement learning. Or perhaps you want to control some complex humanoid, some robot, to run around and do stuff. Or maybe something which is a little less silly, maybe a little more real: maybe you want to get robots in a factory to quickly learn a new task. When the robot uprising happens, it's because of this; we're very sorry for this research. This was trained, by the way, with an algorithm that was developed here at OpenAI called proximal policy optimization. It's one of the algorithms in Spinning Up, and if you haven't had any experience with it, we won't get into it in this lecture today, but at any other point in the afternoon during the hackathon I'm happy to go into detail.

Before we proceed into the RL-specific stuff: this is a crowd with a pretty wide range of backgrounds, so I just want to do a very brief recap of some of the patterns from deep learning. What do you expect when you set up a deep learning problem? What does that look like? What do you have to think about? We typically talk about it in the language of finding a model that is able to give the right outputs for certain inputs. In this case the model is going to be some function of the inputs and parameters, and the parameters are adjustable: we control them, we change them, and we want to change them in a way that's going to make the model behave according to some design specification. The way that we provide the design specification, and get the parameters to satisfy it, is by setting up some kind of loss function. This tells you, in a nutshell, how good the model is at doing the thing that you want it to do: usually some measure of just how close the output from the model is to
the desired output. The critical thing about this loss function is that it has to be differentiable with respect to the parameters in the model. And when you have that set up... oh, and of course there's data as well: you have a bunch of different examples of inputs and outputs, and your loss function reflects how well your model performs across all of them, typically as just some average over per-data-point losses. With this setup, you can then proceed to find the optimal model through gradient descent. The idea is that the gradient is a mathematical object that tells you how much the loss changes in response to a change in the parameters, and then, knowing that, you want to change the parameters in a way which is fruitful, that is, it reduces the loss, it reduces the measure of error.

So what makes deep learning deep? What is the deep part? It's this idea that function composition is at the core of the models that we make and that we consider. Function composition just means that you have a bunch of different parameterized functions, and the outputs of one are the inputs to the next one. You can arrange these in many different topologies; we'll call these architectures for neural networks. The very simplest kind is just one where you have an input layer, and then there is a matrix that multiplies that, and then you maybe add some bias to that vector, and then you pass that through a nonlinear activation function. Typically this is going to squash the outputs from that first linear transformation into something which maybe is in the range from 0 to 1, or 0 to infinity, something relatively simple, but that nonlinearity happens to do a lot of work. Then, when you have successive layers, what it ultimately allows the model to do is represent successively more complex features internally. You might think of the output of each layer as being a new representation of the original input, which has maybe rearranged the information in a way which is easier for some
kind of final decision-making procedure at the end of the network to make the right decision based on.

Aside from that very simple model, there are also substantially more complex ones. The other two diagrams on this slide are for LSTM networks, that's in the lower left, and the Transformer network, that's on the right. An LSTM network is a recurrent neural network. The idea is that this is the kind of network that can accept a time series of inputs and produce a time series of outputs, and internally it has some very complicated mechanisms for making sure that information gets propagated effectively across time steps in a hidden state, so that when you make a decision somewhere in the future you can remember something that you saw in the past, and then you can update the network in a way which is stable and reasonable. The Transformer network is substantially more complicated, and it allows networks to do something called attending over their various inputs. Attention is a concept that we can all kind of relate to: when we look at the world, we don't actually process literally every piece of data that we take in concurrently. We particularly attend to whatever happens to be, say, in the center of our field of view, or whatever we're thinking about at the moment, whatever is most urgent. Attention neural networks are able to basically do that: when they make some decision on the basis of a lot of data, they can select out the most important pieces of the data for making particular kinds of decisions, and that turns out to be very helpful in practice.

A few other things about deep learning, and this is mostly just me checking off some boxes. If you want depth on this, I strongly recommend that you go see the Spinning Up essay, where there are a bunch of links to papers and other resources that will give you detailed information about this. But to check off the boxes, we might talk about regularizers. The idea is that
sometimes optimizing your loss function, picking the model that actually gives the lowest value of your loss function, may not be the best thing to do. You may wind up with a phenomenon called overfitting, where you've made your model behave perfectly with respect to the data that you showed it, but then it does a terrible job when it's given any other data, because it learned a decision rule which was entirely too specific. With regularization, you trade off the loss against something which has nothing to do with performance on the particular task, but just kind of says, "Hey, cool your jets a little bit, don't be so avid about satisfying that objective." And it turns out that regularization actually leads to models that do a better job of generalizing to unseen data.

Then there are also a couple of things that make the optimization process smoother and easier. You might do some kind of normalization technique, where there's some output inside the network where it's good to adjustably rescale it and shift it around, and that's better than just letting the network do whatever it would have done if you didn't do this kind of normalization. It's sort of spooky, and there are some legitimate complaints inside the community about whether or not we really understand why this helps, but it seems to, so it's worth knowing about. Also, you might use a more powerful optimizer than standard gradient descent. This comes up in reinforcement learning as well; actually, many of the things that we've been talking about in these past few slides show up in deep reinforcement learning, which is why I'm bringing them up. Adaptive optimizers do something special in figuring out how to tune the learning rate, the amount by which you change each parameter at each step of updating, in a way which typically leads to faster convergence, so you get to the optimum point a little bit sooner or a little bit more easily. There's also the reparameterization trick, but that's quite
complicated, and so we won't actually talk about it. It's on the slide so that you know where to look. All right, that's all the stuff from deep learning that I wanted to talk about. Now on to reinforcement learning.

First and foremost, we have to talk about how you formulate a reinforcement learning problem. What does that mean? What are the pieces of it, and how do they fit together? We typically use the language of saying that there's an agent that interacts with an environment. The agent is whatever thing is making some kind of decision. The environment is wherever those decisions are happening, and the thing that creates the consequences of those decisions. And there's this loop, where the environment has some state and has some measure of how good it is to be in that state, that's a reward, and the agent gets to observe the state and possibly the reward. It uses the reward for learning; whether or not it observes it is a subtle technical detail. But anyway, the agent gets a state observation and a reward, and then the agent makes some kind of decision about what action to take. It picks the action and executes it in the environment, and then the state of the environment changes. There's a new state of the environment, the agent perceives it, the agent acts, et cetera.

The goal of the agent is to figure out what decisions will maximize the sum total of rewards that it'll ever get. Actually, it's slightly more specific than this, and there are a couple of different formulations that we can choose, and we'll talk about them momentarily, but that's basically it in a nutshell: we want to maximize this sum of rewards that we get. And the agent is going to figure out how to attain that goal through trial and error. You just don't know in advance what the right thing to do is, so you have to just try things, see what happens, see how much reward you get, and then adjust your decisions on the basis of that. So reinforcement learning is about algorithms for doing precisely
that. But before we can talk about the algorithms, we have to introduce a bunch of terminology. For those of you who have done the work of going through the Spinning Up material online, this will probably be quite familiar, and I'm mostly going through it for the benefit of the audience that I expect might watch this in the future as a starting point. So bear with me, I'll try to go through this reasonably quickly, but we have to talk about observations and actions; policies; trajectories; rewards and returns; what the RL optimization problem actually is and how we formalize it; and then value and action-value functions, and also advantage functions. There's a whole lot of stuff that you kind of have to know and unpack in order to really fruitfully progress in reinforcement learning, and these are just those central pieces.

So, observations and actions. A state is something which tells you absolutely everything about the environment. The agent usually doesn't get access to the state; there is usually some stuff that's just hidden from the agent. What the agent perceives is called an observation. If the observation contains all the information in the state, we call the environment fully observed; if it doesn't, we call it partially observed. States, observations, and actions can be continuous or discrete. For all of the problems that we care about in deep RL, the observations are continuous, and the actions might be discrete or continuous.

A policy is a rule for selecting actions. There are a couple of different ways that you can get to this kind of rule; we typically classify them as one of two kinds, stochastic or deterministic. A stochastic policy is a rule for randomly selecting an action on the basis of the most recent observation, or possibly preceding observations as well. A deterministic policy is just a map directly from observation to action, with no randomness involved at all. You may be wondering why it would be useful to have a random policy
at all, because it might seem like randomness is just sort of dangerous. But actually it can be quite helpful: there are some very principled ways of optimizing stochastic policies, and it's a little bit hard to optimize completely deterministic policies. There may also be a matter of robustness, in that having a little bit of randomness can sometimes make you more robust to perturbation than having learned a brittle, specific deterministic policy.

So now, just to give some concrete examples in TensorFlow, because I assume that most of you will probably have met TensorFlow as your first deep learning library, and if not, PyTorch. And for those of you who are stuck with TensorFlow, I'm so sorry, you probably should have picked PyTorch. I know I should have. But here we are. In TensorFlow, for a stochastic policy over discrete actions, we might first set up a placeholder for loading in observations, and then we might set up a multi-layer perceptron network, an MLP network. This is just the most basic kind of feedforward neural network, the thing that I talked about earlier, which is a succession of linear transforms of inputs followed by nonlinear transforms. In this case the linear transforms take you to something of size 64, and there are two of them, and the activation is a tanh activation, so this gets you to a range of minus one to one in a nice smooth way. Then we produce logits based on the output from that piece of the network. Logits are basically something that precedes having probabilities for particular actions: you get the probabilities if you take the softmax of the logits. If that's not a function you're familiar with, I recommend looking it up. It's just something that exponentiates all the logits and then divides by the sum of those exponentiated logits, so it normalizes the distribution to being a probability distribution, where all the entries have to be greater than zero and sum up to one. So we get logits, and then we get
actions by using tf.multinomial to sample something stochastically, assuming that the probabilities are based on the softmax of those logits. You can ignore the squeeze; that's just there for making sure that certain things actually work. And then for a deterministic policy, let's say we have a continuous action case, so we want to output a vector of actions where each entry can be any real-valued number. We will just go from observation, to network, to a final layer, which is just going to be the actions.

All right, so that's policies. Let's talk about a trajectory. A trajectory is a complete sequence of states and actions through the history of an environment. The agent starts in a state, takes an action, then there's a next state, a next action, et cetera. The first state in the environment is sampled from some distribution over starting states, and afterwards, state transitions are going to be either deterministic or stochastic, but there's just some rule in the environment that, given the current state and the current action, whatever action the agent took, picks what the next state is. A trajectory is also sometimes called an episode or a rollout. You'll see this terminology used completely interchangeably, so just be aware that's out there. There's, I'm so sorry, in every new-ish field a lot of terminology confusion, where different people in different areas of academia worked on it for a while and used different terms, and then in the end we're left with just a weird mishmash. Notation too: you're going to see some notation where states and actions are notated by s and a, and then in code you'll see some places where it's x and u, and this is because of the ancient, eternal conflict between the control theorists and the reinforcement learning theorists, and we're just stuck with it now.

So, that aside, let's talk about rewards and returns. A reward function is going to map from states and actions, or states and actions and possibly next states, onto just
some number that tells you good or bad. Positive is good, negative is bad, and the more positive the better. If you're a designer setting up a reinforcement learning problem, you have to pick what that reward function is going to be, so you want to make sure that you incentivize the stuff that you want to have happen and disincentivize the stuff that you don't want to have happen. As a very simple example, suppose that you want a robot to run forward, but you don't want it to waste a ton of energy. Maybe you will give it a reward proportional to its forward velocity, but you'll penalize it proportionally to the action magnitude, so you'll discourage superfluous actions.

The return of a trajectory is going to be some cumulative reward along it. We have two ways of formulating this, and what you're going to find in deep reinforcement learning implementations is that we completely conflate which of the two problems we're trying to solve. The finite-horizon undiscounted sum of rewards works when you have a finite horizon. It doesn't work when you have an infinite horizon, because if you have an infinite sum of things, it might diverge unless you do some kind of discounting. So in this other case, the infinite-horizon discounted sum of rewards, you have a discount factor gamma between zero and one, and that's how you downweight things that happen in the future. This makes sure that this is a reasonably well-defined quantity. But why would it make sense to discount things? You probably would rather someone tell you that they're going to give you $100 today than $100 in 100 years, right? It's just good to get rewards up front.

Then there's the reward-to-go. This is closely related: it's basically just a measure of return starting from a particular time step or state. So the reward-to-go from some point in time is just the sum of rewards that'll happen after that point in time.

And now we can talk about the
reinforcement learning problem, just formally. We're going to set up a performance measure for a particular policy π, J(π), which is the expected value of return, for whichever formulation we picked, according to a distribution over trajectories in the environment based on the choice of policy. What that means is that, again, start states come from a starting distribution, transitions in the environment are based on something in the environment, that transition distribution P, and actions will come from the policy conditioned on the observations of the states. And we want to find the optimal policy π*, which maximizes this.

Now we have to talk about value functions. Value functions are measures of how much reward you expect to get from a particular state or state-action pair, assuming that you're going to behave a certain way. We have the on-policy value function and action-value function, V^π and Q^π, which respectively tell you how good it is to be in a particular state, and how good it is to be in a particular state-action pair, assuming that forever after being in those places you act according to the policy π. And then there's also V* and Q*: same thing, except if you were to act according to the optimal policy. It's great to know Q*, as we'll talk about momentarily. Value and action-value functions are connected: the value is just the expected action value, taking the expectation over what action you might take according to the current policy. And the advantage function tells you how much better a given action is than average; it's just the difference between Q and V.

These value functions satisfy recursive Bellman equations. These are super important, and they're foundational. I remember when I first met reinforcement learning, I was just so turned around and lost by these. The notion that there was going to be this recursive equation, where the definition of a thing depended on itself, was quite confusing, but it's worth just
hitting your head on for a while until it makes sense. What it's saying is that the value of being in a particular place is going to be as good as whatever reward you get for being in that place, plus all the rewards that you'll ever get for all the places you'll go afterwards.

Now, why is it great to know Q*? Q* tells you, if you're going to act according to the optimal policy forever after you started in this state and took this action (and we don't care what policy this action came from), how well you will do. That means that if you want to do the best you possibly can, all you need to know is what action maximizes Q* in a particular state, and then take that action, because that's going to be the best action in that state, and afterwards you've assumed that you're going to do the best that you can ever possibly do. So if you have Q*, you basically have the optimal policy.

This is going to lead us ultimately to the two different kinds of algorithms in reinforcement learning for control, where in one case we'll try to directly optimize a policy, and in the other case we'll try to find Q*. Now, if we want to find Q*, we have to set up a function approximator for it, Q_θ, which we'll represent by some kind of deep neural network, and we're going to want to measure how good it is at approximating Q*. This is what that recursive Bellman equation is going to be really helpful for, because the beautiful thing is that we don't need to have acted according to the optimal policy to check how well Q_θ fits that Bellman equation. We just need a bunch of examples of state, action, next state, and reward tuples, and if we have enough of those over enough of the environment, then we can probably do a pretty good job of fitting Q_θ based on that Bellman equation, based on maybe this mean squared Bellman error, and then use that afterwards for control, which is having a decision-making rule. By the way, I apologize if anything has been confusing about my using sort of
the terminology of control interchangeably with the terminology of reinforcement learning. When I say control, I mean having the best policy.

So now, what kinds of RL algorithms are out there? Behold, a taxonomy, which is much more restrictive than it looks. It looks very pretty and it looks very definitive, but it's actually masking a lot of subtlety and detailed choices, and the fact that there's actually a lot more bleed-over between these things than you might expect. But at a very high level, this is a useful picture to start with: we have two different kinds of RL algorithms, ones where we have access to a model of the environment and ones where we don't. What that means: a model of the environment is something which tells us, if we're in a given state and we take a particular action, what's going to happen next. The model would predict what the state of the environment will be after that. That's really useful, because if we can forward simulate the environment, that's extremely helpful for evaluating our current policy, and it's extremely helpful for figuring out what a better action would be than the one that we might want to take. If you don't have a model, you're quite limited: you just have to figure out how to do well based on experiences that you've seen, your direct interactions with the environment. You don't get any other information. But if you do have a model, it's potentially quite powerful, although, as we'll discuss, the methods for model-based reinforcement learning are not quite as mature so far as the methods for model-free reinforcement learning.

So now, okay, that last slide was just a ton of acronyms, maybe not that insightful. Let's talk about what these algorithms are doing. There are three key pieces in any reinforcement learning algorithm. For one, you're going to run the policy in the environment: you're going to actually try things and get some signal, error or otherwise. And then you're going to have to reflect and evaluate
whether or not those decisions were good ones, whether or not those actions were the right ones. You have to figure out how good your current policy is so that you can use that information to improve it. So you run the policy, you evaluate the policy, you improve the policy, and there are a bunch of different ways of doing that. We'll go into some depth about how different algorithms go about doing that.

So let's start with policy optimization. Minor interlude: in the chat last night I surveyed people to see what they were interested in. I asked if people were interested in math. There's going to be some math.

So first, at a very high level, zooming out, 10,000-foot view: in policy optimization we're going to run the policy by collecting complete trajectories or snippets of trajectories based on our current stochastic policy, and we're going to explicitly represent that stochastic policy with a neural network that perhaps gives the sufficient statistics of the action distribution, or something else that we can use to derive that and sample from it. Then we're going to evaluate the policy by figuring out the on-policy value function and advantage function, and we're going to evaluate those things for all the states and actions in the trajectories that we sampled. And then we're going to improve the policy by making it more likely that we take the actions that led to higher advantage, and making it less likely that we take the actions that led to lower advantage, less likely that we take the bad actions.

How do we do that? We're going to have to talk about some math now. I realize there's a chance that most of you maybe weren't expecting that we would be doing any kind of deep mathematical excursion, but if there's one thing that I want you to take away from today, aside from just being excited about deep RL, it's a realization that there are some limitations to what deep RL can currently do, and that this is not really 100% done as a technology where you can just
apply it to a problem without really thinking about what it's doing under the hood and get a good solution. It's not a black-box technology yet. So if you want to try deep RL on a problem and grapple with getting it to work, you do have to kind of understand what's going on under the hood, and that means taking a look at some of the gory mathematical details, understanding how they connect, and forming an intuition for how those details will shape the failure modes of your algorithm.

So what we'll talk about: we're just going to talk about vanilla policy gradient. We're going to talk about how you derive the policy gradient and a bunch of different equivalent expressions for it, and then we'll get to the pseudocode for the sort of standard version of the vanilla policy gradient, which includes maybe a few more tricks and details than the very most basic vanilla version. Apologies for the choice of words there. But all of this stuff is critical to understanding more advanced policy optimization algorithms like TRPO and PPO. We won't be covering them in these slides, but again, happy to talk about them offline during the hackathon.

So in policy gradient algorithms, what we want to do is find some kind of expression for the gradient of the policy performance with respect to the parameters of the policy, and we want to just directly do gradient ascent on those parameters, so we're going to move the parameters in the direction that increases performance. And is this going to be easy or hard? Well, if we just try putting the gradient onto the policy performance, we run into a problem: all the parameters are down here in the distribution; they're not inside here, where we would like them. If we want to get something that we can actually use, we'll have to do some messy work to bring the gradient inside of an expectation, which we could then form a sample estimate of. So, step one to getting the gradient symbol somewhere helpful: we're going to recognize that this expectation can be
rewritten as an integral going through all of the events in trajectory space, every possible trajectory, of the probability mass or density for that trajectory based on that policy, times the return that you would get for being on that trajectory. And now we can bring the gradient in, because the limits of this integral don't have anything to do with the parameters. Then we apply the log-derivative trick. This is a really helpful mathematical trick that comes up all over the place in deep reinforcement learning. It's basically just this notion that the derivative of the log of something is one over that something times the derivative of that something, and we rearrange it slightly, but it lets us go from the gradient with respect to theta of P to P times the gradient of log P. This is great because now we have an expectation again, an expectation based on trajectories sampled according to the current policy, so if we have that data, we can certainly make an estimate. So the very nice thing here is that after bringing the gradient inside the integral and doing this log-derivative trick, we now have something which is an expectation again, because we're integrating through all possible trajectories of the probability density associated to that trajectory times something which is a function of that trajectory. So this is an expectation, and we can form a sample estimate of it that we can use in a practical algorithm.

But we're not completely finished yet, because we still have to talk about what the gradient of that log probability for a trajectory is: how does that depend on the parameters of the policy? So let's go back to the picture that we had in the beginning. There's a starting state, which is drawn from some distribution based on the environment, and then after that the agent picks an action based on pi-theta, and it has probability pi-theta of a given s for time step zero. Then the environment picks the next state according to whatever
distribution it has over next states given your most recent action and the most recent state. By the way, this is something that I glossed over earlier, slightly more formalism details that you don't quite need to know, but this is called the Markov property: this notion that picking the next state only depends on the most recent thing that happened and doesn't depend on the past before it. That's the Markov property, and you'll find a whole lot of math if you go digging for it, but you don't have to for this, at the very least.

So then what we have is that the probability of the trajectory is going to be just the probability of that first state times the probabilities of each transition and action selection that happens afterwards, so we get that expression up there at the top. And now if we want to take the gradient of its log, we just pretty straightforwardly compute. First, the log of that thing turns that product into a bunch of sums. The gradient goes through the sums, and now all the terms that are based on distributions from the environment have no dependence on the parameters of the policy. The environment doesn't care what the policy is; it's just going to behave in whatever way it does. So those have no dependence on the parameters, those derivatives are zero, and what we're left with is just something which is a sum over time steps of gradients of the log probability of the policy. The beautiful thing is, because we control the policy and we have explicitly represented it as a neural network and we can compute all of its gradients, this is a thing that we can calculate. So now we're at something where we can in fact calculate a sample estimate of this gradient of policy performance and use that as the basis for a gradient ascent algorithm for improving performance. But it's not good enough. We're not done yet.

Yes? The function capital E. So this capital E is an expectation, and if we want to form an estimate for the expectation, so we're not going to compute the
expectation exactly, what we're going to do is see what happens for a bunch of different trajectories that are sampled according to the distribution specified in that expectation, and then we're just going to average them. In the limit, as we have an infinite amount of data, that sample average becomes exactly equal to the expectation.

Yes, absolutely, absolutely you can. So it is a bunch of derivatives of the final output with respect to each one of the parameters, right? Because there are many inputs to this function, and we're going to have a derivative with respect to all of them.

Yes? I'm sorry, can you repeat the question? Yes, can we tie this explicitly to reward. So inside the expectation here we have R of tau. That's the return measure that we've chosen, whichever one we picked, either the infinite-horizon discounted sum of rewards along the trajectory tau or just the finite-horizon undiscounted sum of rewards. So that R of tau is the sum of all the rewards in that particular trajectory, and that's actually why the variance of this is going to be so unnecessarily high. There are going to be a bunch of terms in this sample expression, actually just in that expectation, which have expectation zero. On average they're zero; they don't contribute anything. But we sample them anyway, and the samples will have noise on them, and so we'll just wind up getting the noise; we won't get much signal from them.

So can we eliminate a whole bunch of terms? Yes, we absolutely can. The intuition here is that if I give you a reward in the past and you want to update the action that you just took, really what you care about for figuring out whether or not the action that you just took was good or bad are the consequences of that action. You don't care about what preceded it. That action and what preceded it are almost completely uncorrelated. You're not going to get anything by updating the likelihood of that action based on
an old reward, so that, in expectation, is going to be zero. Knowing that, we can now expand out this return measure. We're going to look at this in the finite-horizon case just for simplicity, but this analysis also extends to the infinite-horizon case. So we now have a sum of grad-log-probs of the policy times the sum of rewards. We're going to pull the sums out of this expression so that we can just look at a policy update at a particular time step times a reward from a different time step, and then, based on that thing that we asserted above, we're going to drop all the terms that are inconsequential; all of those are zero. And so what we're left with, after we take away all the ones where the action time step t is greater than the reward time step t-prime, is this sum over the time steps for the policy times a sum over time steps for rewards that goes over all of the time steps after the corresponding policy time step. And then if we bring that back in, what we're seeing now is that we want to, for each time step, adjust the probability of the action from that time step in proportion to the sum of rewards that came afterwards. Only the consequences of an action will affect its update.

Yes? So it's not that you don't consider past actions. The sum over here in the beginning runs over all time steps, so every action is going to get some update. It's just a matter of which rewards are used in figuring out the update for that action, and it should only be the ones that were consequences of it.

Yes? Well, we do care about the future, right? Because here we have a sum of rewards after a particular time step, all the rewards in the future from that time step. So that expectation is just saying that an action that happens later shouldn't be affected by a reward that happened before it. It should only be affected by the rewards that happen afterwards. In the next slide, actually, we'll see how this expression that we have down here at the bottom connects to the value functions.
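As a rough illustration of the reward-to-go idea described above, here is a minimal Python sketch for the finite-horizon, undiscounted case. The function name `rewards_to_go` and the example reward list are assumptions made for illustration; they are not taken from the slides.

```python
def rewards_to_go(rewards):
    """For each time step t, sum the rewards from step t to the end of the trajectory.

    Finite-horizon, undiscounted case: the action at step t is credited only
    with rewards that came at or after step t, never with earlier ones.
    """
    out = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each entry reuses the sum already computed for t + 1.
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        out[t] = running
    return out


# Hypothetical rewards from a single four-step trajectory.
example_rewards = [1.0, 0.0, 2.0, 3.0]
print(rewards_to_go(example_rewards))  # [6.0, 5.0, 5.0, 3.0]
```

In a policy gradient update, the t-th entry would be the weight on the grad-log-prob of the action taken at step t, in place of the full trajectory return.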
What we currently have is what I'll call the reward-to-go policy gradient, because what we're doing is adjusting the probabilities of actions proportionally to the reward-to-go. What we're going to do now is go from that into an expression that has Q-pi, the on-policy action value for a state-action pair, instead of that reward-to-go. And this works because you can break up the expectation. So first we're going to pull the sum over time steps out of this, and then this expectation over trajectories, this is sort of subtle and maybe a little mathier than we can go into detail on here, but I recommend that you go look on the Spinning Up website. In RL intro part three there's a link separately to a proof about this. But if we think about the average thing that's going to happen over all trajectories, that's going to be equivalent to the average thing that happens over all of the cases of something with the first t time steps of the trajectory, where inside of the expectation we've moved all the stuff that happens in the future. And we were able to move it inside past this one because this only depends on time step t; this doesn't depend on stuff after t. So only this stuff is going to be affected by averaging over the future. And then it turns out that that expression, the average sum of rewards that you get starting from a time, assuming that the state and action for that time step were fixed, is exactly equal to the action value. That's exactly saying how good it is to be in a particular state, take a particular action, and then forever after act according to a particular policy. And now we have this expression for the policy gradient at the bottom. We're most of the way through the math.

OK, but what is a baseline? A baseline is a really important thing, because it's another tool in our arsenal for taking a policy gradient expression and turning it into something which is lower variance, more likely to be useful for producing a good update to the policy, and
it's also the namesake for OpenAI Baselines. Well, let's say one of them; it's a couple of things. But we have an expression here at the top, which I claim is basically true, which is that the policy gradient is the thing that we had before, but from Q we subtract out some function of state, some function b of s-t, and I claim that in expectation it works out exactly the same. And so there's a short proof here for that, which is that if we look at the expectation for that part of it: what happens if you take the expected gradient of the log probability of an action in a state times some function b of s-t? The b doesn't have anything to do with the action, so it's a constant with respect to this expectation, so we pull it out. And then what we're left with is an expectation over actions, which we'll rewrite, and now we have it in the form probability times grad-log-prob. We're going to reverse the log-derivative trick from earlier, so this is now an integral over actions of the gradient of the probability of that action, and we can pull out the gradient. So we're just sort of reversing the procedure from earlier. This thing, this integral over all possible actions of the probabilities of those actions, that's just going to sum up to one. That's just saying the probability distribution is normalized: all of the chances together have to come out to equaling 100% if you sum them up. And the derivative of a constant, since that's a constant, is nothing; a constant has no rate of change, so we get zero. So all of the terms of grad-log-prob times the baseline are zero in expectation, so we're free to add this baseline without changing what the policy gradient is in expectation, but we can pick it in ways that are fruitful and make the estimate better. So the typical thing to do is to pick the baseline to be the value function, and this leads us to kind of our final, sort of ultimate form of the policy gradient: the form with advantage functions.

And why is this good? The advantage
function says how much better an action is than average. Why would you prefer that over just how good the action is? Well, let's say you have two actions: one gets you $100, one gets you $101. You only sample the one that gets you $100. Now when you're trying to update your policy, you can feel really great about that: oh man, 100 is a big number, I feel great, I'm going to double down on that action. You're acting suboptimally. If you had been picking 50/50, on average you would have gotten $100.50, and you would have realized that the advantage of taking the action that you picked is $100 minus $100.50: you lost 50 cents, and should pick the other action. So you prefer to use advantages to figure out which actions to increase the likelihood of, as opposed to just Q values.

All right, summing it up: we have these four different forms of the policy gradient that are all tightly connected. We care about the last one, but to get to the last one we had to go through the pain. But now that we've all gone through that pain together, you're stronger. You can go and implement this, and it'll work, and you'll know why it works, and you'll feel good about that, and if it breaks you can fix it.

All right, so then just to sum up this key concept: we want to push up the probabilities of good actions and push down the probabilities of bad ones. And also, importantly, that expectation requires trajectories sampled from the current policy. This is the concept of being on-policy in reinforcement learning: if you want to update your policy, you have to use data from that policy. You can't use data from some other policy unless you appropriately reweight it, but reweighting data is complicated and really tricky, so it's sort of preferred to not do it unless you are trying to build something new and cool and super sample efficient, and you're willing to spend a lot of time and effort doing research on making sure that it actually works.

But OK, so the policy gradient expression gives us the policy
improvement step. Coming back a bit... oh yeah, sure. The question was, how do we know what the average reward would have been, so that we could figure out how to make the advantage function in the first place? Do we compute it as we go? And actually, that's exactly what the next slide is about, which is how do we do that business of policy evaluation: how do we find an estimate of the advantage function which is actually good and reasonable? If we just have a bunch of data, where do we get the value function that we might use to subtract out as a baseline? And the idea here is that we're going to learn it from data, and typically it's going to be by regression. So this will be a subroutine that you'll find in most policy optimization algorithms, where you're going to have a value function approximator, another neural network, and at each epoch of the policy optimization algorithm you're going to update the value network to try to match the empirical returns that you saw. So for a particular state, the value should be more or less the sum of discounted rewards that you saw after then. And then when you have the value function approximator, you can use that to estimate advantages, and we'll talk a bit about estimating advantages from value function approximators on the next slide.

But first, you may have noticed that I pulled a fast one on you, which is that in all the preceding slides we were dealing with the finite-horizon undiscounted case, and then here, in our optimization problem for learning the value function, I've dropped in discount factors. Why is that? The answer is because everyone does it. This is where there's not a particularly good reason, in my opinion, that this happens, but pretty much every policy optimization algorithm that I'm aware of, every single implementation, uses discounted value functions and advantage functions but then treats the policy optimization part as undiscounted. It creates some bias. It seems to
work. Shrug.

It's perfectly reasonable to do that. It sometimes seems to be helpful to set the discount factor to something a little smaller than one. Keeping it completely undiscounted would be gamma equals one. For whatever reason, with some optimization problems, or some RL problems, it's a little bit harder if you pick gamma equals one than gamma equals 0.95, and I can't say that there's a particularly good reason for this. I would speculate that in the beginning of training, if you pick a very high discount factor, those empirical returns will be very noisy, and if you choose a discount factor less than one, what you're going to do is attenuate some of the noise. You'll bias that sum of rewards so that whatever happens soonest matters most, and if you happen to see a few positive rewards in a row, then you'll latch on to that, whereas maybe because of noise, if you had really paid attention to everything out to infinity, you would have just gotten a bunch of positives and negatives and positives and negatives, and they would have canceled out. I think it's OK to think about it like that.

Yeah? Yes, yes: after a certain point the trajectory just ends. You get to a time step T and then it's over; that's finite horizon. Infinite horizon, you go out to infinity.

All right, so then how do we calculate the advantage function given data from trajectories and a value function approximator? A thing that I want to introduce here is this notion of n-step advantage estimates. What you're going to do is have a thing over on the left side that approximates Q-pi and a thing over on the right side that approximates V-pi. So this thing for Q-pi, remember that that's supposed to be an estimate for how well you'll ever do if you start in a state, take an action, and then act according to the policy forever after. You can just use the empirical return, the reward-to-go from that state, as a sample estimate of the expected value,
which is the Q value. But in an n-step advantage estimate, we're not going to go all the way out to the end of the trajectory in that sample estimate for Q. We're going to go n steps in and then use the value function approximator to assume what's going to happen for the rest of it. This corresponds to a decision about how much bias or variance we find acceptable in this advantage estimator. So if you pick n equals zero, then your advantage estimator would be just the reward plus gamma times the value function approximator for the next time step minus the value function approximator for the current time step. That's going to be very high bias, because whatever is wrong with your value function is then going to be wrong with your advantage function, but it'll be really low variance, because the only thing that's going to have variance to it is the reward and the stochasticity in the next state transition. But if on the other hand you pick n equals infinity, so for the Q approximator you just take the exact sum of rewards that you got in the real trajectory and then at the end you subtract out the value function at s-t, you're going to accept all of the variance that's in the environment. But the nice thing is you don't have bias in forming your policy gradient estimator with this, because in expectation the Q part is going to be exactly Q, and the V part, recall that that was a baseline that we added with a guarantee of no bias in the policy gradient, so in expectation that part falls out and it's fine.

So the bias-variance tradeoff is typically mitigated through what we call generalized advantage estimation. This is a way of interpolating between all of those different possible choices of n-step advantage estimate, where we use a factor called lambda, which is sort of like another discount factor, as the interpolation variable. It's a hyperparameter, and you choose it in each
implementation that you make, and it's generally good to set it somewhere between 0.9 and 0.97. Usually it's set-it-and-forget-it, in my experience. I can't think of very many cases when I saw a substantial difference in algorithm performance from adjusting it. Beyond that kind of narrow range, if you set it equal to one, then you'll get exactly the n equals infinity case, and if you set it to zero, then you'll get exactly the n equals zero case. So it's good to kind of leave it in the range where it's putting a little bit more weight on the real empirical returns than on the biased value estimator, but not all the way to the extreme.

OK, at long last I give you the pseudocode for the full vanilla policy gradient algorithm that incorporates everything that we've talked about so far. What we're going to do is collect a set of trajectories by running the current policy in the environment, and then we'll compute the rewards-to-go so that we can use them as targets for the value function approximator. We'll compute the advantage function estimates with any method of advantage estimation, but typically generalized advantage estimation, and then we're going to use those to estimate the policy gradient. With that we take a step of gradient ascent. We might use an adaptive optimizer like Adam to accelerate the rate at which we learn. And then we're going to do the supervised learning problem of trying to get the value function approximator to match the empirical returns, and that's how we learn our value function. And then we loop. That's vanilla policy gradient.

Yeah? Absolutely. So yes, usually you will pick networks of the same size for policy and value function. In cases where the environment is partially observed, you may want to have a single core recurrent neural network that's going to be able to remember past information, and then give that core neural network separate outputs for policy and value function, and then you'll train that
jointly. And it gets a little bit complicated, because I can't say that there's any good work in RL that I'm aware of that reasons about how it alters performance for the final policy to be simultaneously optimizing with respect to both objectives on the same model. But that's what you would do in that situation. So yes, typically they'll be about the same size, unless they're actually sharing parameters, and then they're sort of the same model.

Yes? Does the choice of initial policy affect convergence? Wonderful question, and sadly, in a lot of cases, yeah. This is part of what goes into my saying that deep reinforcement learning is not a technology that's ready to be used as a black box yet. When we do experiments in deep reinforcement learning, we typically run the same exact experiment with different choices of seed for the random number generators, and what we find is that the seed, which in the beginning of the algorithm only changes the initialization of the policies and value functions, happens to matter quite significantly. Some seeds learn, some seeds don't, some seeds learn much more slowly than others, and there's no particularly good reason for it. We are generally quite heartened when we find an algorithm that appears to be robust to initial conditions and where the spread around the average of the learning curves is quite narrow. We think that's great, and it doesn't quite happen as often as we would hope.

All right, do we have any other questions about policy gradients? So in the bottom right-hand corner there, that says 47 out of 63. I may have slightly miscalibrated how long parts one and two were relative to the initial time slots of 45 minutes and 1 hour respectively. This is by far the longer one. But since we've been at it for an hour, I think this is a good point to take a 15-minute break, and we'll pick back up to discuss Q-learning after coffee. Thank you so much.

[Music]

We will be
resuming with Joshua Achiam's introduction to RL in two minutes.

Hello, hi everyone. We're about to get started for the second part of intro to RL, and just as a heads up, I prepared entirely too many slides for the hour and 45 minutes that I was scheduled to speak. Please bear with that, because this is the first time we're doing this, and so I'm still getting calibrated on what we can get through in that amount of time. But everything that I don't cover by 11 a.m., when I hand over the mic to the next speaker, I'm more than happy to share with you later today during the hackathon. In particular, the material that I expect we won't quite get to will involve an overview of what's been accomplished recently in deep reinforcement learning, where the challenges and limitations are, and what the research horizons look like on those limitations.

But before we do any of that, let's continue our discussion from earlier and talk about the next major family of algorithms for deep RL for control, which is to say Q-learning. There are a lot of algorithms that fall under this umbrella. Deep Q-learning was one of the first algorithms that really made deep reinforcement learning viable and popular. Speaking from personal experience, I had just started my graduate student career in 2014 when I heard about the "Playing Atari with Deep Reinforcement Learning" paper. I was just becoming aware of topics in AI and AI research, and that completely and totally blew my mind. It was the most exciting thing that I had ever seen: that a computer could just figure out, from looking at what was happening on a screen, how to behave, how to play a game, how to do something that I thought required some human spark of understanding and capability for joy. And the computer had it. It was beautiful and amazing, and it made me want to study this and participate in taking this technology all the way from where it was at that point to what it could be in the future.
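To recap the advantage estimation from before the break in concrete terms, here is a minimal Python sketch of generalized advantage estimation. The function name `gae_advantages`, the default values of `gamma` and `lam`, and the convention that `values` carries one extra bootstrap entry for the state after the last step are all assumptions made for illustration, not details taken from the slides.

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimates for one trajectory.

    rewards: r_0 .. r_{T-1}
    values:  V(s_0) .. V(s_T); the last entry is the bootstrap value for the
             state after the final step (0.0 if the episode terminated).
    """
    T = len(rewards)
    advantages = [0.0] * T
    running = 0.0
    for t in reversed(range(T)):
        # One-step estimate: reward plus discounted next value minus current value.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Blend in later one-step estimates, discounted by gamma * lam per step.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


# Hypothetical three-step trajectory that ends in a terminal state.
example_rewards = [1.0, 1.0, 1.0]
example_values = [0.5, 0.5, 0.5, 0.0]
print(gae_advantages(example_rewards, example_values, gamma=1.0, lam=1.0))
# [2.5, 1.5, 0.5]
```

With `lam=0.0` each entry reduces to the one-step estimate (reward plus gamma times the next value minus the current value), and with `lam=1.0` it becomes the discounted reward-to-go minus the value at that state, matching the two extremes described in the talk.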
Anyway, Q-learning. So back to this RL loop that we have: run policy, evaluate policy, improve policy. In Q-learning you run the policy by taking a step in the environment, either randomly, so there's going to be some stochasticity in what you do, or you're going to act in a way which is called greedy with respect to your current Q function approximator. Remember, what you're trying to learn is Q*, the optimal action value function, and if you happen to have Q*, then whatever action maximizes Q-theta in a particular state is the best action to take. But when you don't in fact have Q-theta equals Q*, then the maximizing action probably isn't great, so exploring a little bit by acting randomly is going to help you. And then once you've taken that step in the environment, so you send an action to it and you get back a reward and a next state, you store that transition (state, action, reward, next state) in a replay buffer. You save it for later, because you're going to use it for learning how to evaluate the policy, which is to say updating Q-theta to try to have it fit that Bellman equation. And once you have that, the policy improvement step is just looking into Q-theta and saying, what's the action that maximizes this? Policy improvement is basically implicit in Q-learning.

We're going to structure our discussion about Q-learning around the original deep Q-networks algorithm, but pretty much everything in this discussion is quite general for Q-learning methods, because they all kind of share this common DNA: you take a step in the environment, you take some gradient descent steps on your Q function to minimize a mean squared Bellman error, and you use the techniques that we'll describe in a minute, experience replay and target networks, to stabilize the learning procedure.

So Q-learning updates by bootstrapping. What is that? It's this notion of how we're actually going to fit Q to that Bellman equation. So we talked about
minimizing mean squared Bellman error, and it's a useful picture to start with, and so I'm going to keep using this terminology, although in a few slides I'm going to tell you something completely different and ask you to ignore this and pretend you never heard it. But this is where all the papers start and this is where all the tutorials start, so it's good to familiarize you. What you're going to do to update Q is set up this loss function where you're going to average or sum over data from your replay buffer D, and you're going to have these transitions (state, action, next state, reward), and you're going to regress Q-theta against targets y, where those y's are obtained basically from that Bellman backup, from that Bellman equation, as the reward plus the Q value in the next time step. This is based on the Bellman equation for the optimal action value function, so it's going to have that max over next actions, which is to say that it's going to assume that if Q-theta was optimal, if it was Q*, then whichever action maximized it in that state would be the best one to take, and that would be the best value there. Interestingly, you don't propagate gradients through y, even though y has a dependence on the parameters of Q-theta, and the reasons for this are kind of mathy, so we'll get to them in a bit.

OK, getting this to work. There are two main techniques that I mentioned: there's experience replay and there's target networks. The idea behind experience replay is just that you want to use a really wide distribution of data for training your Q function. You don't want to fit it really well to a very narrow region of transition space, because if you do, it's not going to be good anywhere else, and if it's not good anywhere else, you're not going to be able to bootstrap it to the correct values even in the places where you've been trying to fit it. You'll get nothing which is actually useful for control. So experience replay helps you broaden that data
distribution fit Q well everywhere get something which is good for control Target networks so bootstrapping with function approximators is super super super unstable that thing that we said on the previous slide where the Y's depend exactly on the current thetas actually throw that out can't do that that won't work if you try to do it what's going to happen is typically that Q values will explode they'll go to something really large or really negative and that'll happen really fast you won't be able to control it even with reasonably well tuned learning rates you probably won't be able to stop it so instead what we're we're going to do is we're going to have some Target Network Q Theta targ and we're going to make sure that that Network tracks reasonably closely to Q Theta but there's going to be a lag so that it updates more slowly so that if you make an update to Q Theta which uh pushes a q value too high or a little too low then that doesn't immediately propagate into Q Theta tar and therefore does not propagate into the bootstrap so this this y thing we're going to call this a bootstrap and then this tamps down on instability um granted why if Q learning is so horrifically unstable would we want to do it like this in the first place why wouldn't we just differentiate through with respect to that bootstrap and the answer is it if you differentiate all the way through uh it tends to not work that well um and the reason that this thing does the reason that it works well if you do this kind of bootstrapping approach as long as you take some appropriate precautions has something to do with the theory underlying Q learning and we'll talk about that in a few slides but not quite yet you're spared for now so uh also another note in deep Q networks the particular algorithm that we're talking about right now action space matters a lot so what we did in describing that bootstrap we had a maximization over actions of the Q function if you have a Q function that accepts as 
input a continuous state and a continuous action and feeds that into a deep neural network, trying to figure out the action that maximizes the Q-function output is really hard. That would be a non-trivial optimization problem, an expensive subroutine. So if we want to be able to get that max over actions, that's a case where we won't really be able to do it. So DQN will apply specifically to the discrete action case, where we're able to use a network architecture that, instead of taking a continuous action as an input at the bottom of the network, emits action values for each possible action at the end of the network. So a single observation goes in and then K action values come out, where K is the number of actions, one for each action, and then because there's just a finite number of them, it's very easy to figure out which action maximizes the Q-value: we can compare all of them directly. So now we can talk about the pseudocode for deep Q-learning. This is relatively straightforward based on the stuff that we just described. There's one thing which is a little more specific than what I mentioned, which is this business of epsilon-greedy exploration. So I mentioned before that you're going to explore by sometimes taking a completely random action and sometimes taking the action which is greedy, which maximizes your current Q-function approximator. Epsilon-greedy is a strategy for doing that, where with probability epsilon, where epsilon is going to be something small, you'll pick a completely random action, so uniform random over the K different choices, and with probability 1 minus epsilon, most of the time, you'll pick the action that's greedy with respect to your current Q-function. So that's the run-policy step. And then after you store that transition into the replay buffer and anneal epsilon, because over time you want to explore less and exploit more, you want to rely on the policy as it gets better, after doing that you're now going to
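A minimal sketch of the epsilon-greedy rule and the annealing described here, assuming NumPy. The function names, the linear schedule, and the default values for `start`, `end`, and `decay_steps` are illustrative assumptions, not details given in the talk.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """Pick an action index from a 1-D array of K action values."""
    if rng.random() < epsilon:
        # Explore: uniform random over the K choices.
        return int(rng.integers(len(q_values)))
    # Exploit: greedy with respect to the current Q estimates.
    return int(np.argmax(q_values))

def annealed_epsilon(step, start=1.0, end=0.05, decay_steps=10_000):
    """Linearly anneal epsilon from `start` to `end` (assumed schedule)."""
    frac = min(step / decay_steps, 1.0)
    return start + frac * (end - start)

rng = np.random.default_rng(0)
q = np.array([0.1, 0.7, 0.3])  # K = 3 action values for one observation
action = epsilon_greedy(q, annealed_epsilon(step=5_000), rng)
```

With `epsilon=0.0` the rule is purely greedy and always returns the argmax; with `epsilon=1.0` it is uniformly random.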
evaluate the policy by learning Q-star from the data, by improving Q-theta to be a better reflection of Q-star. So that's exactly the step of gradient descent that we described, which is that you sample some transitions from your replay buffer, from your experience replay memory, and you compute the bootstraps for those transitions. There's a special case for if a transition ended in a terminal state, which is that we don't give it a value after that particular time step. And then we use those y values in our bootstrap Q-value regression, update the parameters, and then every once in a while, with some frequency, we'll copy over the parameters of the main Q-network onto the target network. So that's the target network lagging the Q-network, ensuring stability. And that's deep Q-learning in a nutshell. This algorithm kicked off everything. I mean, there's a whole bunch of stuff that preceded it; you can't really point to any one moment in the history of a field that had no precedent. Before this there was neural fitted Q, before that there was Q-learning with linear function approximation, and there were all kinds of algorithms for trying to get things to work with nonlinear function approximation like deep neural networks, but this was the one that got a lot of people really, really excited. So anyhow: caveat emptor, buyer beware. This can break. This will not work on every problem out of the box. You'll try it in some places and it just won't work. You'll fiddle with hyperparameters and it still won't work. You'll try some tricks to stabilize it, because there are pretty much infinity tricks to make deep Q-learning better at this point, and some of the time that still won't work. So this picture here is from a recent paper which I really love and which I strongly recommend that you take a look at if you get interested in seeing some analysis of failure modes for algorithms in deep RL. It's called Deep Reinforcement Learning and the Deadly
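The evaluation step described here (compute bootstraps from the target network, give terminal transitions no value after the final step, regress, and periodically copy parameters) might be sketched in NumPy as below. The function names, the discount `gamma`, and the `sync_every` period are illustrative assumptions; a real implementation would use an autodiff framework for the gradient step, which is omitted.

```python
import numpy as np

def dqn_targets(rewards, next_q_target, dones, gamma=0.99):
    """Bootstrap targets y = r + gamma * max_a' Q_targ(s', a').

    rewards: shape (B,); next_q_target: shape (B, K), computed by the
    lagging target network; dones: shape (B,) with 1.0 for terminal.
    """
    max_next_q = next_q_target.max(axis=1)
    # Terminal transitions get no value after that time step.
    return rewards + gamma * (1.0 - dones) * max_next_q

def mean_squared_bellman_error(q_taken, targets):
    # `targets` are treated as constants: no gradient flows through them.
    return float(np.mean((q_taken - targets) ** 2))

def maybe_sync(step, q_params, target_params, sync_every=1000):
    """Copy main-network parameters onto the target network every so often."""
    if step % sync_every == 0:
        for name, value in q_params.items():
            target_params[name] = value.copy()
    return target_params

rewards = np.array([1.0, 0.0])
next_q = np.array([[0.5, 2.0], [3.0, 1.0]])
dones = np.array([0.0, 1.0])  # second transition ended the episode
y = dqn_targets(rewards, next_q, dones, gamma=0.9)
loss = mean_squared_bellman_error(np.array([2.0, 0.5]), y)
```

The first target bootstraps from the best next action; the second is just the reward, because the episode ended there.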
Triad the deadly Triad is a set of traits that uh deep reinforcement learning algorithms might have which are known to occasionally cause Divergence and to create substantial obstacles to theoreticians who would like to come up with algorithms that have provable convergence guarantees so the deadly Triad consists of function approximation off policy learning and bootstrapping which are exactly the three things that deep Q learning relies on we have function approximation in the form of neural networks we have off policy learning in the form of experience replay and we have bootstrapping in the form of uh using the target network with a one-step backup as the regression Target for Q and so deep Q Learning Works a whole lot of the time and then some of the time it just doesn't so in this set of experiments what the researchers did was they examined deep Q learning and a few variants of it uh oading on whether they would include a Target Network so here this Q does not have a Target Network the regression Target that it uses is exactly based on Q Theta not Q Theta tar and tried it with a Target Network and then tried a couple of other tricks that relate to uh how you use the target Network to possibly either estimate the value in the bootstrap or select the action in the bootstrap and those are tricks that are known to potentially help they looked at uh at all these different cases for many different Atari games as the experimental uh test bed and they clipped the rewards in the environments into a certain range so that they knew exactly mathematically what the ceiling for possible real Q value would be they chose it to be 100 and they looked and saw over all the experiments that they ran how often did the maximum absolute learned Q value in an experiment exceed the threshold which they knew was the real true maximum possible Q value and the answer was a lot so this shows uh that Q Learning Without Target networks is very unstable and that a lot of the time you will 
get this divergence phenomenon, and even as you include tricks that make it progressively more stable, you'll still expect to see divergence every now and then. So we're going to dive into a little bit of math now to kind of get maybe some intuition for why this is the case, and what deep Q-learning algorithms are really trying to do, and how that translates into the algorithm or doesn't. So we're going to start by taking the operator view of the Bellman equation. The optimal Bellman operator T-star is a map from Q-functions onto other Q-functions, and the value of T-star Q for a particular state-action pair is given by the Bellman equation that we saw before. The optimal Q-function is the fixed point of T-star, so Q-star equals T-star Q-star. That's great. And T-star has this special thing about it, which is that it's a contraction map on the space of Q-functions. Contraction maps have some very special properties that we're going to talk about now. Yay, math. So the main thing about a contraction map is this idea that if you have two points and you apply the contraction map to both of them, they'll basically be closer with respect to some distance function after you've applied that map to both of them than they were before. So expressed mathematically, we have some norm, some distance, the norm of a thing minus the other thing, and the norm of f of x minus f of y is going to be less than or equal to some constant factor beta times the norm of the difference between x and y, that distance between x and y. And when that beta is less than one, then we have a contraction; that's saying it's getting closer together, it's shrinking. Why do we care about contractions? Because they have unique fixed points, and you can get to them by just repeatedly applying the operator to any initial point. This is something called the Banach fixed-point theorem, if you're interested in going on Wikipedia and finding something which is going to be more precise than however I've typed this up. But in
a nutshell, to show you that they have unique, let's forget about uniqueness for a moment, but at the very least that repeatedly applying this operator will get you to a fixed point: if we look at a sequence of points x, and we have a contraction map f with modulus beta, and each point in the sequence is just generated by f of the previous point, and we look at the distance between successive iterates, what we see is that it's shrinking as a function of the iteration number. So in the limit as the iteration number goes to infinity, that distance will shrink to zero. It will converge; repeatedly applying it will get you to the fixed point. T-star is a contraction on Q-functions. So if you could represent the entirety of the Q-function, that is to say the Q-values for every state-action pair in the entirety of the environment, which for all the environments that we care about in deep reinforcement learning you cannot easily do, you can only do this with function approximation, which is to say you're going to generalize: whatever you choose for the value in one state-action pair will have some influence on another; you can't completely separate them when you do function approximation. But putting that aside, if we could represent all the action values for every state-action pair and we applied T-star, the operator, to that function, we would get a new function Q which is closer to optimal than the one that went in, and if we applied it over and over and over again, we would eventually get to Q-star, the fixed point of T-star. This is value iteration. It's a classic algorithm in reinforcement learning. So before function approximation, before deep, when you had environments where there were a discrete number of states and a discrete number of actions, and you could represent the Q-values in a table of elements, one for each state-action pair, you could compute this exactly and use this as a way to get to Q-star. Now, when you live in the problems that we do, when you're trying to solve high
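As a small illustration of tabular value iteration and the contraction property, here is a sketch on a made-up two-state, two-action MDP, assuming NumPy. The MDP, the function names, and the discount `gamma=0.9` are all assumptions chosen for the example; none of them come from the talk.

```python
import numpy as np

def bellman_optimality_backup(Q, P, R, gamma):
    """Apply T* once. P: (S, A, S) transition probabilities,
    R: (S, A) expected rewards, Q: (S, A) action values."""
    V = Q.max(axis=1)             # max over next actions, shape (S,)
    return R + gamma * (P @ V)    # shape (S, A)

def q_value_iteration(P, R, gamma=0.9, iters=200):
    """Repeatedly apply T* from Q = 0, recording the sup-norm gap
    between successive iterates."""
    Q = np.zeros_like(R)
    gaps = []
    for _ in range(iters):
        Q_next = bellman_optimality_backup(Q, P, R, gamma)
        gaps.append(float(np.abs(Q_next - Q).max()))
        Q = Q_next
    return Q, gaps

# Hypothetical MDP: action 0 stays put, action 1 switches state.
# Staying in state 1 pays reward 1; everything else pays 0.
P = np.zeros((2, 2, 2))
P[0, 0, 0] = P[0, 1, 1] = P[1, 0, 1] = P[1, 1, 0] = 1.0
R = np.array([[0.0, 0.0], [1.0, 0.0]])

Q, gaps = q_value_iteration(P, R, gamma=0.9)
```

In this example each gap is at most `gamma` times the previous one, which is the shrinking distance between successive iterates described above, and `Q` approaches the fixed point with Q(1, stay) = 1 / (1 - 0.9) = 10.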
dimensional complex video games, high-dimensional complex strategy games, you can't use the table; you have to use a function approximator. And now your problem is that you can't compute all of T-star Q-i, and even if you could, you probably couldn't find a choice of parameters that would allow you to exactly represent it. So if you want to do this kind of value iteration, you have to do it approximately, and this is roughly what Q-learning algorithms with function approximation try to do, which is that they push the parameters of the network in the direction such that you move Q-theta towards T-star Q-theta. And sometimes this works and sometimes it doesn't, because when you go to this function approximation setting, this operation is not necessarily going to be a contraction on the space of Q-functions. You might have lost that property. If you did, expect divergence; in fact, expect things to blow up horribly. If you preserved it, or if you've done enough tricks to stabilize it, things will work pretty well. In my experience, Q-learning algorithms and their variants tend to be extremely sample efficient when they work, which is quite desirable, and it's very nice that they can recycle off-policy data, because on-policy methods sadly have to throw away tons of it. But last point on Q-learning: what you normally see in deep learning algorithms and deep RL algorithms is that paradigm of there's an objective function and you optimize it and you find the model that optimizes the objective. In Q-learning, don't be misled into believing, however many times you see it, that the mean squared Bellman error is the thing that you're optimizing. You change that function every time you change the target. The thing that you're really doing is this sort of approximate value iteration: you're trying to apply an approximate operator which is going to get you to something better. You're not trying to minimize a loss. That's not to say that there aren't variants of these kinds of algorithms that do involve
well-defined loss functions. There's a whole family of algorithms called gradient temporal difference methods, which, if you are theoretically inclined and willing to go down a deep, deep, deep rabbit hole, I recommend you check out. Talk to me if you want references. Also, in the Spinning Up key papers doc, I believe there's a book in the bonus section for classic RL papers and review papers, Csaba Szepesvári's book on RL algorithms from 2010, which recaps a lot of this really great old stuff, including gradient temporal difference algorithms. So I recommend you check that out if you're interested. Yes? So how would we, like, are there any techniques to preserve the contraction, or is it based on everything else? I'm actually working on some research on that right now. Talk to me offline. Yes? Yes, yes. So this thing, yes, is called a temporal difference error, because it is the difference in the Q-value based on the next time step versus the current time step. Yeah. Yes? Absolutely. What is the difference between off-policy and on-policy? The on-policy algorithms have updates which are based on the expected values of things, where the distribution in that expectation depends on the current policy. So if you want to form a sample estimate of the thing in the update equation, then you first have to run the current policy, collect interactions with the environment on the current policy, and use those samples for forming that sample estimate. That's on-policy, because all the data that you use has to be generated by the policy that you're using at the time. In off-policy methods like Q-learning, what you do when you make an update is you use experience which might have been generated by older policies, not the current one. So the current policy you can think of as being implicitly expressed in the Q-function approximator's current value, but many steps ago it was different, and you got whatever data you got from interacting with the environment, you put that in your
replay buffer and then many steps later you still sample those States and actions from that replay buffer to help you form your your new update to the current Q function so when the data was generated by a different policy that's off policy yes in in what sort of gaming situation would we maybe use deep Q learning or like what's a use case for it um so there's a a fabulous use case actually Facebook recently released a paper on their uh machine learning and RL learning their RL platform called Horizon which they used to train with deep Q learning uh neural networks for making decisions about when to send you push notifications so actually dqn is in your phones right now okay then let's proceed to the next part which is uh model based stuff so I'm going to be pretty brief about model based stuff there's a very wide variety of different model-based algorithms and we're not going to drill down into them the way that we drill down into policy learning and Q learning uh but we will give a relatively brief overview of some of the more Salient points and a few algorithms that I think are particularly interesting so back to the loop uh run policy evaluate policy improve policy where do models fit in so recall that a model of the environment lets you predict what's going to happen next you can use that for pretty much any of these while you're running your policy before you take an action you can stop and imagine what's going to happen if you try many different things you can create partial rollouts that you can use to evaluate your different choices and then you might pick something different than you would have otherwise so that's maybe where it can appear in running in running the policy in evaluating the policy you can use that same kind of approach of just simulating look ahead data to help you get a maybe a more stable backup for your Q function uh or just use some kind of uh Monte Carlo tree search style algorithm where you're going to propagate Q values back and 
figure out like an average case Q value and then for improving the policy you can regress your policy network if you have explicitly represented one towards whatever the outputs were from that look ahead planning process so if you have a model it's very powerful you can use it in a lot of different ways you can embed it pretty deeply in into RL uh the problem is that models are very hard to learn and you usually don't have them so let's say you have just made a wonderful brand new complex physical robot uh unless you have a lot of hours to spare and control theory expertise you probably do not know how to fully characterize that and have a simulator model which is going to be accurate in any reasonable way certainly not accurate enough for training it in simulation and then uh directly applying that simulation trained policy into the real world um you may want to try learning a policy from data but this can be quite tricky although there are some really exceptional success cases but because yes uh yes you could make that argument um so I let's say hardness to learn is not a fun oh I suppose sorry the the question was can you make the same argument for value functions and I would say that uh hardness to learn in this case should be interpreted more as has the research Community figured out really robust reliable standard methods for doing it yet and not necessarily whether there's some intrinsic quality of hardness um finding the correct model is a supervised learning problem if you have enough data part of the problem in RL is that you usually don't have enough data and you would have to get it by interacting with the environment and there may be areas of the environment very critical to decision- making which you've just never observed yet so uh imagine that you are uh in a giant Maze and you can try to learn a model of the the maze as you go but until you've seen the exit your model ises not going to be very helpful for you in navigating except to help you 
perhaps avoid repeating places that you've been to already. But yeah, in practice models tend to be, so far, hard to learn. So let's look at maybe one case study in ways that you can use models. This is the case of planning and/or expert iteration. The basic idea is that you're going to use your model from a current state to look ahead into the future and help guide your decision about what action to take. In planning, you might explicitly just base your decision about what action to take on whatever the output from that look-ahead process is and your current value function. In expert iteration, you're not only going to do that, but then you're also going to have an explicit representation of a policy, which you'll try to improve by regressing it towards the output from the look-ahead process. So as a case study, consider AlphaZero. AlphaZero is an algorithm which has succeeded at achieving superhuman performance in a wide variety of complex two-player fully observed strategy games, particularly chess, Go, and shogi. This was a successor to AlphaGo, the algorithm that beat human grandmasters in Go. And AlphaZero at the algorithm level is sort of beautifully simple. You have a neural network that emits two things: a probability distribution over moves to play, p, and a value network that says basically whether or not you're going to win or lose, v. And you learn this with this just very simple regression approach, where you're going to move the value function to be more like whatever the true outcomes from games were, and you're going to update the policy by using a model-based look-ahead operator to figure out what a better policy would have been based on your current policy and value function, and you're just going to move your current policy towards that. And then there's also some regularization. Very straightforward. And the look-ahead is done with Monte Carlo tree search, so that's just stochastically considering different possible outcomes and then aggregating
data after having done partial roll outs down the game tree um to figure out what would have been the best thing to do so this is one uh model based approach now this required having a perfect model of the environment and in games like chess or go this is feasible because you can fully Express the rules in a way which is easy to compute and forward simulate and you don't have to learn anything from data and you also don't have uh anything which is partially observed so your model doesn't have to do anything fancy to keep track of what's going on in the background very straightforward and this kind of approach can be very very powerful but the problem is that most conditions are not quite as ideal as this uh so another family of approaches is where you're going to uh use the model for policy evaluation so let's say that you have learned a model or perhaps you're given one but more often than not for these algorithms you're trying to learn it concurrently with experience you learn some models and then you're going to have the agent uh quote dream in them the agent will sample a bunch of fictitious trajectories inside of the simulator and use those as the basis for a policy Improvement step and algorithms that are like this uh there's model Ensemble trpo and uh I want to say metap policy optimization or modelbased metap policy optimization then you could also instead of using this for computing advantages and and a policy optimization style Improvement you could use this for Q learning as well where perhaps instead of forming the target based on the bootstrap which might be inaccurate uh on particular regions of State action space that you haven't visited you could use the model to simulate what the bootstrap might be in those cases and use that as your backup for Q learning so that's an approach called modelbased value expansion and these algorithms uh the gain that you get from doing this is ultimately in Sample efficiency so what happens in normal DL is that you 
use tons and tons of data from interacting with the environment to try to improve your policy or your Q function and you make progress at whatever Pace when you use the model and you offload a whole lot of the Improvement steps onto experience collected in the model that frees you up from having to have collected that amount of experience in the real world as long as your model is good enough if your model is not good this won't be very helpful but if it is good and if you only needed a little bit of data to train your model then you can get a lot of mileage out of it and your overall RL algorithm will have used less interactions with the Real Environment than otherwise this is great for cases where uh interacting with the real environment is very expensive so for instance if you want to train something on a physical robot that can be an expensive process the robot might be slow the robot might break the robot might have all kinds of things where it's difficult to get it to do that or it's difficult to reset it you probably don't want to have to spend that many man hours waiting around for the robot to finish its learning procedure so if you can offload some of that time into simulation then it makes life better yes is that what you would apply for cars is that what you would apply for self-driving cars uh that's a good question so I'm not actually all that familiar with uh cases where self-driving cars have fruitfully made use of deepl that's not to say that they don't I just don't know um I would imagine that in self-driving cars it's probably more a matter of collecting data from experienced human experts and then using that data as the basis for learning a behavioral policy um but I'm also happy to you know go through through this later and see what we can find in literature yes would would modelbased RL be more geared towards transfer learning um I think it could potentially be quite helpful so certainly uh when we think about trying to get robotics to 
transfer from say simulation to to reality you know we want to make sure that the model used in simulation is High Fidelity with respect to reality and if that's the case then this model you can think about Sim is sort of a model based approach and uh perhaps it's going to be very helpful all right uh and then there's this other completely orthogonal way of using models which I'm really fond of because it's just sort of weird which is that you actually take the model and embed it inside of a model-free agent where where the model is going to receive inputs from the from the environment and use that with some internal process of perhaps imagining some Futures and then transforming whatever representation it has of those Futures into something which then becomes side information to the model-free agent so you train the model separately from the agent the module that provides some information based on the model to the agent uh is sort of decoupled from it except that however it's going to process however the model free agent will process that information is based purely on the model free learning so this is an approach called imagination augmented agents I think this is really interesting and really neat um I'm not aware of a whole lot of follow-up work from when this came out I want to say last year or the year before but um I just think that because it is so different from the other model based approaches that's cool whenever there's something different it's cool all right that takes me to what was originally intended to be the end of part one but is now the end of both parts thank you so [Applause] much at this point I would like to turn over the mic and the stage to Matias plapper who is a researcher on the robotics team at openai and he'll be presenting on uh the work on the robotics team for learning how to do complex manipulation with deep reinforcement learning on a real physical [Music] robot great thank you do we have a computer yeah sweet for yay I think it 
works okay thank you cool um so hey everybody my name is Matias as uh Josh mentioned I'm super excited to be here and talk a little bit about what robotics that open eyes doing um and the talk that I'm going to present is called called learning dexterity uh as I mentioned this is basically the effort of the entire robotics teams for many months so everything I'm kind of talking about is not not just my work but uh these robotics teams work cool um so let's maybe start with talking a little bit about what robotics at openi is actually trying to do um and the ultimate goal I guess that robotics at openi has is to build some form of general purpose robot um so I think this kind of picture illustrates it well very well we have humanik robots uh today and we know that humans can do a very very large amount of different jobs and skills so can include things like cooking it can include things like actual labor in some form of agricultural thing uh maybe it's very precise kind of things uh like uh surgery or building things and putting things together and this kind of stuff uh and ideally we would like to have a robot that has a simil similar level of dexterity and a similar level of well General purpos if you will um the way robotics looks right now uh is very different from that um so we have these kind of very specialized robots um so an example I think that is good is the Roomba which is on the lower in the upper left corner here uh that can clean your house but it can only clean your house it can only vacuum your house uh and similarly you have things like self-driving cars which to some extent are also robots that are very good at one thing which is uh driving themselves but they cannot do anything else and the robots that are more kind of versatile and more complicated they either very often controlled by humans so an example for that would be doing surgery so we have robots that can assist humans in that but they're always controlled by human uh operator which is a 
surgeon uh or we have uh more complicated robots and factories but those are typically just programmed to basically blindly execute a given trajectory so someone sits with the robot and figures out how to do a certain um process in a factory and the robot is very very stupid and has no idea what's going on so the question of course is how can we kind of step away from that Paradigm and how can we have robots that work in the actual phys physical world and aware of their surroundings um and given that this is the spinning up Workshop that's concerned with RL it's not so surprising that we think RL may be a good approach to that um and we know that RL works really well in certain domains so I've picked out two examples here that probably most people have seen um on the left side we have Alpha gozero playing against Le it all uh a game of Go and um as you know Alpha go zero won this game in fact I think it won almost all games that it has ever played and the follow-up versions of alpha Gozo are Beyond super Beyond human capabilities when it comes to playing go um similarly we have uh Dota 2 so this is some of the work that the DOTA team at openi has been doing for a while uh we have this DOTA bot uh called open I5 that is uh very very good at playing the game Dota 2 which is is a 5v5 multiplayer game and it is approaching uh like professional level so it's it's consistently winning against semi-pros uh and we are already playing against some pros in fact we've done that last summer at uh the international unfortunately we have not yet won against those Pros so the question is uh how does this work in robotics and of course there's like a lot of work in this in robotics uh it's not like we we're the only ones doing this um and I just like to give a bunch of examples that I think are kind of illustrating what people are typically doing today um um the first approach here is somewhat recent it's from 2017 and I think it looks really cool so you can see um the agent is 
even able to use certain tools so in this case a hammer um it can open doors it can do all sorts of things the unfortunate thing here is that all of this looks really cool but it's only in simulation and ultimately in robotics it doesn't really count if it's only in simulation because you want the physical robot to do something otherwise it's not very useful um so the the other approach that people have been taking is to train on the actual robot itself um so this is a some work from 2016 where people have been doing uh dextrous inhand manipulation so the goal of the robot here is to kind of manipulate this this tube filled with coffee beans for some reason uh into a Target orientation um and they do all the learning on the on the actual robot and that of course has the advantage of um not having to do any form of transfer because you learn on the robot you exactly know how the robot is going to work and once you have a good policy you're done the downside of that of course is that well you have to run on the actual robot so it kind of breaks a lot on you uh it's very slow to do uh you can't really scale this up unless you get a lot of robots which is actually something that people are doing um so this is the approach taken by Google in typical Google fashion scale it up uh so just get a lot of robots and let them do it for uh two months in parallel and then you can suddenly train on the robot because well you have 20 of those doing it in parallel and uh it can do very meaningful stuff so in this case they have learned to grasp arbitrary objects out of this kind of box that they have sitting here and this is actually a very impressive demo like this kind of Bin picking stuff is actually very hard um the thing is still that obviously this does not really scale all that well because this is a relatively simple task yet you need 20 robots going for two months and you will also just have to babysit the robot all the time right like you'll have to repair it when it 
breaks, you'll have to reset the environment when certain objects fall out of the bin, and all of this kind of stuff, so it's just a lot of work. What we're trying to do is combine the benefits of those two approaches: training in simulation and then transferring to the physical world, which is called sim-to-real, and I'll be talking a lot more about this. But before I do that, I'd like to introduce you to the task that we actually have in mind when we do our research. We decided to do dexterous in-hand manipulation, and the reason is, first of all, that it is very hard to do, and second of all, that it is something we're interested in, because we know that our hands are these universal end effectors, right? Human hands are very versatile in what they can do. They can be very dexterous: you can cook, or you can operate on a human, if you're a surgeon at least. But you can also do very heavy lifting with them, and you can use tools that are made for human hands, and these kinds of things. So this is basically the motivation for why we chose this kind of hand and this kind of task: because it's hard, and because it's also ultimately useful for the general-purpose robot we would like to build. The reason why it's hard is, I think, summarized relatively well on this slide. We use a hand called the Shadow Dexterous Hand, which is depicted in this picture. It has 24 joints and 20 actuators. What this means is that your policy, at every time step, has to produce an action for 20 individual actuators, and it actually has to coordinate them, right? You'll have to have different joints work together to do certain things. So it's a really high-dimensional control problem that's typically well out of reach of what traditional control methods can solve. As I mentioned, ultimately we want to run this on real hardware, so we have to work with the real hardware and all
its flaws and issues. This includes things like noisy and delayed sensing; that's just a fact of physical hardware systems, right? They will not have perfect information, and they will have delays and certain quirks that you have to deal with. The other issue that comes out of this sensing is that you actually have to handle partial observability. In simulation you have perfect knowledge of everything that's going on, because it's your simulation and you can just read out what the current state is, but on the physical system you can only use what you can actually sense. So obviously certain things, like the friction of the system for instance, cannot be observed directly. And last of all, this is actually super hard to simulate, as it turns out. The reason is that you have a lot of contacts going on: if you have something in your hand, you constantly touch it, and contacts are notoriously hard to model accurately. And the hand itself is also incredibly complicated. It's tendon-actuated, which means that you have tendons pulling, and this causes a lot of effects in your hardware that you have not modeled in simulation.

Cool. So, as I mentioned, we set out to solve this problem with our sim-to-real approach: we train in simulation and then we transfer to the physical hardware. And while this sounds very easy, it is not, because the transfer problem, as you'll see, is actually not very easy to overcome. But before we talk about that, let's have a look at what we can do in simulation and what the policy that we train looks like in simulation. I think this also illustrates the task at hand, so that you can understand later what the robot is trying to do. As you can see, you have this block with colored faces, and the task is to rotate this block into the desired target orientation that
you have, and the target is depicted as this semi-transparent additional block on the right-hand side. So now it's trying to bring up the blue face... yeah, it got it, and then it moves on to the next goal. As you can see, this involves coordinating its fingers; it has to use its palm; it's using gravity to let the block roll. Even in simulation this is not super easy to learn.

The hardware itself looks like this. This is the cage, as we call it, and it houses all sorts of things. In the middle, of course, you have the Shadow Dexterous Hand, which is the robot itself, and it's surrounded by quite a lot of these PhaseSpace tracking cameras. We have 12 of those in total, and what they do is provide you with relatively accurate sensing in Cartesian space. We have LED markers on the hand itself, so we know where the hand is, and we also have LED markers on the object, so we know where the object is. The cameras sense the LEDs, and since multiple cameras can see the same LED marker, they can do triangulation, and you can recover the position in space from that information. We also have an alternative setup because, as I mentioned, ultimately we'd like to have something that's more general, and having a motion capture system is not very real-world-like. So we also have RGB cameras. Those are regular RGB cameras; we have three of them surrounding the scene, and they can also be used for sensing. In fact, they can be used for pose estimation of the object, so you don't even have to have any special kind of sensing on the object itself; the cameras can do it for you. The reason why we have three is, first, so they can recover depth information, and second, so they can work around occlusions: because the object is in the hand, from certain angles you sometimes cannot see the
object, because it's covered by the hand. This is how it looks up close when we run things. As you can see, we have the hand with the block in its palm, and in this case it's the block that we use for PhaseSpace tracking, so you can also see the LEDs on it. This is simply much easier to do when testing these algorithms, so we have these dual setups.

All right, so the big question, of course, is how we do the transfer. I showed you a video of the policy doing its thing in simulation, and I showed you the physical hardware, so we have all the building blocks, but how can we actually transfer it to the physical robot? If you just train it in simulation, it will not work at all; that's the short version. I'll be showing some numbers for that as well, but you can believe me when I say that the transfer problem is really the core issue that we're dealing with here. The approach that we're taking is relatively straightforward, actually. We use two main techniques: the first, of course, is reinforcement learning, to learn the actual control policy, and the second is domain randomization, to make sure that the learned control policy actually transfers to the physical system. I'll be speaking about both of those in a little more detail.

So let's get started with domain randomization. This is a technique that has been used for a little while. A pretty popular paper on this is from 2016. In this paper they learned to fly a drone, and the way they approached this is that they trained it only in simulation, using these randomized buildings. You can see it has a lot of different rooms in it. The textures are very different, so the walls, the ceilings, and the floors look different, and they train a drone to fly in all of those rooms. What they then do is take this drone
that was only ever flying inside of simulation and show that it can actually fly in a completely different, actual room, simply because it has seen all of this variation during its training. From its perspective, what happens is that the policy thinks, oh, this is just another randomization; it's kind of weird, but I know how to handle it. So it flies in the actual room.

People at OpenAI have been using similar approaches as well. This is some work from my colleague Josh Tobin. He has been using domain randomization for grasping. This uses a robot called the Fetch; you'll see a better picture in a moment, but it's basically a simple robot arm with a parallel gripper at the end. What he would like to do is pick up these objects that you see in these randomized scenes, basically using the same approach. He's randomizing all sorts of things, like the look of the objects, the shape of the objects, the background, the color of the table, as you can see, and he can then use this training to transfer to the physical robot, even though it has never seen the actual physical table. What was pretty surprising in this research is that it turns out you don't even need photorealistic rendering. As you can see, this does not look realistic at all; it's pretty shitty computer graphics, and still it transfers to the physical world. So the important thing here is that you have this variety, and not necessarily realistic environments.

Yeah? Yes, the two approaches that I showed are using vision to learn a policy. In this case, I think it's actually not using vision to learn a policy directly; I think it's instead just predicting the location of the object, and then there's a policy that can grasp it from that.

Some other work in this domain, which I think is equally
important, is physics randomization. This has been done by Jason Peng, who was an intern at OpenAI in 2017, and he's basically using the same idea of randomizing, but now for physics instead of visual appearance. It's kind of hard to visualize what's going on, but what the policy sees in training is certain worlds that are just different: maybe they have different masses, maybe the table has different friction, maybe the robot itself behaves differently, and so on and so forth. What he was able to show is that this, again, is sufficient to train strictly in simulation and then transfer to the physical robot. The task at hand here is again with the Fetch robot, and it's trying to move this puck to the goal location, which is marked in red. On the left-hand side you see a policy that has been trained with these physics randomizations, and on the right-hand side one that has been trained without. As you can see, the one on the left-hand side does a pretty decent job: it's relatively precise, and it can push the puck where it wants it to go. The one on the right kind of freaks out. It shakes very violently (in fact, the building was shaking when he was deploying this), and it cannot really do the job. The reason is that it has overfit to the simulation, which simply is not fully accurate, even though it's calibrated to be close to the robot, and then it doesn't generalize to the actual physical world, whereas the one with physics randomizations does.

OK, cool. So that's domain randomization in a nutshell, both the visual randomization and the physics randomization.

Yeah? [inaudible audience question] Yeah, it's not very realistic, honestly. I mean, it's realistic in the sense that it's physical: if you randomize too much, your simulation will become unstable, because you've set certain parameters such that they don't make sense anymore. But it's not very realistic; the masses will
be very high sometimes, so it's super hard to move the puck. It's more about diversity, again. Yeah.

OK, cool. So I'll now speak about our approach. What I talked about previously was mostly other people's work, even though they're also on the robotics team, but this is the Learning Dexterity approach that we took. Again, remember, the goal is to have the Shadow Hand rotate an object in-hand. To start us off, I think it makes sense to give you the overview of the entire system end to end, and then we'll dive into some of the details after that. Again, as I mentioned, everything we do is only in simulation, so we never see the actual physical robot until we run on it. The way it works is that we collect a lot of data in simulation. We have many, many simulations running in parallel, which is depicted here in box A, and all of those are randomized, which is visualized by them having different visual appearances, but also think of physics randomizations, so the friction and the masses will also be randomized. Using this collected data, we end up training two different networks. One of them is a policy and the other one is a vision network, because we'd ultimately like to run this from vision alone, without PhaseSpace. The policy network is what is depicted in box B here, and the way it works is that it takes the observed robot state, which is the position of the five fingertips in Cartesian space, so 15 dimensions in total, so it knows where its fingertips are, and then also the pose of the object, meaning the position and the rotation in space. This information is then fed into an LSTM policy, so it's a recurrent policy, and it produces the next action. We train this in simulation using reinforcement learning. The second network that we have, which is actually
distinct (they are not end-to-end; these are two networks that we train separately), is the vision network. The way the vision network works is that it takes three different images (remember, we had these three RGB cameras surrounding the scene), so images rendered from the perspective of those, but again only in simulation, and then, using a convolutional neural network, predicts the pose of the object from those images. Again, this is only trained in simulation. When it comes to actually deploying this, the transfer, as you can maybe guess, is that we can combine those two systems to get what we ultimately would like. You use the actual cameras to sense the pose of the object using the vision network, so you feed the images into that, and then, having the object pose and the fingertip locations, you use your LSTM policy to produce actions. That allows the robot to basically see what is going on and react accordingly, while only being trained in simulation.

Yeah? Potentially, honestly. We have mostly used this approach because we knew it worked from previous research. It is almost as accurate as PhaseSpace, and PhaseSpace is very, very accurate. I think if you spent a lot of time you could probably develop something with more traditional methods, I don't question it, but we would like to have something that's more general, again, and having a convolutional neural network do it seemed like the most general approach we could have.

Yeah? Yeah, it's kind of interesting. Ideally you would just use whatever the robot has as joint sensing, so it knows, or it should know, what its own joint positions are. As it turns out, the sensing in the Shadow Hand uses Hall effect sensing, which is a magnetic kind of sensor, and the sensors interfere with each other quite a lot, so if your fingers are close together you will actually not know where your fingers are. So that's the reason why we don't use it. We would like to use it, but it
turned out to be not precise enough for what we ultimately wanted to do, so we couldn't actually rely on it. But yeah, you're right, this is more of a workaround. Ideally the robot should just tell us what the joint positions are, and then we wouldn't need the fingertip positions.

No, it actually has very limited information. It's very surprising that it works like that. Yeah?

Yes, yeah, very good question. There's a lot of debate about this. I don't think it does. We have some indication that it doesn't; in fact, it seems to help. The performance seems to improve across the board: we have certain ways of measuring sim-to-sim transfer, and when we randomize more we tend to get better performance on all the environments. So I don't think it's compromising, actually; I think it's more of an adaptive policy. But then there are people who disagree, so it's currently a little bit unclear.

OK, cool. So, as I mentioned, we need to randomize, and of course we use appearance randomization. This is only for the vision network, and it's basically what I described before, just for our setup. You can see we have the three different cameras showing the same scene, and we randomize the scene quite heavily: the robot changes its color, the background changes its color. Importantly, the block itself stays mostly the same, because it actually has that color; you cannot randomize that arbitrarily. But we change the material of the block as well, so it looks slightly different. And then, of course, we have the vision network, which again is relatively straightforward. It takes those three camera images, then uses convolutions, the ResNet architecture, and spatial softmax to process them, and then simply concatenates everything and produces the final object position and object rotation, so the pose of the object. This is simply
trained with supervised learning, because in simulation you actually have perfect ground truth, which is another very convenient thing: you know precisely where your object is, and you don't have to sense it at all. And this is what the model actually sees. I think it's very interesting, because it looks very, very different from the randomizations, and yet the model generalizes to it, simply because it has seen enough variety that it's OK with yet another variation that's kind of weird but still within distribution in that sense.

When it comes to the physics randomizations that we use, we randomize quite a lot of things as well. We have things like object dimensions, for instance, masses, obviously, and then mostly things about the robot itself: the way we actuate the robot, damping within its joints, and all of this stuff. The reason is that it's actually very hard to measure these. Another neat thing is that in this physics randomization you can actually account for your uncertainty. For the object dimensions, we know those with relatively little uncertainty, because we can just measure the dimensions of the block, but things like the actuation we know much less about, so we widen the randomizations for those. Another cool thing is that we randomize the gravity vector, which may seem a little weird, but it basically amounts to this: when you mount the hand, it's not perfectly parallel to the floor; it will be slightly angled because of imperfections, and by randomizing the gravity vector you get this effect as well, that it's sometimes slightly angled. That turned out to be actually very useful. And then, of course, we also have noisy observations and noisy actions, simply because that's a reality of the physical system.

The policy is very, very simple. What it gets is the noisy observations, so
that's the five fingertip positions and the pose of the object, and the goal, so it knows what it wants to do. Then we normalize a little bit, which is just making sure that things have zero mean and unit variance, and then we use one fully connected ReLU layer and one LSTM to produce the action distribution. From that we sample, and then perform that action on the robot. So it's a relatively shallow and relatively small network overall.

Oh, sorry, yeah? They only come in through the simulation; they cannot be observed directly. Sorry, they are simply set in the simulation. So the environment has been changed, but the policy cannot sense this directly. It has to infer this, basically, because on the physical robot it also cannot sense it; we don't know what it is on the physical system. What we think it ultimately ends up doing is some form of system identification: when it's running, it's implicitly inferring certain information about the environment and then using this information to adapt itself accordingly.

Yeah? Sorry, I couldn't hear. Yeah, so we add Gaussian noise to the observations and to the actions.

All right, so I think I'm running a little bit late, actually. How bad is this? OK, then we have to hurry a little bit. Cool. So, distributed training: let me speak about this, and then I'll show a video. Distributed training, I think, is very interesting, because we use basically the same system that the Dota team uses. We have a very large-scale system, and the way it works is that we have rollout workers that generate a lot of experience, and then we have an optimizer machine that uses this information to update the policy. We use proximal policy optimization for that, so an on-policy algorithm, as I think Josh explained earlier today. I think it's kind of cool that we use the same system as Dota. Let me skip over some things, but I think I
want to show this. This is when it's running on the physical robot. As you can see, it's using vision, so there are no markers on the actual object. The robot hand is doing all of this; this is not cut in any way, and it is not sped up. Again, the goal is depicted in the right corner here, so it will try to get the E face in front and the N face up top. It will get to 50 successful rotations in this case, so it can do quite a lot of those, and it can run on the physical system.

If I have enough time, one final thing that I think is very interesting is that it actually learns certain strategies that happen to have names. We have finger pivoting, where you use two fingers to create a rotational axis and then you rotate around that, and things like finger gaiting. The reason why they have names is that they are used by humans as well, and they have been studied very well. They emerge automatically in our case: we have never shown the robot what a human would do; it has discovered that itself. The reason why they come up is simply that it has a humanlike morphology, right? It has a humanlike hand, and it just turns out that these strategies are equally useful for humans and robots, but they have been rediscovered, quote unquote, which I think is a really cool thing, so I wanted to mention that.

And yeah, we have some quantitative results that show that randomizations are very important: if you don't randomize, you get no successes; if you randomize, you do. It turns out memory is very important, so you need an LSTM; you cannot simply have a feed-forward policy. And you need a lot of experience: for the policy we used 100 years' worth of data. Imagine doing that on a physical robot; probably not such a good idea. But we can get away with it because we use simulation, so we do all of this in 50 hours. And I think with that I have to close. All right, thank you.

That
was great. Thank you so much, Matthias. We're going to switch out the slides, and then please welcome to the stage the leader of the safety team at OpenAI, Dario Amodei. [Applause]

All right, just a minute to get the slides. Doesn't look quite right. Yeah, it's a very good thing that you're ensuring that computers in the future will not be as malicious. Often, solving general intelligence would be easier than solving video conferencing. It might be true. Good. Cool.

So I work on a team at OpenAI that thinks about making AI systems do what humans want them to do, which is very central to OpenAI's mission, and which we think of as something that our focus on distinguishes us from other organizations. We think it's very important, particularly as systems get more capable, to ensure that they benefit society in both a narrow and a broad sense. This workshop is called Spinning Up in Deep RL, so it's useful to step back and think about what RL has accomplished in the last couple of years and where it is going. This is actually out of date; we should add a couple of things to it. But if we look at playing games like Go; if we look at, from about a year ago, multi-agent behaviors, where you can use RL and self-play to train agents to sumo-wrestle each other off a pad; we are able to play competitively against professional players in Dota 2; the robot results, which you just saw; and we should probably add, just in the last week or two, the results we've seen on StarCraft, which is in some ways similar to Dota but just a different kind of game with different kinds of properties. That shows that these techniques are really pretty general and are advancing pretty quickly. So if we step back and reflect on where things are going, some properties that
we could point out of these RL agents, that are becoming more and more true, that were not true five years ago but are becoming more and more true: we have systems that have an extended interaction with a complex real-time environment. They have a very high level of autonomy and speed. You could imagine systems like this in the real world being used to make decisions faster than humans can intervene, or in more complex ways than humans could hope to understand. So regulating the economy or the financial system, managing large networks of computers: these are the kinds of things that, as RL technology matures, it will be better and better able to do. And these systems, unlike supervised learning systems, and unlike, in any interesting way, the simple RL systems of a few years ago, are able to teach themselves and discover their own strategies, and in many cases they discover nontrivial strategies. Just like we saw with the robot recapitulating a lot of strategies that humans use, we see in Go and Dota and StarCraft a lot of human strategies that have names that the RL system discovers and recapitulates, but it also sometimes discovers strategies that a human would never have thought of. So if we look at what these properties mean together, one thing it means is that the connection between us as designers specifying what we want the system to do and what the system actually does... in theory, if everything is done right, the system does what we want, but that rope is longer, it's more frayed, it's more tenuous than for the less autonomous systems that we've designed in the past. And there are many ways, relative to simple computer systems or machine learning systems like supervised learning, for these systems to go wrong. And so a couple of years ago
several people, most of whom now constitute the OpenAI safety team, started thinking about this. We're worried about current systems, we're worried about tomorrow's systems, and eventually we're worried about building general intelligence and what that will mean for the world, and making sure that those systems are safe. So we wrote a kind of position paper, and this started us thinking about the directions and how to even think about this problem of whether systems reliably do what we want them to do. The general framework and division we came up with was: OK, let's narrowly scope the problem. We're speaking not about wider societal impacts, although those are also important, but just narrowly: the designer had a clear thing they wanted the system to do, and then the system gets trained, it gets deployed, it goes through some long process, and the actual system fails at this catastrophically. We divide it up into a couple of things. One is that you're giving the system some direction, some objective function that it learns from, like the reward in RL. There are ways for that to be subtly wrong, and you can get spectacularly wrong behavior if that happens. Or you might have the right objective function, but your system has problems with robustness: it doesn't generalize well, it exhibits unpredictable behavior as it's learning, it does dangerous things even if the final policy it's going to learn makes sense. And then, as a reminder, this all exists on top of a software implementation that has bugs in and of itself, and so A and B are new, but they're layered on top of the general unreliability of software. So a useful way to think about it: let's put C aside, because
it's not really a machine learning problem, more just a reminder that this is layered on top of existing problems. But a crude analogy we can make is that it's a bit like the simple statistical concepts of bias and variance, right? A better objective function is about reducing bias and making sure you're aimed at the right target; robustness is about making sure that you're narrowly clustered around the target and that you always get what you're intending to get. So we're interested in both problems. Because I have limited time, I'm going to talk about our work on getting the objective function right. I think the OpenAI safety team does more of that relative to other teams, say at Google Brain or DeepMind, that think about these problems, and so I'll mostly talk about that. But increasingly, and maybe I'll have a little bit of time to talk about it at the end, we're also thinking about the robustness direction and how these two things interact.

So, just to be clear about what we mean. I think this video has been widely circulated, so I apologize to people who are already familiar with it. About a year and a half ago, we were training lots of Flash games using RL, and there happens to be this boat race game. I just set lots of games running with a reward function. The way this boat race works, you're supposed to go along a course and you're supposed to finish the course, but the way the reward function works, and it's hard to reach in and write a different reward function, is that you get points for these markers along the way that are mostly along the course. But it turns out there's this little lagoon in the corner of the course where you can go around in circles and get more and more
power-ups, and that turns out to get you a faster rate of power-ups than actually finishing the course. There's nothing wrong with RL here; the system did what it was supposed to do. But it identifies the weakness of the connection between a reward function and the final behavior. The reward function that you specify, that you may think corresponds to some behavior that you want, may in fact correspond to very different behaviors, and you get no feedback on that other than finding out what the system does, right? When I first trained this, I trained it along with a bunch of other games. Two days later I looked at this and I'm like, what in the world is this doing? It doesn't make any sense. And then I thought about it a lot, and I'm like, oh, of course, that makes sense. So the more powerful the system is, the more autonomous it is, and the less a human is paying attention to it, the more potential there is for this. I can generate dozens of these examples. Take a robotic system where we forgot to make the table totally fixed: it has a high mass, but it's not fixed. It turns out it's hard to send the puck exactly to the point you want it to be; it's easier to send the puck, observe whether it's going to be a little to the right or a little to the left, and then nudge the table so that it hits the point exactly. It's very clever. It's a correct solution to the problem, but the problem was not the right problem.

So the general approach that we've hit on, and we've been pursuing this strategy for about a year, a year and a half, is that this training loop is too long, right? The human at the beginning says, here's a mathematical reward function, go optimize this. Then you look back at the end of training: you might get the right thing, you might not. If you don't, you have to go back to the beginning, or maybe the system's already doing something dangerous. So maybe we should have
humans be involved interactively in the training process. When we train humans to do things, it's not just: here's your goal, go off, tell me what you did two weeks later. So if we do this, is there a way that we can use a human to decide what the reward function is in a continuous way that's more reliable, that's more naturalistic, so that the system ends up imbued with human goals and values, but it's able to act faster and bigger than human scale? Once it's trained, it knows what the human wants and it does it. An example of this is that instead of RL we can learn from demonstrations, but that kind of has the same problem: a human demonstrates it, the AI system copies it, and it's hard to do better than the human, it's hard to course correct, it's hard for the human to say you should be doing this instead of this. And traditional RL has a loop that's too long. So the first effort we did in this direction, we called it deep RL from human preferences. The idea is, I want this thing to do a backflip, and it's hard to mathematically specify the reward function for a backflip. We tried by looking at all the individual joint angles, and it turns out it just gives you something very awkward looking. What we do instead, and this is now running for the second time, is a human looks at the behavior of the system and says which of these is more like a backflip than the other. The system just starts by acting randomly; it has just a random reward function, and the human gives it feedback on what is more like what the human wants. Then the RL system has a reward predictor, and it tries to fit a reward predictor consistent with what the human says the human prefers. Then in the background it's running a whole bunch of copies of the RL environment, and those copies optimize the reward function that it
learns from the human. The human only ever has to give feedback on a very small fraction of the AI system's behavior. The human doesn't have to see everything it does, just has to get enough samples to give the policy an idea of what the reward function should be. So another way to put it is: the human trains the reward function, and the reward function trains the RL system. What I just said can be pictured like this: the gray part is the standard setup for reinforcement learning, where you have an RL algorithm and the environment, they exchange observations and actions, and there's a reward that kind of comes from the ether, that was ultimately specified by a designer, but that isn't thought about as being part of the problem. Here, what we have is that the reward starts out being completely random, and the human sees examples of the agent's behavior and feeds them to a reward predictor. So the reward predictor is changing and improving and adapting over time, and the RL system is both learning from the existing reward function and adapting to changes in the reward function. We did several versions of it in our paper, and we found that a simple active learning technique helped relative to random. It didn't help by that much, but it helped. The idea is you train an ensemble of reward predictors that are trained on subsets of the data, and that allows you to have semi-independent predictors, and you can pick examples where the predictors are uncertain, meaning that those are parts of the space or situations where the reward predictor has more uncertainty and so would like more feedback from the human. That helps. You can go much more sophisticated in that direction. The system could ask the human, what am I doing that's wrong? What am I doing that's not clear? The human could say to the system: I'd
like you to produce some examples of this, right? And then it becomes much more like a human teacher to human pupil teaching process, and a lot of what we're doing is kind of going in that direction, but we kind of have to start. So, imitation learning has the following limitations. When you do imitation learning, except for noise reduction, which is usually a small effect, you can't perform better than the human does. As we'll see in some future tasks here, there are cases where learning from preferences allows you to perform better than the human does. The reason for that is, with imitation learning you just do what the human does; here you learn what the human wants, and once you learn the reward function you could do it better than the human. So consider something like this: if I didn't know how to play Go, I can teach you the rules of Go, and then you can do RL on the rules of Go and get much better than me. Or you can just copy my moves. If you're just copying my moves, you can never do better than me. If I teach you the rules and then you use RL to learn how to play, you can then, in principle, do better. Another difference is you tend to get better sample efficiency. You can come up with strategies that a human wouldn't have thought of. And many tasks a human just can't do. This backflip task is actually very hard for a human to demonstrate; you'd have to get a VR setup. And if we look at the tasks of the future, let's say I want to defend a large corporate IT network or something, and I want to respond to threats in real time. That's just something where I can't get training data from a human. I'm asking the machine to do things that a human can't do, which is what
we ultimately want AI systems to be able to do. Does that kind of answer the question? Yeah.

Another one is: what if the human doesn't know the preference, maybe intentionally or unintentionally?

Yeah. So we have an option in this paper for, basically, "I don't know". Or I think we had separate options for "I don't know", where it just throws out the data, or "these two look about the same", in which case it weights them equally in the predictor. So that's easy to incorporate. I think ultimately the communication needs to be in terms of language and not in terms of clicking left or right, and then that will make a richer space for doing things, and saying "I don't know" or "show me some other examples" or "these things aren't comparable at all" will then become much more common. So the nice thing about this is, given an environment, without changing the code at all, only changing what the human provides as feedback, you can get totally different behaviors. In about half an hour a human can train this RL system. This is the simple Atari Enduro game. I can train it to do the usual thing, which is to race ahead of all the other cars, but I can also train it to go at exactly the same speed as other cars. And when it does that, it's able to very effectively stay exactly even with other cars, which isn't easy: you have to go at exactly the same speed and match their speed. So it's the exact same code; the human just provided different feedback. One thing we show is, if we don't give you the rewards for Atari games, we just hide them from you, humans giving feedback, basically trying to get the system to get the highest score that it can, works really well. On the right of each panel, those colored bars that are moving represent how much reward the system thinks it's
getting, or just how much it thinks a given action is good. So if you look at the Breakout case, on the left, when the ball hits the paddle instead of the ball going to the bottom, it says, yeah, I got a lot of reward from that. Same with Pong. When it surfaces to get oxygen in Seaquest, it's a very high reward level. So the predictors seem to correspond to what a human would say is good behavior, which is not surprising because a human trained them. We did a bunch of experiments, and with fixed-reward Atari games your goal is just to do as good as you would if you knew the reward. You're hiding the reward from yourself and you're trying to learn the reward from a human. Most of the time it does almost as good, but actually there are cases where it can do better. In Enduro, the algorithm we used, A3C, has trouble learning Enduro because of sparsity of the reward, but a human actually helps to shape the reward. In Enduro you have to kind of rev the control stick to go at a certain speed in order to get any reward at all. So you can start to move and the RL system doesn't give you any reward, and then you have to keep moving faster and faster to get reward, and some algorithms never figure that out. But the human will basically say, okay, yeah, you went ahead, you made progress, that's better than when you're not moving. And so, little by little, with just a few feedback points, that can lead the system, and so the human can shape the reward. And there are actually cases, like the curve for Enduro in the bottom right, where you can actually do better than a standard RL algorithm did, even though you had less information: instead of
knowing the right reward function, you just had a human indicate the reward function. It also works for a bunch of simulated robotics tasks; we haven't really tried it in the real world. Relevant to the question about demonstrations, we actually followed this up with an effort combining human feedback with demonstrations. What that did is: there's some task, a human can do it, but we'd like the RL system to do it better. However, we can initialize from human demonstrations. The AI system copies that, but then on top of that initialization we run RL with human preferences. So there's no programmatic reward function anywhere; it's entirely learning from humans. But in the first step the human demonstrates, and then in the second step the AI system copies and the human says it would be better if you could do it this way. And again, the second step allows you to exceed human performance or do tasks that humans can't do. The human's like, this is as well as I know how to do it. The AI system copies that. The human says, okay, well, I wouldn't be able to do this myself, but if you moved back and forth really quickly and shot those two ships, that would be better than if you didn't do that. The AI system is capable of that, and so it can bootstrap itself to beyond human capabilities. More recently, and we don't have any work out on this but I think we will soon, we've started applying this to natural language. In the last year or so there has been a lot of progress on large language models like OpenAI's GPT and Google's BERT, where you just take a big corpus of text and train a big Transformer model to predict the next word or the next token. That allows you to generate very coherent text, and it can also be fine-tuned to solve a lot of linguistic tasks. So one idea there
is: can we fine-tune that via RL from human preferences? I have a language model. It's seen a lot of text: some of it's happy, some of it's sad (five minutes, yeah), some of it's formal statements or informal statements, some of it's jokes. The language model maybe has some idea in its internal representation of the difference between those things, but if I just sample from the language model, it just kind of gives me random samples of stuff. So can I push this language model in directions, and to produce behaviors, that only a human can specify, that can't be specified programmatically? So things like statements that rhyme, or statements that are in iambic pentameter. Could you make a system that, from the logic of learning from human preferences, is a better poet than any human could be, or something like this? Or one that makes very positive sentiment statements, where it's hard to find enough positive sentiment statements to copy from? So that's the direction we're going in. And then I think a long-term vision for it would be: we want a system that basically has an ongoing dialogue with a human. The human asks it to do something really complicated, like planning and executing a mission to Mars. The system clarifies and asks for instructions while it's learning and while it's doing the task, and we make sure that things like pathological solutions to the problem don't happen. One way to get to Mars really quickly is to escape from Earth and propel yourself by dropping a bunch of nuclear explosions back at Earth. That would work; that would get you to Mars. There was this project called the Orion Project in the 1950s, although the plan was to detonate the nuclear weapons when they were far away from Earth. But this is not a solution we would favor. How do we make sure that
AI systems don't do things like that? Cool. So I've only talked about a subset of what the safety team is working on, but we have around 15 members here. Some of these efforts were done in collaboration with DeepMind and various academic institutions. We have a number of interns and faculty affiliates, but the safety team is continuing to hire, and we're interested in further advancing these and other areas.

Thank you so much, Dario. Hello everyone. We are now at the conclusion of today's morning talks, but before we break for lunch, I would like to invite all of the volunteers who are joining us today from OpenAI and Berkeley and newa school to please come up to the front. As we proceed into the afternoon hackathon and breakout sessions, these will be the faces that will be around to help you, that you should ask questions to. These people are all talented researchers or contributors or engineers in this space. Many of these people are employees of OpenAI, and I think the only person here who's not currently employed by OpenAI was previously employed by OpenAI. So if you want to pick our brains about what it's like here, what we do, why it matters, please feel free. Can we just have everyone give maybe a sentence to introduce themselves?

Sure. I'm Daniel. I work on the safety team as an ML engineer, working on the language fine-tuning project from human feedback. Matias, I'm on robotics. I'm Ethan, I'm on the safety team working on model-based RL and safe exploration with Josh. I'm Carl, I'm on the games team, primarily studying transfer learning and procedurally generated environments. My name is Dylan, I'm a PhD student at UC Berkeley and I mainly work on preference learning. I'm Amanda and I'm on the policy team here at OpenAI. I'm AR, I work on the safety team on
safe exploration.

All right, and another thing that I want to say: thank you all so much for being here today. Something that I hope we can do is really make this a useful experience for all of you, and I hope that over the course of the day you give us feedback about what you find helpful and not helpful, and what it is that you're hoping to get out of this experience, so that we can figure out how to help you get to that. And thank you so much. Please enjoy lunch. [Applause]
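
Editor's sketch: the safety talk describes a reward predictor that is fit to human comparisons between pairs of behavior clips, with an "I don't know" answer that throws the data out and an "about the same" answer that weights both clips equally. The Python below is a minimal illustration of that idea under stated assumptions, not the code from the talk or the paper. The linear reward over hand-made features, the logistic (Bradley-Terry style) preference model, the simulated human, and every name here (`segment_return`, `preference_prob`, `train_reward_predictor`, `true_weights`) are hypothetical choices made for the example; the work described in the talk used neural network predictors and real human feedback.

```python
import math
import random


def segment_return(weights, segment):
    """Predicted return of a clip: sum over steps of weights . features."""
    return sum(sum(w * x for w, x in zip(weights, step)) for step in segment)


def preference_prob(weights, seg_a, seg_b):
    """Modelled probability that the human prefers seg_a over seg_b.

    Assumption: a logistic function of the difference in predicted returns.
    """
    diff = segment_return(weights, seg_a) - segment_return(weights, seg_b)
    diff = max(-30.0, min(30.0, diff))  # clamp so math.exp cannot overflow
    return 1.0 / (1.0 + math.exp(-diff))


def train_reward_predictor(comparisons, n_features, lr=0.05, epochs=50):
    """Fit reward weights by gradient descent on cross-entropy.

    Each comparison is (seg_a, seg_b, label) with label 1.0 if A was
    preferred, 0.0 if B was preferred, 0.5 for "about the same", and
    None for "I don't know" (those comparisons are dropped).
    """
    weights = [0.0] * n_features
    data = []
    for seg_a, seg_b, label in comparisons:
        if label is None:
            continue  # "I don't know": throw out the data
        # Difference of summed features between the two clips.
        feat_diff = [
            sum(s[i] for s in seg_a) - sum(s[i] for s in seg_b)
            for i in range(n_features)
        ]
        data.append((seg_a, seg_b, label, feat_diff))
    for _ in range(epochs):
        for seg_a, seg_b, label, feat_diff in data:
            p = preference_prob(weights, seg_a, seg_b)
            # Cross-entropy gradient for a logistic model: (p - label) * diff.
            for i in range(n_features):
                weights[i] -= lr * (p - label) * feat_diff[i]
    return weights


# Simulated "human": prefers clips with a higher hidden true return.
rng = random.Random(0)
true_weights = [1.0, -1.0]  # hypothetical stand-in for what the human wants


def random_segment(length=5):
    return [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(length)]


comparisons = []
for _ in range(200):
    a, b = random_segment(), random_segment()
    better = segment_return(true_weights, a) > segment_return(true_weights, b)
    comparisons.append((a, b, 1.0 if better else 0.0))
comparisons.append((random_segment(), random_segment(), 0.5))   # about the same
comparisons.append((random_segment(), random_segment(), None))  # I don't know

learned = train_reward_predictor(comparisons, n_features=2)
decided = [c for c in comparisons if c[2] in (0.0, 1.0)]
accuracy = sum(
    (preference_prob(learned, a, b) > 0.5) == (y == 1.0) for a, b, y in decided
) / len(decided)
print("learned weights:", learned)
print("agreement with simulated human:", accuracy)
```

In the setup described in the talk, the learned reward would then be handed to an RL algorithm running many environment copies in the background; this sketch stops at the reward predictor and does not model that loop or the ensemble-based query selection.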

Original Description

Opening & Intro to RL, Part 1, by Joshua Achiam at 25:11
Intro to RL, Part 2, by Joshua Achiam at 1:48:42
Learning Dexterity, by Matthias Plappert at 2:26:26
AI Safety: An Introduction, by Dario Amodei at 2:58:00

Recorded on February 2, 2019. Learn more: https://openai.com/blog/spinning-up-in-deep-rl-workshop-review/

The OpenAI Spinning Up in Deep RL Workshop provides a comprehensive introduction to reinforcement learning, deep learning, and their applications. The workshop covers various techniques, including Q-learning, policy gradients, and domain randomization, and highlights the importance of human feedback in shaping reward functions.

Key Takeaways

Set up a reinforcement learning environment
Implement Q-learning and policy gradients
Apply domain randomization techniques
Integrate human feedback into RL systems
Develop and optimize deep learning models

💡 Human feedback is crucial in shaping reward functions and ensuring that RL systems produce desired behaviors.
