RL Course by David Silver - Lecture 5: Model Free Control

Google DeepMind · Beginner ·🎮 Reinforcement Learning ·11y ago

Key Takeaways

Covers model-free control in reinforcement learning with David Silver's lecture

Full Transcript

in some sense you know everything in the course up to this point has been leading to this lecture okay we're going to finally find out how if you drop your robot or agent into some unknown environment and you don't tell it anything about how that environment Works how can it figure out the right thing to do how can it maximize its reward in that environment how can it figure out the actions that lead to the most possible future reward um without telling it anything in advance about how this environment Works um so that's the goal of this class um model free control and after we've seen this class in the subsequent lectures we'll look at how to scale up how to get to more realistic large scale domains but the the core techniques are essentially building on what we saw in the last lecture and bringing those into a control setting now so roughly the proceedings for today we'll have a brief introduction and then what we're going to look at is two different paradigms for doing control on policy and off policy um which we'll understand the distinction between those um essentially on policy means learning sort of on the job um whilst you're following the behavior that you're learning about and off policy means learning whilst following someone else's behavior um and so we'll start with on policy which is a simpler case and again just like last lecture we're going to follow the format of starting with Monte Carlo control before we move on to temporal difference learning to get more efficient lower variance techniques um so how are we going to do that well um last lecture we basically built up the toolkit that we're going to need um last lecture we saw how to do model free prediction how to evaluate a given policy if you drop your robot into a given mdp and you ask it how much reward will it get if it follows a particular Behavior last time we saw how to do that using Monti Carlo evaluation and and temporal difference learning um and that just helped us to estimate the value function of some unknown mdp and now that basic toolkit is what we're going to use to now do control methods where we want to not just estimate the value function but actually optimize it find V Star find qar find the best possible U Behavior find the most reward you can extract from this MVP but we're going use the same tools and we're just going to iterate them and essentially use what we had before kind of as an inner loop for getting to the best possible behavior um so just to get you back in the flavor I know we've had a um a week off reading week so so I think it's helpful just to get um a reminder of why this is interesting you know why should we care why should we care about this particular flavor why why this problem of unknown mdp reinforcement learning and the answer is that in most interesting problems and I put up just a few interesting problems here that that you know fall under this umbrella so you know controlling an elevator getting some car to park itself um trying to automatically steer some ship or control a bioreactor or um navigate a helicopter or do the logistics for an airplane or play RoboCop soccer or control a bot in Quake or manage a financial portfolio or learn how to fold proteins or figure out how to make a robot walk by itself or to play a complicated game um all of these problems um have mdps there are mdp descriptions of these problems in each case there's some complicated environment in which this problem can be described where there are some underlying Dynamics and some environment that has some state that um tells you how it's going to move around in the state but either um that thing is unknown to us we don't know what that environment is and so we have to resort to model free control to figure out how to do this um or the mdp might be known but it just might be so complicated um that it's actually too big to use except to sample it and if you can only sample it you're essentially back to the same case as model fre control where where you really the only way you can access this model is by just trying things trial an error and see what happens because you can never do even one step of integrating over some vastly complicated model might be too expensive um you know the the real world's an example where you know the the underlying dynamics of the real world are extremely complicated even if we knew them you wouldn't really want to work at the atomic level and you know integrate over the the Dynamics of every atom moving around you would just take samples see what happens and work with that so model free control helps us to solve problems like these so that's why this lecture is interesting I hope okay so we're going to look at this basic um dichotomy between on policy learning and off policy learning so let's just understand what that means now so on policy learning I like to call it learning on the job so it's like you've got some policy you follow that policy and while you're following that policy you learn about it um so basically you actually a sampling experience by sampling actions from some policy pi and at the same time you're trying to evaluate that same policy that's what we mean by on policy planning that the actions that you take um determine the policy that you're trying to evaluate but that's not the only choice we can also do off policy learning which you can kind of think of as looking over someone's shoulder so you might have um you know some robot which is watching some other robot walk around and it's trying to figure out the optimal Behavior by just observing some other robot or maybe even a human behave so so maybe a human gives some example demonstrations um maybe some teleoperated arm and the human says this is how you should move this arm around um and now just from the trajectories that have been sampled from the human behavior um we could actually do off policy learning and we could learn to evaluate um the robot could start to learn for itself well what would happen if I did something else you know maybe there's some better Behavior and the robot can learn about that better behavior from samples of experience drawn from say the human behavior you don't have to learn about the behavior that you sample you can learn about other Behavior that's called off policy learning okay so we'll start with a simpler case on policy learning um and the main idea that we're going to use throughout this course is the same idea that we saw um a couple of um Leo when we're doing dynamic programming um and it's the basic framework of generalized policy iteration so just a quick refresh the slide so the main idea in um policy iteration is to alternate between two different processes where first of all we evaluate a policy um and then we improve that policy and we have this iterative process which I kind of um sketched up on this diagram here where you start with some value function and some policy and every step every time you go up um you basically evaluate your current policy to get you a new value function and every time you go down you act say greedily um with respect to that value function just greedy and so this alternating process we saw when we were dynamic programming actually does Converge on the optimal policy and the optimal value function um and it's just this cycle that can go round and round and round and the question is well what can we slot in for these two processes and what we're going to do today is we're going to vary the choices of what slots into these two pieces here um so to enable us to do model free control rather than in dynamic programming where everything was using full knowledge of the of the mdp model um so so far we've only seen in the dynamic programming examples of these where we our up arrows were some way to evaluate the policy and we saw things like iterative policy evaluation and the down arrows were things like greedy policy Improvement and now we're going to vary those two components and try other choices um so so let's make a first try okay let's say you know we want to do the simplest case let's start with Monte Carlo learning um so we're doing basically um Monte Carlo control that's this section we're in that chapter of the um of the lectures um and and so we're going to start with Monti Carlo control later we'll move on to TD control now if we're doing Monti Carlo control is it possible to just slip that in can we just swap in Monte Carlo policy evaluation instead of our dynamic programming that would be a way to evaluate the policy so remember Monte Carlo evaluation is just you just execute some trajectories and take the mean and that's your value of your of of your value function so what about doing that is our process for estimating the value of the policy what would happen if we just ran some trajectories Ed those trajectories to estimate v um and then applied greedy policy Improvement would that work any thoughts any problems with this idea do people understand what what's the suggestion is that we could take this General policy iteration framework and plug in a model free step for policy evaluation to just run some trajectories in the unknown environment see how well we do use that to estimate the value of the policy and then iterate by improving our policy um evaluate the new policy using by running some more trajectories evaluate our new Policy Act greedy with respect to that run to more trajectories Etc um and according to our um our knowledge of generalized policy um iteration this has to converge on the optimal policy so what's wrong can anyone think there's actually two problems with this okay good so if you're if when you're running your trajectories you follow your your policy then you're not going to EXP okay great so that's one of the two problems um we'll actually come to that one second so um but the point was which is correct is that there's an exploration issue um which is that if you act greedily all the time you don't guarantee that the trajectories that you're following explore the entire State space and so it might be that there's parts of the state space which are really interesting and actually have better um potential that you just never see okay there's one more problem with this any thoughts so that's actually the problem with the second step any problems with this first step with evaluating V in this way yeah you need to run a whole load of Trials to get a good measure of what the best best policy would be any given okay so so the point was that maybe it's slow to do this it might be um you might need a lot of episodes to actually make this work that's true and we'll improve on that by using temporal difference learning later um but let me move on so so so the issue is that we're trying to be model free actually um we want to be model free but how can you be model free um when you're using the value function V if you use the state value function and you only have the state value function and you want to act greedy with respect to your state value function you still need a model of the mdp to figure out how to act greedily with respect to that value function you only know the value of each state and so if I'm in some State and I want to know well which action is best I have to imagine what would happen for each possible action I have to roll forward the Dynamics one step to look at the value function my successor State we don't know those Dynamics we're trying to be model free okay so if you work with the state value function V you always need a model in or in order to do your policy Improvement because you need to have some one step of look ahead to get to your next state so the alternative is to use Q the alternative is to use action value functions action value functions enable us to do control in a model free setting and the reason is that we can do policy Improvement in an entirely model free way like if we have q and we want to do um greedy policy Improvement all we need to do is maximize over our Q values so we've got the Q values the Q values tells us from each state how good is it to take each different action now if we want to make a better policy what do we do well we just pick the action that maximizes our Q values that's the best action it's immediate you don't need a model to roll anything forward it's already there in our in the structure that we've we've created so by caching the values of our actions we um kind of reduce the burden of what we require from our from our our model we don't need to look at the model anymore we don't need to roll it Forward is that clear okay so so let's let's drop that in so what would that look like now so the proposal then is that we have now an algorithm which looks like this um where we start at with our a q value function now an action value function and some policy and every step we're going to do Monti Monti Carlo policy evaluation run a bunch of episodes um at the end of those we've got some estimate of the of our policy of Q Pi so we run a bunch of episodes we take the mean um from each state action pair we look at the mean return to estimate qsa um and that tells us how good that particular State action pair is taking that action from that state what is the mean return that you got do that for all states and actions that gives us a new value function we then act greedily with respect to our Q that's immediate we just take this argmax over our Q values that gives us a new policy we run that policy for a bunch of more um episodes um take the mean of all our state action pairs again and iterate um and again this results in um qar and piar or does it and the issue has already been raised by um someone in the audience um which is we still have one issue which is that we're acting greedily as our policy Improvement step and if you act greedily you can actually get stuck you can actually not see the states which you need in order to get get a correct estimate of the um of the value function so we're basically estimating the value of all state action pairs by Tri and error by running episodes but if we don't see certain States and actions then we won't evaluate them correctly and we won't select them because we haven't see how good they are um so that's the distinction now that we're running things we're not doing these full sweeps that we were seeing in dynamic programming U we're running things by actually interacting with the environment so we need to make sure that we continue to see everything so here's an example to make that a bit clearer okay um so this is something from the academic world where you can imagine there's some situation in life where you've got two doors and you're trying to pick the best door so in this cartoon one of them behind that door is tenure or one of them has tenure and one of them is flipping burgers at McDonald's um so so now we have to basically figure out by trial and error which door to go through so let's see say you get repeated trials and let's say the outcome of these doors is actually stochastic okay so you're just going to get some immediate reward this is called a the Bandit problem we'll see more examples of this in the subsequent lecture when we focus really on exploration but it's a bandit problem you've got two choices um door door one or door two um so what happens now if you open the left door um and you see a reward of zero okay so you get this reward of zero we going through the left door um so now think okay zero wasn't so great maybe I'll try the right door so you open the right door and you get a reward of plus one so if you're acting greedily there's only one logical thing to do which is to open the right door again let's say it gets a um a reward of plus three um so now your mean um after these steps you've seen 1 plus 1 1 plus three so your Monte Carlo estimate of the value of that door is now plus two the mean of those two um episodes that you've seen so plus two is clearly better than zero so if you're acting greedy you would pick it again um maybe you get a reward of um plus two again um and now you've got a mean of two um so you'll open it again um and then maybe your mean is just going to fluctuate a little bit around two let's say you always get a reward between say one and three um you'll keep on opening this door forever and the problem is that actually we don't really know what the value of this door is we've only tried it once and so you need to carry on exploring everything to make sure that you understand the value the true value of all of your options if you stop exploring certain actions you can end up making incorrect decisions getting stuck in some local um incorrect Optimum your beliefs can be incorrect and you need to carry on exploring to make sure that you really understand the values correctly okay so to people see how you can get stuck in this situation and just um keep on Mis evaluating so how do we do that how do we make sure that we carry on um well there's going to be a whole lecture on this but we're going to start with the simplest possible way to guarantee um that you uh Vis continue to visit all states and all actions Ely often um and so this is the simplest idea for ensuring continual exploration it's actually surprisingly hard to do better than this simple idea um although there are lots of more very U much more sophisticated algorithms and the simplest idea is what's called Epsilon greedy exploration and it's very simple all it says is that you flip a coin um with probability Epsilon um you choose a random action completely random a uniform random action with probability 1 minus Epsilon you choose the greedy action okay so you flip a coin if it comes up heads you act completely randomly to make sure you see everything if it comes up taals you choose the best thing that you you you know so far you take your your AR Max over your Q values so it's just like our greedy um policy Improvement but we've softened it a little bit to make sure that with some small probability you try all the other actions um so this might seem naive but it has some nice properties it guarantees that you continue to explore everything and it guarantees that you actually improve your policy as well as we'll see shortly so this is just a fancy way of saying precisely that you know this is saying the probability is if you flip the coin um so this is this Epsilon over m is the probability of what happens if it comes up tals and you you explore um and this one minus Epsilon um is the probability of actually if he comes up heads pick the greedy thing but you might also pick the greedy action just by random as well if if this tail thing comes up which why I have a bit more probability Mass on this one but the idea is very simple you just either take the best action or you explore it random okay so one of the reasons that Epson greedy is is a nice idea is that it actually guarantees that we get a step of policy Improvement so remember for this generalized policy iteration idea we want to alternate between steps of improving our policy and steps of evaluating our policy and so what we can see is that Epsilon greedy actually is a policy Improvement like you you start off with one Epsilon greedy policy and now the question is if you evaluate that policy so you start off with pi that's your previous Epsilon greedy policy if you evaluate that to get your V Pi the question is can we be sure that by taking Epsilon greedy step like by acting Epsilon greedily with respect to V Pi can we be sure that we've really made a better policy um and the answer is yes um so a simple proof here which basically says that um if you look at so we're just going to do the same idea we saw in the dynamic programming lecture where we're just going to prove this over one step and then we're going to argue that that whole thing telescopes by unrolling it using the B equation so over one step we can see that the value of our normal of our previous policy of our original Exon greedy policy if we take one step of our new policy so we want to show so that this thing here taking one step by new policy is basically um better than than our original policy the new value of our original policy so this is just saying well this is just an expectation um so this is our Epsilon greedy policy um we're going to say for one step there's some probability that we're going to take each action um multiply by the value of that action this is just unrolling the expectation and we can split that into U the probability of taking the greedy action um plus the probability of taking all other every action so this is what happens if you choose to explore and this is what happens if you choose to act greedily um and now what we can say is that well this thing of where you choose to act greedily if you act greedy if you choose the max the max of all your Q values has to be um at least as as good as any weighted sum of all of your actions right the max is better than any weighted sum of your of your actions of your Q values um so we're going to choose one particular waiting that sums to one here um and we're just going to say that the Max has to be better than that weighted sum of your Q values and now you can just collect terms together and we see that we've got 1us Epsilon on oneus Epsilon so this term over here now just cancels with this term over here um and you're just left with an expectation over your Q values which is your original value function okay you can look at this afterwards but the main thing I want you to take away is this idea that Epsilon greedy very simple idea but it really does guarantee that you're making progress so you really do make progress you start off with some Epson greedy policy um you evaluate it you make a new Epson greedy policy and it's definitely better than what you started with or at least as good as what you started with okay and then for the final to telescope it um refer back to um the dynamic programming lecture that explained how how this telescoping argument works it's the same idea okay so that's the one slide of math go on question this um proof you've just shown us doesn't really tell us anything about the rate exporation really because I think if you did like a um completely greedy policy then that would actually you know prove that's better than anything not Tak into account so I'm not sure I the question so the question can we say anything about how how frequently you need to like what like the how Epsilon effects is it telling us anything about um in our real problem when we execute our real problem how much we should be exploring um no this doesn't really in itself tell us anything about how rapidly we should be exploring so um but in general um we'll see a little bit about that in a second that you need to choose a schedule for your Epsilon and you need to make sure that that decays to zero roughly but we'll see um so so let's let's see how to plug that in now so let's let's we've got we've got two ideas now we started with this generalized policy iteration framework and we basically we've changed each of these steps now we have these two processes so for the Pol policy evaluation we're now plugging in Monte Carlo as our as our way to evaluate our policies using Q using the action values so we've got this one process here where we just run some episodes using our current policy um we estimate the value of all states all actions from all states um just from what we've seen so for every state action pair we look at the mean return we've seen that's our evaluation procedure um as someone's pointed out that might be a little bit inefficient and we'll see how to improve that shortly but now we've also changed our policy Improvement procedure to use this Epsilon greedy policy Improvement where we basically have this soft uh this softening of the greedy idea so every time we go down we're not getting to the completely greedy policy we've got a little bit of exploration left and so the policy is always to St Astic along here now and that stochasticity in the policy ensures that we at some rate at least explore everything in the environment now as pointed out by the question there um the rate at which you actually see it might be very very slow still there might be some um states that are way off down some strange trajectory and you have to explore and explore again and explore again to be able to even see those States um so this doesn't actually give you very strong guarantees of how long it takes to explore all of those States it just says ASM totically at least this thing this thing now really will um finds the optimal policy at the end which should say Q star this should all be Q rather than V okay so let's try and make this a little bit more efficient so the first thing to noce um again we saw in the dynamic programming lecture that it's not necessary in these kind of policy iteration Frameworks it's not necessary to go all the way um to the top of this line every time it's not necessary to fully evaluate your policy um sometimes you can just spend a few steps to to evaluate your policy and you've already got enough information there to guide you to a much better policy without wasting many many more iterations on Gathering more information um so what would that look like in the context of Monty Carlo well let's take it to its extreme and say why not do this every single episode so we're going to run one episode going to make the robot do one episode collect all the steps along that episode update the Q values just for those steps so basically get one return and so for every state and action that we've taken along that return we're going to update the mean value just of those visited States and tried actions along that episode so one episode one sequence of updates to our returns um and then improve our policy straight away why wait why wait to get more episodes of information when you can already improve the policy so the idea is always to act greedily with respect to the freshest most recent estimate of the value function like if after one episode you can already update the value function to something slightly better then why continue using your old estimate of the value function to generate Behavior you may as well use your new updated value estimates to continue generating behavior and so we basically we Chang the sort of um the the time the rate at which we um operate in this diagram becomes more frequent the frequency increases we don't go as high up we just run one episode um and then we immediately improve the policy one episode immediately improved the policy run one more episode improve the policy is that clear okay so this question's already come up and it's a natural question which is how can we really guarantee um that we find the best possible policy like what we really desire is pie star we really want to know the best possible behavior in this environment um so to do that we have to kind of balance two different things we need to make sure that we continue exploring to make sure that we don't exclude things which are better but we also want to make sure that asymptotically that we get to a policy where we're not exploring at all anymore because we want the best possible policy and the best possible policy will not include this random behavior that is extremely unlikely to be optimal um so how do we balance those two factors um and so one idea for balancing those two factors is called this um Glee idea greedy in the limit with infinite exporation and the idea of Glee is basically to come up with any um schedule if you like for for exploration such that two conditions are met the first condition is that you continue to explore everything that you basically make sure that all states and all actions are visited infinitely often um so that's like guaranteeing that you just um that you never miss out on anything that that just to make sure that every little part of the state space will be seen um every action from every part of the state space will be tried um so for example Epsilon greedy has that property that eventually if you behave randomly and you you try all possible actions then every um every reachable state in the state space and every action from those reachable States will be tried okay um and the second property is that we want to make sure that the policy eventually becomes greedy and needs to become greedy because we need this thing to satisfy the Bellman equation and the Bellman equation is basically the bman optimality equation basically requires a Max in there not just some soft thing and we want to make sure that eventually we're acting greedly with respect to Q we're really taking the max of the arax of our Q values eventually and so one way to achieve this by no means the best way but but certainly one simple idea um is to choose an Epsilon greedy policy and just Decay Epsilon slowly toward zero um so for example if you choose a hyperbolic um schedule so each episode you basically on the Ki episode you set your Epsilon to say one over k um that satisfies these two properties that you will see everything infinitely often um but asymptotically you become closer and closer to to Optimal closer and closer to the greedy policy okay so that's one way to balance things so what does that look like in practice um well we can make an algorithm now um it's called this Glee Monte Carlo control um so this algorithm basically we start off by sampling episodes um so again we've just got our robot we throw it down we run one episode um that generates a trajectory of States actions rewards State action reward until we get to some terminal State um we sampled that from our current policy pi and then for each state in action that I visited so I got to this state and I picked this particular action for each of those States and actions we update um our action value and the way that we do that is just by counting how many times we've seen that state action pair um and doing an incremental update to the mean so this is just our incremental update to the mean which says that if we consider that particular State action pair this one over here this state and this action um how many times have I tried that um well let's just adjust um our previous me which was this we're going to update a little bit in the direction of the return that we just got and the amount that we have to adjust it to get a correct mean estimate is this one over n now what's interesting about this is we're not taking a mean um in the way that you think of a it's not like a statistical mean um over um some IID quantity now that the policy is actually changing over time we're iterating over our policy we're improving our policy but we're basically taking return returns from better and better policies into account so as we improve our policy we're Gathering more and more statistics our policy starts to get better and we collect all of those statistics together to get us one overall mean of how good it is and the Glee property basically ensures that over time that these statistics that we're collecting really sort of converge on getting the mean return for um this these policies sort of get more and more like the greedy policy and so we basically find out more and more information about the policy that we really care about that the past kind of gets dominated eventually by um by these new policies um and so what we're going to do now is we're going to iterate over this whole process so we're going to sample the Casi episode in this way um update our Q values and then improve our policy so this was the policy evaluation step now the policy Improvement step just looks like this where we can for example set Epsilon to its new value one K and now we're just going to act Epsilon greedily with respect to these new Q values and those Q values are going to be the same everywhere apart from the states and actions that I just visited so we only update those Q values along one trajectory but that's already enough to change the behavior that we we actually see there so in practice if you're actually writing this you know you don't need to explicitly store Pi you just store q but when you actually come to select actions you always just look at the freshest most recent q and you flip a coin and you either act um you either pick the maximum Q or you pick a random action um so this Pi is kind of implicit and all you need to remember is your one overk schedule that you're using to determine the bias of this coin that you're using and this algorithm actually finds the optimal action value function so this is our this is really you know this is our first full solution this is something which you throw it into any mdp and you just let loose with this thing um and it will find the right solution and it's considerably more efficient um to run it in this way um where you actually update the Q values every single episode you update your Q values and then immediately improve your policy rather than having to generate a whole batch of episodes thousands of episodes to just get one evaluation of que so this actually works much better in practice okay are people clear about this algorithm like the idea it's a good time to ask questions if you're not because we're just going to keep on adding to these ideas okay good I'll take silence to mean comprehension um okay question and I can't remember if you addressed this in an earlier lecture but is there any particular um idea behind the initialization of the values for Q or do you do random initialization okay so the question is does it matter how you initialize Q um and in terms of the theory you can start with anything it will still converge in practice of course it helps to have a good estimate of Q the closer you are to the optimal Q values the faster this thing will will um will converge um it's a bootstrapping uh sorry it's not bootstrapping like with TD but still you um would basically starting from these Q values and we're we're updating incrementally from um ah actually sorry sorry that's not true with this particular schedule let me let me back off that we'll see that for later versions where we have an incremental update um so for this particular algorithm it makes no difference because we're just taking the mean um and and so the very first time you see a new state action pair you will replace any Q value that you started with completely um because n will be one you completely replace your old Q value with the with that return that you saw right so for this particular algorithm it makes absolutely no difference what your initial Q values are in practice um and in the sub the rest of this lecture we'll see algorithm where you don't have a like this perfect mean instead you have some non-stationary estimator that has something like a constant Alpha here and for those algorithms it matters much more what Q values start with because you're incrementally moving on from that initial start value and then your start value the closer you are to Optimal the better so yeah sorry correct myself okay question then I'll move on you said it's more efficient to upate for every episode yes but um if you do it for many epis you to okay so so that's a great point so so I guess yeah I'm just focusing on serial computation for now um there are many possibilities for exploiting parallel computation and you're right that you would want to that parallel computation introduces constraints on what you can and can't do and um and you may want to it's almost I think the same is roughly still true with parallel computation that you still want to use the freshest possible Q values that you can um but with parallel computation um those might necessarily be a little more stale than the very latest episode to come in because you simply don't have access immediately to all of that parallel computation okay so let's just do a quick example so we had this multical example in a previous lecture um in um last lecture we just looked at evaluation we basically said okay if I send you into a casino and I tell you that you have to um stick um and let um basically unless you're always going to ask for a new card unless you're on 19 or 20 or 21 I think was the policy we had before um and then I want to know how much money would you win or lose in that casino so that's what we looked at before and we had a particular plot of the value function um and it had a particular shape to it and now what we're doing is we're actually running this um algorithm we saw on the previous slide um to generate the optimal value function so now this is the question you probably really want to know which is you know if I walk into a casino what policy should I play to get the maximum possible amount of of of reward um and again real casinos use slightly different rules don't try this um but I'm not responsible if you lose all your money um and and so this basically tells you um now we've run our Monte Carlo we've iterated through several process is where each episode we update our value function a little bit we actually store Q internally but just to make the plots um easy to look at we're now basically summing over all of the actions to estimate varar from from our Q so you basically have some Q um that generates some Epsilon greedy U policy and then you alternate backwards and fors by running these episodes just like we saw in the last algorithm and you end up with a policy which looks something like this U which says that if you don't have an ace in your hand um you should actually have this quite interesting looking policy is the optimal policy to follow where depending on what the dealer has here and what cards you've got you know it's this threshold uh somewhere between like 16 and 17 where if you've got 16 or less it's better to ask for a new card um but there's some situations where a a dealer has some kind of um quite difficult card to play with where there's a high probability the dealer will go bust in which case you should probably just stick anyway and just wait for the dealer to go bust Okay so you kind of vary your policy according to that if you've got an ace in your hand um of course you can afford to to ask for more cards more often because that Ace can either be a one or 11 and then this is what the value function looks like so it's just a continuation of that last example to show you that you know this this process really does end up with usable information in the sense of you know you get these nice um shapes um out of these value functions that really characterize exactly the optimal behavior in this in this domain and you really end up with something which which is the optimal policy there's no way to do better than this um in this particular game that we've described this is the optimal way to behave okay why why would you ever stick at 11 um you can't go bust you wouldn't it always says to to hit on the 11 when dealer is showing between four and six so this line here says if you're on 11 you always hit this here so the line is between 11 and 12 so so it's like a um a grid and each cell of this grid is indicating what to do so this shaded this this could be shaded in if you like sh you for 11 you should always um you should always hit but if you've got 12 and the dealer has a bad card there's some chance the dealer might go bust and so you should just stick on that 12 sometimes okay so we're now going to move on to Temple difference learning methods so just like the last lecture we kind of split things between Monte Carlo learning and then we moved on to TD learning and then we saw that there was a spectrum in between these two things we're going to follow the same process now but with control methods rather than evaluation methods we're really going to try and understand how can we actually gain efficiency by bootstrapping by using the same tricks that we saw last time um and gain a lot of the efficiency um we want to basically use TD why because it can be lower variance um because it's can be uh Run online um in continuing domains even if there's no ter you can still run this it can be run from incomplete sequences and what we'll see actually in this lecture is that there's an additional benefit using TD which is when you use off policy learning it becomes particularly important to use TV um and so the natural idea well let's try the obvious thing which is use the same generalized policy iteration strategy um but we we know that to do model free policy iteration we need to use q u we need to use Q so that we don't have to do any look ahead in our um to do our greedy or Epsilon greedy policy approvement and the idea is basically to use um now TD learning to estimate Q so let's just slot in TD learning um in place of Monte Carlo policy evaluation and continue to use Epsilon greedy policy Improvement and that's immediately going to give us an algorithm um that we can apply U which is probably the best known algorithm in in reinforcement learning um and the only difference we're going to do is because we're TD is can operate at this higher time scale so TD learning is something where you don't need to wait until the end of the episode to update your value function in fact you can update your value function after just one step that's one of the advantages it's online like you see one step of data you bootstrap you update your value function immediately and so again if we follow this General principle of always using the freshest most recent value function to to um pick your actions what we're going to do is we're going to increase the frequency of our policy Improvement to be every single time step we're going to improve our policy okay so the general idea is called Sasa um so why is it called Sasa well this diagram should illustrate it we're basically starting off uh in some State action pair that's this black dot a decision node we're basically choosing an action we're going to randomly sample up policy um we're going to see a reward R we going end up in some new state S Prime and we're going to our policy again to generate a prime so there's two steps of um of sampling sorry we're not sampling our policy yet so we start off with State here um we're asking a question about this state s and this particular action a we're going to sample from the environment to see what reward we end up in and what state we end up in here um and then we sample our own policy at that next state so we end up with s r s prime a prime or Sasa algorithm okay so s Sasa basically indicates a particular update pattern that we're going to use based on Sasa here um and so what do we do we're basically going to start with our Q values we're going to move our Q value a little bit in the direction of our TD Target reward plus our discounted Q value of the next State minus the Q value of where we started so again it's like I was in some State and I'm considering taking some action and what I want to know is um if I actually take that action look at the reward I got and then the value of the next action which I would take that gives me an estimate of the value of this policy and I'm going to use that estimate to update the value of the state action pair I started in okay that's the general idea this comes from the Bellman equation for Q U we've seen this hopefully before becoming familiar with these type of ideas so that's a sassa update okay um so now we're going to take these Sasa updates and we're going to plug them in to our generalized policy iteration framework so every single time step we're going to move up and down in this diagram every single time step so each time we take a step we're going to update our value function by applying one step of Sasa so for the state and action that we were just in we're only updating the value function for that one state action pair of that time step by applying this update so I was in this state action pair I ended up in some new state action pair I'm going do One update to my Q value for that single entry of my table um and now I've already changed my value function like something's different if I end up back in that state action pair again which I might in some loopy environment I want to make sure that I use the latest most interesting um most upto-date best information um that I've got from my my policy evaluation so every single time step we're going to to improve our policy as well by acting Epsilon greedily with respect to the latest value function so again what does this mean in practice it means that you know you're just thring Q in your memory you've just got this big table of of Q values for all states and all actions and every step when you actually come to take an action you just flip a coin you look at your Q values um and if it comes up heads then you look at your Q values and pick the best one if it comes up Tales you explore randomly um so the policy is implicitly represented by your Q values and every single step we're going to we're going to evaluate the latest action I took and update my policy just for that one state action pair okay so we really got this very rapid process of evaluating basa and improving the policy okay let's try and get to know this a bit better so what would an algorithm look like so I think it's actually useful in this case to just maybe step through some pseudo code it's really straightforward um so you can arbitrarily initialize this lookup table Q is just a lookup table in subsequent next lecture actually we'll see how to generalize Beyond these naive lookup table representations but for now it's just a lookup table every state every action has its own entry in this table we initialize this arbitrarily just say zero okay and now we're just going to repeat and every single step this outter Loop is just over episodes but really you know the inner loop is just every single step we're going to um take an action observe the reward observe the next state that we end up in um we're going to choose our action using our Epsilon greedy policy um and we're just going to apply this um Sasa update to that one step so update a little bit in the direction of my TD Target the reward plus the discounted value at the next data action pair um and then repeat so the next time we come around we'll already have a different policy because we're acting Epsilon greedily with the respect to our Q values if I end up back in that same state again for example in a loop I will already behave differently using the latest and freshest information stored in que okay yeah question s Isen ACC to as well s Dash is chosen Accord by the environment s Dash is the next state um so States sorry yes a so a prime um so the question was is a prime selected according to the same policy um so so yeah so a prime is selected using the your current policy and then you basically remember that and now when you come in you remember what your last a prime was that becomes your a at the next step you just kind of um remember your last a prime and that becomes your new a how does that the that you're choosing that I mean there's no reward coming from that last right the reward the that you're using there is from yeah um so the question is you know why does this even make sense as an as an update I guess um and so let me try and give some intuition and I think the right intuition to have is comes from from the Bellman equation that really what we're trying to do this is just a sample of the bman equation this this right hand side that we're moving towards this this thing here is Target um is a sample of the bman equation we're basically saying what we want to do is we want to consider an expectation over everything which happens in the environment over one step um which basically is the reward um plus whatever state you end up in next so that gives you your expectation of the environment and then this um and now we want to know what would happen under our own policy we want to know what would happen in under our own policy after that point we want the expectation under our own policy from that state onwards and that expectation under our own policy is precisely what we're trying to evaluate with Q so that's why it has to be evaluated we have to select actions using the policy that we're trying to evaluate this is an on policy algorithm fundamentally on policy uh we're selecting actions and we're evaluating that policy so Sasa is an on policy algorithm okay so you should be wondering well does this work um and just like Glee uh Monte Carlo um this version of Sasa actually will converge to the optimal policy and the only thing we require is again Glee policy so again you could choose your Epsilon greedy with this decaying schedule that would be a valid choice just to make sure that you continue to explore everything and that you eventually end up greedy um but you also need to be careful about the step size that you use so you have to make sure that your step size if you really want convergence you need to look at stochastic approximation Theory and you need to make sure that your step sizes basically follow these two conditions um and all this tells you is that the step sizes are sufficiently large that you can move your Q value as far as you want so you might need to move your Q value from whatever your initial estimate was to some very very large or very very low value if you're wrong and this thing basically tells you that um eventually the changes to your Q values become smaller and smaller and smaller the changes to your Q values have to vanish and become zero eventually otherwise you'll still have noise and jumping around in your policy so that's just St castic approximation Theory but really when you do all that if you do all the Machinery correctly this really converges in practice we don't worry about this and we sometimes don't worry about this either and Sasa typically Works anyway that'

Original Description

#Reinforcement Learning Course by David Silver# Lecture 5: Model Free Control #Slides and more info about the course: http://goo.gl/vUiyjq
Watch on YouTube ↗ (saves to browser)
Sign in to unlock AI tutor explanation · ⚡30

Playlist

Uploads from Google DeepMind · Google DeepMind · 4 of 60

1 RL Course by David Silver - Lecture 8: Integrating Learning and Planning
RL Course by David Silver - Lecture 8: Integrating Learning and Planning
Google DeepMind
2 RL Course by David Silver - Lecture 1: Introduction to Reinforcement Learning
RL Course by David Silver - Lecture 1: Introduction to Reinforcement Learning
Google DeepMind
3 RL Course by David Silver - Lecture 2: Markov Decision Process
RL Course by David Silver - Lecture 2: Markov Decision Process
Google DeepMind
RL Course by David Silver - Lecture 5: Model Free Control
RL Course by David Silver - Lecture 5: Model Free Control
Google DeepMind
5 RL Course by David Silver - Lecture 6: Value Function Approximation
RL Course by David Silver - Lecture 6: Value Function Approximation
Google DeepMind
6 RL Course by David Silver - Lecture 4: Model-Free Prediction
RL Course by David Silver - Lecture 4: Model-Free Prediction
Google DeepMind
7 RL Course by David Silver - Lecture 3: Planning by Dynamic Programming
RL Course by David Silver - Lecture 3: Planning by Dynamic Programming
Google DeepMind
8 RL Course by David Silver - Lecture 10: Classic Games
RL Course by David Silver - Lecture 10: Classic Games
Google DeepMind
9 RL Course by David Silver - Lecture 7: Policy Gradient Methods
RL Course by David Silver - Lecture 7: Policy Gradient Methods
Google DeepMind
10 Google DeepMind: Ground-breaking AlphaGo masters the game of Go
Google DeepMind: Ground-breaking AlphaGo masters the game of Go
Google DeepMind
11 Match 1 - Google DeepMind Challenge Match: Lee Sedol vs AlphaGo
Match 1 - Google DeepMind Challenge Match: Lee Sedol vs AlphaGo
Google DeepMind
12 Match 2 - Google DeepMind Challenge Match: Lee Sedol vs AlphaGo
Match 2 - Google DeepMind Challenge Match: Lee Sedol vs AlphaGo
Google DeepMind
13 Match 1 15 min Summary - Google DeepMind Challenge Match
Match 1 15 min Summary - Google DeepMind Challenge Match
Google DeepMind
14 Match 3 - Google DeepMind Challenge Match: Lee Sedol vs AlphaGo
Match 3 - Google DeepMind Challenge Match: Lee Sedol vs AlphaGo
Google DeepMind
15 Match 2 15 Minute Summary - Google DeepMind Challenge Match 2016
Match 2 15 Minute Summary - Google DeepMind Challenge Match 2016
Google DeepMind
16 Match 3 15 Minute Summary - Google DeepMind Challenge Match 2016
Match 3 15 Minute Summary - Google DeepMind Challenge Match 2016
Google DeepMind
17 Match 4 - Google DeepMind Challenge Match: Lee Sedol vs AlphaGo
Match 4 - Google DeepMind Challenge Match: Lee Sedol vs AlphaGo
Google DeepMind
18 Match 4 15 Minute Summary - Google DeepMind Challenge Match 2016
Match 4 15 Minute Summary - Google DeepMind Challenge Match 2016
Google DeepMind
19 Match 5 - Google DeepMind Challenge Match: Lee Sedol vs AlphaGo
Match 5 - Google DeepMind Challenge Match: Lee Sedol vs AlphaGo
Google DeepMind
20 Match 5 15 Minute Summary - Google DeepMind Challenge Match 2016
Match 5 15 Minute Summary - Google DeepMind Challenge Match 2016
Google DeepMind
21 DQN SPACE INVADERS
DQN SPACE INVADERS
Google DeepMind
22 DQN Breakout
DQN Breakout
Google DeepMind
23 Asynchronous Methods for Deep Reinforcement Learning: Labyrinth
Asynchronous Methods for Deep Reinforcement Learning: Labyrinth
Google DeepMind
24 Asynchronous Methods for Deep Reinforcement Learning: MuJoCo
Asynchronous Methods for Deep Reinforcement Learning: MuJoCo
Google DeepMind
25 Asynchronous Methods for Deep Reinforcement Learning: TORCS
Asynchronous Methods for Deep Reinforcement Learning: TORCS
Google DeepMind
26 Differentiable neural computer family tree inference task
Differentiable neural computer family tree inference task
Google DeepMind
27 StarCraft II DeepMind feature layer API
StarCraft II DeepMind feature layer API
Google DeepMind
28 DeepMind Health – Partnership with the Royal Free London NHS Foundation Trust
DeepMind Health – Partnership with the Royal Free London NHS Foundation Trust
Google DeepMind
29 DeepMind Health – Michael Wise – a patient's journey
DeepMind Health – Michael Wise – a patient's journey
Google DeepMind
30 Streams – a platform for a digital NHS
Streams – a platform for a digital NHS
Google DeepMind
31 DeepMind Lab - Nav Maze Level 1
DeepMind Lab - Nav Maze Level 1
Google DeepMind
32 DeepMind Lab - Stairway to Melon Level
DeepMind Lab - Stairway to Melon Level
Google DeepMind
33 DeepMind Lab - Laser Tag Space Bounce Level (Hard)
DeepMind Lab - Laser Tag Space Bounce Level (Hard)
Google DeepMind
34 Exploring the mysteries of Go with AlphaGo and China's top players
Exploring the mysteries of Go with AlphaGo and China's top players
Google DeepMind
35 Demis Hassabis on AlphaGo: its legacy and the 'Future of Go Summit' in Wuzhen, China
Demis Hassabis on AlphaGo: its legacy and the 'Future of Go Summit' in Wuzhen, China
Google DeepMind
36 The Future of Go Summit: AlphaGo & Ke Jie match 1 moves analysis
The Future of Go Summit: AlphaGo & Ke Jie match 1 moves analysis
Google DeepMind
37 The Future of Go Summit: AlphaGo & Ke Jie match 2 moves analysis
The Future of Go Summit: AlphaGo & Ke Jie match 2 moves analysis
Google DeepMind
38 The Future of Go Summit: Pair Go moves analysis
The Future of Go Summit: Pair Go moves analysis
Google DeepMind
39 The Future of Go Summit: AlphaGo & Ke Jie match 3 moves analysis
The Future of Go Summit: AlphaGo & Ke Jie match 3 moves analysis
Google DeepMind
40 Emergence of Locomotion Behaviours in Rich Environments
Emergence of Locomotion Behaviours in Rich Environments
Google DeepMind
41 StarCraft II 'mini games' for AI research
StarCraft II 'mini games' for AI research
Google DeepMind
42 Trained and untrained agents play StarCraft II full 1vs1 game
Trained and untrained agents play StarCraft II full 1vs1 game
Google DeepMind
43 DeepMind open source PySC2 toolset for Starcraft II
DeepMind open source PySC2 toolset for Starcraft II
Google DeepMind
44 ICML 2017: Test of Time Award (Sylvain Gelly & David Silver)
ICML 2017: Test of Time Award (Sylvain Gelly & David Silver)
Google DeepMind
45 Ke Jie and DeepMind's Go Ambassador Fan Hui review the 3rd AlphaGo vs Ke Jie game
Ke Jie and DeepMind's Go Ambassador Fan Hui review the 3rd AlphaGo vs Ke Jie game
Google DeepMind
46 Ke Jie and DeepMind's Go Ambassador Fan Hui review the 1st AlphaGo vs Ke Jie game
Ke Jie and DeepMind's Go Ambassador Fan Hui review the 1st AlphaGo vs Ke Jie game
Google DeepMind
47 Ke Jie and DeepMind's Go Ambassador Fan Hui review the 2nd AlphaGo vs Ke Jie game
Ke Jie and DeepMind's Go Ambassador Fan Hui review the 2nd AlphaGo vs Ke Jie game
Google DeepMind
48 AlphaGo Zero: Discovering new knowledge
AlphaGo Zero: Discovering new knowledge
Google DeepMind
49 AlphaGo Zero: Starting from scratch
AlphaGo Zero: Starting from scratch
Google DeepMind
50 Defining principles for tech companies in the NHS: DeepMind Health's Collaborative Listening Summit
Defining principles for tech companies in the NHS: DeepMind Health's Collaborative Listening Summit
Google DeepMind
51 A systems neuroscience approach to building AGI - Demis Hassabis, Singularity Summit 2010
A systems neuroscience approach to building AGI - Demis Hassabis, Singularity Summit 2010
Google DeepMind
52 Retour de Rémi Munos en France et ouverture de DeepMind Paris
Retour de Rémi Munos en France et ouverture de DeepMind Paris
Google DeepMind
53 Grid cells - Caswell Barry, UCL
Grid cells - Caswell Barry, UCL
Google DeepMind
54 DeepMind Health Research and Moorfields Eye Hospital NHS Foundation Trust: What our research shows
DeepMind Health Research and Moorfields Eye Hospital NHS Foundation Trust: What our research shows
Google DeepMind
55 DeepMind Health Research and Moorfields Eye Hospital NHS Foundation Trust: A Patient's Story
DeepMind Health Research and Moorfields Eye Hospital NHS Foundation Trust: A Patient's Story
Google DeepMind
56 Deep Learning 3: Neural Networks Foundations
Deep Learning 3: Neural Networks Foundations
Google DeepMind
57 Deep Learning 5: Optimization for Machine Learning
Deep Learning 5: Optimization for Machine Learning
Google DeepMind
58 Deep Learning 8: Unsupervised learning and generative models
Deep Learning 8: Unsupervised learning and generative models
Google DeepMind
59 Reinforcement Learning 1: Introduction to Reinforcement Learning
Reinforcement Learning 1: Introduction to Reinforcement Learning
Google DeepMind
60 Deep Learning 2: Introduction to TensorFlow
Deep Learning 2: Introduction to TensorFlow
Google DeepMind

Related Reads

📰
Proximal Policy Optimisation — The Clip That Made Policy Gradients Reliable
Learn how Proximal Policy Optimisation (PPO) makes policy gradients reliable in reinforcement learning
Medium · Machine Learning
📰
Deep Q-Networks — When the Q-Table Won’t Fit
Learn to implement Deep Q-Networks in Python for reinforcement learning problems where the Q-table won't fit, and understand their benefits over traditional Q-learning
Medium · Python
📰
Reward hacking in Reinforcement learning
Learn to identify and fix reward hacking in Reinforcement Learning, a crucial step in ensuring reliable AI decision-making
Medium · LLM
📰
Learning by messing up: A beginner’s tour of Reinforcement Learning
Learn the basics of Reinforcement Learning, from agents and rewards to the Markov property and Gym environments, and start building your own RL projects
Medium · Deep Learning
Up next
Middle Management Meritocracy: Shockingly Naive
iBankerU
Watch →