RL Course by David Silver - Lecture 3: Planning by Dynamic Programming

Google DeepMind · Beginner ·🎮 Reinforcement Learning ·11y ago

Skills: LLM Foundations90%LLM Engineering80%Fine-tuning LLMs60%

Key Takeaways

The video lecture by David Silver covers planning by dynamic programming in the context of reinforcement learning, specifically focusing on policy evaluation, policy iteration, and value iteration. It explores how dynamic programming can be used to solve Markov Decision Processes (MDPs) and find the optimal policy.

Full Transcript

so in the last week we started to understand what it means to do um to formalize a reinforcement learning problem using a markof decision process um but we didn't really talk about how to solve those markof decision processes so today and for the rest of the course essentially we're going to start to talk about solution methods so we're going to start to look on the agent side how to build things that can actually solve these problems and one of the fundamental building blocks perhaps the most fundamental and oldest idea behind um not just reinforcement learning but control optimal control is is the idea of dynamic programming and planning by dynamic programming um so that's today's lecture we're going to start by introducing it understanding what the idea is what it means and we're going to talk about three basic fundamental paradigms to which dynamic programming can be applied first of all we're going to consider the problem policy evaluation um which is if someone gives you a policy how good is that policy if I wand around randomly how much reward am I going to collect in that mdp then we're going to see how we can use that um solution method that we've come up with to evaluate how much reward we're going to get to build our first overall solution method that can find the optimal policy for an mdp and that's policy iteration which basically takes this idea of evaluating policy as an inner loop and loops around making our policy better and better and better until we find the best possible the maximum amount of reward we can get from the MVP um and then we're going to talk about another approach uh very related called value iteration which works directly on value functions and doesn't work in space but works directly to make the value function better and better and better by applying the Bellman equation iteratively until we actually arrive at the optimal solution to the mdp um then um time alone we'll talk about some extensions to dynamic programming um which lead to um a wide variety of different ideas leading into the subsequent lectures where we'll actually pull out lots of the core ideas of policy evaluation policy iteration value iteration and we'll turn them into more General reinforcement learning methods um so what's the distinction why we talking about um uh planning today and reinforcement learning tomorrow remember these are different problem settings we'll discuss that in just a second and finally this may just be something for the notes depending on how much time we have um but I'd like to at least give you some intuition as to why these methods work that just the math behind that that's the the only kind of formal section really it's just make you understand why it is that these things find a solution and that that solution is unique and and the right thing to do in the mdp okay so let's start with a brief introduction um so first of all what is dynamic programming you know why is it even called dynamic programming um so well first of all we have the word dynamic which basically means that we're considering problems that have some kind of sequential or temporal aspect to the problem so problems which have um step by step something changes and we're trying to solve this problem um in in some stepwise fashion and secondly the word programming we don't mean programming like writing a c program or JavaScript or something we're talking about mathematical programming um in the same sense of um linear programming or quadratic programming um and that really means optimizing a program where mathematicians by a program what they mean is essentially what we would refer to as a policy right you start off with some mapping um from states to actions in our case and you optimize that program for Behavior so dynamic programming puts these two ideas together so it's an optimization method for sequential problems I'm not sure why the color is going in a projector I'll just add a bit of kind of psychedelic flavor to the everything we talk about okay um so at its heart dnamic programming lets us solve complex Problems by doing two things it basically breaks them down into sub problems and then it puts those sub problems back together to solve the overall solution okay so any problem that we can break down into sub problems and then solve for each of those sub problems put them back together is the situation where dynamic programming can Excel specific Bally dynamic programming needs to have two properties in order to be applicable this is not about RL yet this is just the general idea of dynamic programming so you need to have something called optimal substructure which basically means that this thing called the principle of optimality applies which we'll talk about later but optimal substructure basically tells you that you can solve some overall Problem by breaking it down into two pieces or more than two or more pieces solving for each of those pieces and that the optimal solution to those pieces tells you how to get the optimal solution to your overall problem okay so if that didn't apply it would be useless to first of all solve the sub problems because by solving the sub problems you haven't made progress towards your overall problem so the canonical example is the shortest path the shortest path from here to that wall I can break it down um into some midpoint and I can say the shortest path to the to the wall is the shortest path to that midpoint um combined with the shortest path from the midpoint to the wall okay that's an example of um optimal substructure with a principle of optimality and so this basically tells us that we can take our optimal solution and we can decompose it we can start working with these little fragments solve for these smaller problems first um divide and conquer we can break things down into simpler pieces solve for those simpler pieces and then put them back together again the second property we require is called overlapping sub problems so what does this mean well this means that the subs which occur occur again and again and again like that actually we've got something to gain by breaking it down into sub problems that that by solving that sub problem we actually can solve things more efficiently than than we could have by just attacking the overall problem directly and basically all we require is that the sub problems kind of Rec recur many times that in other words if I break down the problem of getting to that wall into first of all going to the midpoint and then go to the wall that also helps me solve now I've solved the sub the sub problem of the shortest path from here to the wall that also helps me solve the the sub problem of how to get from here to the wall because I can go to that same midpoint and then still get to the wall in the most of way so the the same sub problem has occurred for many higher level problems that I want to solve so those sub problems occur many times and that means we can cach our solutions to those sub problems and reuse them and get an overall efficient divide and conquer approach so how does this apply to us well we're talking about Markov decision processes today and we're going to try and solve them using dynamic programming um and luckily mdp satisfied both of these properties and the special properties which mdps have are basically the Bellman equation the Bellman equation that we saw last week is the recursive decomposition that gives us these properties that we're after so the Bellman equation tell us last week if you recall how to break down the solute the optimal value function into two pieces the optimal behavior for one step followed by the optimal Behavior after that step that's what the Bellman equation was telling us that you can understand the overall value the optimal value from any state in terms of doing the best thing for one step and then the best thing for the remainder of your trajectory okay so that's what the Bellman equation gives us this recursive decomposition and the thing which gives us overlaping sub problems this caching and reuse is the value function the value function you can think of a value function as something which is like a cache of all the good information we figured out about the mdp the value function is something that says for any state I've already figured out what the solution is from that state onwards the value function tells you the optimal way to behave the optimal value the maximum reward you can get from that state onwards once you've got that you can kind of work your way backwards it's like saying I you know once I've already figured out the the optimal path from here to the wall I could just store that thing call that VFS or this St s and I can remember that and now if I ever want to work out anything which relies on that VVS I've already cached that information I don't need to recompute it I don't need to do like a full Tre search into the future over all possible paths that information is stored it's cached in the value function and this is why value functions are such a central and important idea for reinforcement learning okay so that's dynamic programming in in a nutshell so this is a very very kind of you know Thousand Mile overview of dynamic programming no details yet um so what we're going to do specifically is we're going to try a plan by dyam dynamic programming so if you remember the definitions of planning um planning is this problem it's not the full reinforcement learning problem it's a different problem where someone tells us the structure of the mdp someone tells us the Dynamics and the reward for that mdp and now we want to solve that mdp given full perfect knowledge of how the environment works it's not the full reinforcement learning problem so we're not going to address that today we're just going to address the planning problem so someone tells us uh the transition structure someone tells us the reward structure now how do we how do we figure out the answer to this mdp how do we solve it um and we can do we can solve two special cases of planning an mdp we can solve and again this came up in the introduction this idea of prediction and control so first of all we can we can plan to to solve the prediction problem so what does that that mean well it means that someone tells us an mdp so someone tells us that remember an mdp is like a state space an action space our transition Dynamics our reward function and a discount factor it's just our standard mdp so someone tells us the mdp but they also give us a policy they say you know for example if you're going to wander around randomly in this world then what I want to know is how much reward will you get in that world and this could also be a remark of reward process but the output of this process so what we're asking for is some black box that is this planning thing which you give it the mdp and a you give it a policy and it tells you a value function the output is a value function of the plan and the value function tells you how much reward you're going to get from any state in this mdp and so that's what we want to do for the prediction case that's the first form of planning that we might consider so this is policy evaluation figuring out how good a policy is how much reward are we going to get when we wander around in this world the second type of planning we're going to consider is control this is the full optimization when we're really trying to figure out the best thing to do in the mdp but again we're just considering the the planning case someone tells us the mdp this thing is known it's like we we've given the if you remember the Atari example it's like someone gives us all the details of the Atari game that we're going to play we we're given access to the internals of how how the next dat is produced um and we're going to try and do control so someone gives us an mdp but now instead of giving us a policy what they want to know um is the best policy they want to know among all different policies what's the most reward that can be achieved in this mdp so informally this is what we think of when we think about solving an mdp we think about this guy we want to kind of know what's the best thing that you can do in this mdp what's the most reward that can be achieved what's the best possible policy the best mapping from states to actions that is going to achieve the most long-term reward in this mdp um and so that's the problem we're going to address later on so we'll start with prediction figuring out how to work out how much reward you get under a given policy and then we'll use that as an in the loop to do control okay sorry and I should have said in addition to outputting the optimal value function when we do control we also the thing we ultimately care about is a policy we want to know how to behave what's the action that will give us the most reward in every situation every state tell me the best action that you can take and that is the optimal policy um and this implies this as we saw in the last lecture so we do control we pass in an MVP we get back optimal value function or equivalently the optimal policy okay so dynamic programming is very widely used there's lots of places it gets used so I just want to you know not leave you with this sense that dynamic programming is only used for planning and mdps um it's used in scheduling algorithms it's used in string algorithms for um like sequence alignment problems uh very popular in bioinformatics um it's used for graph algorithms like if you think of shortest path algorithms these are essentially dynamic programming algorithms um if you think about graphical models in other machine learning we've been doing supervised learning the vur algorithm is an example of of dynamic programming um latis models um bi informatics again anything where you've got some kind of structure like a lce or a string or a shortest path where there's this optimal substructure and the principle of optimality and then you can put these things together efficiently solve the sub problems figure out something about the overall problem okay but we're not going to talk about these today we're going to focus on mdps just want you to be aware that that's not what everyone means when they talk about dynamic programming okay so let's start with the first key component which is trying to evaluate a given policy so this is that case someone tells you the mdp they tell you the policy you know I'm going to WAND around this policy I'm always going to go straight ahead how much reward am I going to get um and then we'll move on and we'll see how that helps us make progress towards the problem we really care about which is maximizing reward okay so here we are we want to evaluate this policy and how are we going to do it we're going to use the Bellman equation that we saw um last week um so in particular if you recall last week we we had different flavors of bellman equation we had the Bellman expectation equation and we had the bman optimality equation and what we're going to see is that we use the bman expectation equation to do policy evaluation and we use the Bellman optimality equation later um to do cont control so each one is applied to solve a different type of problem what we're going to do is take that bman equation and turn it into an iterative update and that's going to give us our mechanism our first mechanism not the only mechanism we'll see our first mechanism for evaluating policies so what we're going to do is essentially we're going to start off with some arbitrary initial value function B1 um so B1 you can think of it's just like a a vector of um so this thing is basically telling us um what's the value of all states in the mdp we're going to start off with some arbitrary initial value so zero would be the canonical case we start off thinking I can't get any reward anywhere and then what we're going to do is we're going to plug in our one step of our Bellman equation we're going to do that one step look ahead using our Bellman equation and we're going to figure out um the value um a new value function which we're going to call V2 and we're going to iterate this process many times and at the end of this what we're going to see is the thing which we end up with is actually the True Value function there's a few more Spaces by the way just um scast around if it doesn't look so great over there and the way we're going to do this is using what we call synchronous backups we'll see the alternative later on um when we get to later slides um so what do we mean by synchronous backups what we mean is that every single iteration what we're going to do is um we're going to every iteration we're going to consider all states in our mdp so we're going to sweep over all states in one iteration so what we're going to do is we're going to start off with our previous value function and we're going to apply our iterative update to every single state in this value function to produce a complete new value function for all states and then we're going to look at every state in this value function we're going to apply a complete new update to give us a new value function that's called synchronous evaluation we consider all states every step and we'll consider a in backups later which can be more effective this is the simplest case okay and don't worry about convergence yet we'll see later on I'm going to defer the question of convergence so you just have to trust me for now um that this process that we'll see really does converge to the true uh value uh for each of the states so let's understand exactly what we're doing what actually is the update so what we're going to do is we're going to take our um bman equation that we had before so if you remember this was our look ahead tree that kind of showed us how to comp compute the value function um this was the bman expectation backup soal expectation because we've got an expectation over our actions rather than a Max over actions that gave us the max gave us the optimality equation the expectation gives us the B expectation equation and this was the vector form of the same thing but just intuitively I like to think of it as just a one step look ahead we look ahead the value of the root and this is just the Bellman equation for now the value of the root is given by a one-step look ahead where we consider all the actions we we might take this is the things I might do as an agent and then all the places the wind might blow me um and then we look at the value at these successor states that we might end up in we back that all up we sum it all together weighted by the probabilities of each leaf and that gives us our our root value the value function at the root so all we're going to do with dynamic programming is we're going to take this equality so the Bellman equation told us this thing must be true for the True Value function V Pi that tells us that this this must be equal to this and now all we're going to do um for dynamic programming is we're going to turn this equation into an iterative update in other words we're going to define the value function at the next iteration by plugging in the previous iterations values at the leaves so basically take our value function from the last iteration BK we plug in those values that we had at the leaves we back up those values to compute one single new value for the next iteration of the root so we figure out VK plus1 s by doing a one-step look ahead looking up our values here from our last value function and now we fill in got one state we're going do this for every single state every single state in our suite our synchronous evaluation of dynamic programming every single state gets a turn at being the root and so they all get one of these updates we do this for every single state in the mdp um and we figure out the new value of that state at the iteration and that gives us a complete new value function okay that's the idea of um iterative policy evaluation and then what we do is we iterate that process we iterate that process we take the value function we've just computed our vk+ one that we just computed and now that will take a turn at being the leaf we plug that in at the leaves we back that up and we would get VK plus2 at the at the root we do that for every state we iterate and this process is guaranteed to converge on the True Value function Yeah question Happ that there won't be there won't exist Vector of that satisfy the equation another vector v no no no that you know there won't be any Vector that would satisfy the equation is is the question how do we know that there is the existence of of a v that satisfies this equation yeah I'll come back to that at the end of the talk if you don't mind um so there's a a formal proof that I'll at least try to illustrate um at the end of the of the class so for now there always is so it comes from the uniqueness property of the solution of the mdp which we touched on last time there was a theorem in the last lecture that said for any mdp we were talking about V Star there the same is true for V Pi but there's there's um there is a unique value function that satisfies this for any mdp and we'll see why later it's essentially contraction mapping theorem it's the easiest way to see it okay is that clear for now let's do an example I think that will help right right okay so we'll consider this um very simple example okay so we're going to consider this this is a small grid World um so what's this grid World about so we've basically just got this little 4x4 grid um it's got two shaded states in the corner here these are terminal States if you end up in this state end of episode um no more reward or equivalent you can think of this you just end up going round and round and round getting zero rewards forever same thing okay um now what happens elsewhere is that you've got four actions that you can choose Northeast south or west four Compass directions um if you take an action that would take you off the grid it just has no effect and you remain there okay that's the definition of these grid well examples we'll see more later on as well and we're going to get a reward of minus one per step regardless of which action you take so if you go northeast south or west you're going to get minus one regardless of whether it'll take you off or not you're just going to get minus one per step um and the only thing which differentiates how much reward you get is when eventually you get the terminal State um you stop getting these negative rewards the episode's o over so um you can think of this as telling you how long it takes you to reach one of these gray States that's what we're trying to figure out that's the the length of a trajectory one sample of this of the mark of chain corresponding to whatever policy we choose will tell us you know it'll have some you could do some random walk around here eventually it'll end up here and you'll get minus one per step and that so the sum of all of those rewards tells you how long the path was that took you to terminal State and what we want to know is on average how how many steps until we hit one of these two gr terminal States and what we're going to do is we can consider the simplest policy we can think of which is just uniform random so probability of all four Compass directions is just a quarter um and we want to know well how long does it take I mean it's not you know if you look at this it's not obvious just at a glance how many steps it will take you to find one of these great um uh States on average okay so we want to ask that question we want to be able to solve problems like this what's the if you if you behave randomly if you just do this random walk around here how many how many steps will it take in expectation until I hit one of those gray things okay we're going to use dynamic programming to solve this so it looks something like this so just for now concentrate on the left column I'll explain the right column in a minute okay in the left column we've got our value function so we've got a a state value function V which is telling us um the value of every single state so this is our estimate of how much of how many steps until we hit the gray State until we hit a terminal State and we're just going to start off like I said with the most naive estimate we could think of which is plugging in zero everywhere so this is our initial estimate um so this is our v0 if you like I think I called it V1 earlier um and now what we want to do is we want to apply um one step of iterative policy valuation so we're going to apply the Bellman expectation equation as a backup we're going to apply it to all states and we end up with a new value at each of these states so how do we do that well what we do is we look at each state in turn so if we just pick one of these states like here um so all we're doing is we're saying well whichever direction we go in we're going to get minus one reward because we've taken a step um and then we're going to look at the value according to our old estimate our previous estimate which said we're going to get zero everywhere so we're going to basically say I'm going to get a minus one step and then have a reward of zero so we're going to plug in a new value here of minus one and the same logic is going to apply everywhere except in the terminal states where in the terminal States it's already terminal so if we take one step ahead we don't get any more negative rewards it's like you just stay there and get z0 z z so in the terminal rewards we we just have zero at the next step regardless so this is like a naive one-step look ahead to say you know whichever step I take it's going to get minus one unless I'm already at the goal in which case I'm not I know it zero step until they gone now we're going to now we have a new value function we're going to iterate we're going to apply the same process again so we're going to iterate to consider some new States so again let's consider that same state again down here um so for this state we're now going to do again a one step look ahead so one step look ahead says well I could go north east south or west I'm going to consider each of those with probability .25 and in each case I'm going to have an immediate cost of minus one plus the value where I end end up so I think at this point I think the value of going north is minus one minus another one from my estimated value of where I ended up okay so from this state I think it's minus two because I'm going to average over this minus two this minus two this minus two and this minus two so I think the value of being here at this point is minus two however if I consider say this state here now I could go north east or south and for each of those cases I'm going to estimate the same thing if I go east I'm going to see minus one minus another one for where I end up so minus one for the step and then minus one for my previous estimate the value um if I go north remember that I end up just sticking where I was so same thing minus one for taking that step and then I estimate the value of where I ended up to be minus one so minus two again but if I go west now I think there's going to be an immediate cost of minus one and then my estimated cost of the goal is zero so I've got this should be- 1.75 I think it just doesn't fit um so I've got .25 probability of of um just getting minus one reward for the immediate cost and I've got 75 probability of of getting minus two okay um so we iterate this process if we keep iterating so this is after three steps this is after 10 steps um once we iterate to convergence we see that this thing stabilizes eventually on um the True Value function so this is actually the True Value function to um zero decimal places um and this basically tells us you know if I'm taking this random walk how many steps is it going to take on average until I actually hit one of these gr States and it's around 20 for some of these cases um so is that clear this process of iterative evaluation so now if that's clear we can talk about the right hand column so the right hand column is basically telling us it's just F information now it's telling us that not only has this been useful because it's been an iterative process that kind of tells us how much reward we're going to get when we wander around randomly and we've sort of figured that out by itative process that converges on the on the True Value function it also tells us how to make a better policy so what we can do is we can we can sort of start off by looking at this and we can ask the following question which is what if I was to pick actions not randomly now but according to the values in this Grid in other words if I acted greedily according to the values in this grid what decision would I make so to begin with I would look and I would say well you know I think everything is worth zero so I'm indifferent to which action I take so these arrows here where there's multiple arrows it kind of indicates that there's ties we then just behave randomly according to um to break ties so we start off with a random policy but after one iteration we're already doing something non-random which is that say if we're in this state here for example we can see that going north east or South has a a value of minus one um but going west has a value of zero um so we prefer to go west here and so we can start to see that we start to fill in more reasonable values um if we were to behave greedily with the respect to this guy um we can see that we start to do a little bit better now two steps ahead we've got this Behavior we know to either go north or west and so forth and after just uh after just three iterations before we've even iterated this all the way to convergence we already have the optimal policy when we behave optimally with respect to this value function so hold that thought this intuition is going to be useful later that you don't actually have to iterate this thing all the way to convergence um but the main thing I just want you to have in your head for now is that the value function helps us figure out better policies but even though we were just evaluating one policy we can use that to build a new policy by acting greedily is that clear yeah something that is not intuitive for me is like it somehow should U simulate the distance from the terminal points but we have like this we have the value function 20 we have like you know for different distances right you see like we have like two steps from the and freeps the same value function yes this one and this one this one and this one yes for example it's just you can't see what's beyond the decimal point oh okay sorry it's just these are truncated just to fit on um so this this will have a a worse value than this um it's just we can't see it to Precision Yeah question I don't quite understand that the policy improves but the numbers are still computed based on taking a random it's not fed back into the calculation no that's coming next yeah so for now we're just evaluating the random policy and all I want to show you is that that this that any value function can be used to compute a better value function okay so that was iterative policy evaluation so so far um if we ignore that right hand column really what we've been focusing on is this problem of how to figure out the value of my current policy and now we've got a mechanism for doing so a first mechanism for doing so which is to iterate our Bellman equation again and again and again U feed it back into itself feed the value back into itself self do these one-step look ahads and figure out what the new value is in every state and that's been a process that helps us figure out exactly how much reward we're going to get okay so um now that question was great because it leads me right into the next section which is policy iteration so now we're going to talk about the next problem which is here we were just trying to evaluate a fixed policy for example the uniform random policy and now we want to actually find the best possible policy in the mdp and the way we're going to do that is by a feedback process just as was suggested then so so let's start with um you know this basic piece of intuition which is you know we want to make our policies better how do we make a policy better so if you someone gives you a policy Pi how can you give back some new policy that you can say for certain will be better than the one that you had before and if we had such a process we could just apply that again and again and again and we would end up eventually finding the optimal policy that's the intuition okay so just first of all an intuitive level um and then we'll do it slightly more formally in a minute um what we're going to do is break this down into two steps uh first of all what we're going to do is evaluate that policy we're going to figure out a value function for that policy so if someone gave me a policy all I need to do is compute the value function like we just did in the previous example in the grid World um which tells us how much reward we're going to get in expectation like what's the average path length until I hit that Gray State um from any start state so it's just filling out those numbers like we just did and then we improve the policy we improve it just like we did on the right hand column by acting greedy with respect to our value function so we just need these two steps so you start with a policy you evaluate that policy and then you act greedy with respect to U the numbers that you filled in you act greedy with the respect to your value function um and we'll Define precisely what we mean by greedy but essentially we mean precisely what we saw in the last example um where you kind of just look ahead you look ahead in each direction and you pick the one which is the best so you consider each action and you look at the value of where it takes you and you pick the one which is best that's what it means to act greedily so so this policy will convert to the optimal policy or just just something it's coming it's coming just be patient okay so in the small grid World um we just did this once okay we just started with the uniform random policy we have evaluated that policy once that was the left hand column and we improved the policy once that was the right hand column we looked at what would happen if we took that value function we ended up with and then acted greedly with the respect to act and in the small grid world that was enough so if you did that once if you went all the way to the bottom of the left column and then acted greedily that already gave us the optimal policy that isn't normally the case in more interesting problems but what is the case is that you can go round and round this process iteratively you can basically start with policy you can evaluate that policy improve it evaluate your new policy improve it evaluate that policy improve it and eventually uh what we'll see is that this process always converges to the optimal policy okay we'll see why in a second but first I just want to give a bit more intuition yeah questions yes remember sorry just to explain why so recall from the last lecture that there is always at least one deterministic optimal policy for any mdp so it's sufficient to actually search over deterministic policies only when you're looking for the optimal policy what I mean is when you when you do this evaluation improvement process when you do the improvements your policy then is is greedy on the evaluation sois yes no no it'll be a different deterministic policy at the next step until you Converge on the an optimal deterministic policy but you're right that that after your initial policy every other step will be deterministic um there is a relationship between policy iteration and expectation maximization um that's a little bit subtle and has been explored in recent papers it's not an explicit equivalence it's um but there's a relationship so it's um yeah okay so just in one picture this is the idea that we're going to be using again and again not just today um but in many subsequent classes as well we're going to be using this idea repeatedly so you start off with some inputs to to this process the inputs are some arbitrary value function like this all zeros and and some policy and what we're going to do um is we're basically going to start off by we're going to do policy evaluation on this on our up arrows and we're going to do um policy approvement on our down arrows and so what this diagram illustrates we start off with VM pi and what we do is we set our value function to evaluate exactly this initial policy on this first aror once we've got that policy we act greedy with respect to that policy to give us um a new policy we we really respected that value function to give us a new policy once we've got that new policy we evaluate the new policy that gives us a value function once we have that new value function we act Greely with respect to it once we have that new policy we evaluate it and so on and what we'll see is that this is guaranteed to converge to both the optimal polic the optimal value function and therefore the optimal policy so we have this cycle of evaluation and Improvement and what we're going to see over the coming lectures is how we can sort of substitute in different components here we'll keep keep to roughly this picture but in sort of more and more sophisticated ways um but in this case what we can say is we've seen at least one way to do policy evaluation where we estimate V Pi here these up arrows are done using iterative policy evaluation that was this bman expectation equation that was iterated to solve uh um figure out the value of the policy that was the first section of this class and now we've seen at least one way we can do policy Improvement which is by acting greedily with respect to our value function and so we can just iterate this process and we have a solution method for an mdp um yeah question no matter where you start for any value function any policy you will always end up with the optimal value function the optimal policy fundamental theorem we'll see why shortly okay let's give an example first just get more intuition so okay so we're going to have I think it's ful just to see that that some of this stuff can be applied somehow in the real world so this is a toy example but it's sort of indicative of something that might that's people might care about in the real world so it's a problem from s called Jack's Car Rental um and basically what's going on is um there's this car rental place it's got two different locations and cars are rented from each of these two locations um and there's a maximum of 20 cars in each location um and what happens is there's these kind of random request return process where where the um people are going to come in and and randomly come in and rent cars from either one of these locations and randomly they're going to return cars to these locations but the rate at which they rent uh at which they rent and return the cars is different in the two locations and so what you get to do is every night you get to move some cars from one location to the other and you need to do this in the optimal way so that um when people actually try to rent a car you want to make sure that cars are available to be rented you basically don't want to end up with an empty car park when some customer comes in and tries to rent a car because you lose money and the question is um what's the optimal policy for for shifting cars around so that you maximize your income so it's a relatively kind of real world scenario um we can formula uh formalize this as a mdp very simply by just considering the states to be how many cars we have in the two locations the actions are how many cars from minus 5 to plus five we move from location a to location B every night reward $10 per car that's rented transitions of the actual uh return and and and request process um and I don't want to get into the details of it there's some Plus on distribution determining this random process in the example U but roughly speaking the main point is that one location gets the same number of requests and returns like the return rate is the same as the as the request rate it's a random process but with the same rate but the other one gets more requests than it does returns so the second location is kind of always going to be short of cars if you don't shift things over um not always it's typically going to be short of cars because it gets more requests for cars than than people are actually returning them okay so what we're going to do is just see what policy iteration looks like so we're going to apply this idea of policy iteration this is all just I know I haven't been concrete yet about about the algorithm I just want to give you a flavor of what this algorithm looks like first and all and so what we've got here is we're just looking at the policies to begin with these five squares here represent the policies that we're considering um so the axes are um the cars at the first location on the y- axis and the number of cars at the second location on the x- axis so this is the states that we're at in our MVP you know whether we've got like 10 cars at location one and zero cars at location two that would be here for example and the question is what do you do um in that location and what we see from these diagrams is you kind of look here and this would tell you you know if you're in that situation you should move four cars from um A to B okay so that's what these diagrams are indicating they're indicating the policy telling us at any stage of our algorithm what we believe the best thing to do is so we start off with just um arbitrary policy in this case we're just going to do nothing as our initial policy we're going to seed our policy iteration process by first of all evaluating that policy and what we do is each time we evaluate a policy we build up a value function which looks something like this okay we build up a value function saying you know if I take um the actions that are recommended in this case do nothing I'll end up with some surface that tells me how much money I'm going to make doing nothing and then what we do is we act greedily with respect to that surface we basically pick the action which according to my current value function um looks looks best I do like a one-step look ahead take the action which looks best according to my current value function and that gives me a new policy and we see that immediately after one iteration we gain some information we see that you know on the whole we think that we should be shifting um cars over around five from location A to B um quite often and sometimes we should be shifting cars back again minus one- two- three- four but most of the time we should be shifting to the location with less cars we've already figured that out the one that gets less cars um um returned but we have assign probabilities how many car like how many cars are rented and returned right I didn't understand the question sorry uh so like to estimate the next step we need to know how many cars say what's manys were um we don't need to know exactly how many so so the question is do we need to know how many cars were rented or return just probability on this so we need to know the probabilities yeah so so we need so so we're doing so remember we're just doing planning we're just doing planning so the Dynamics of the system are known in this case the Dynamics of the system are the return rates and the and the request rates distri given by Pro on distribution so we know those we can use those that's our P SS Prime so we know the probability that we're going to have a a customer come in and request a car and we know the probability um that they're going to come and return a car to each of the locations so those things are known and given that information we're trying to figure out what's the best behavior strategy like what should I be doing shifting so it's like typically an industry will help people use this if someone comes up with a model of their system that model might not be correct but then they use dynamic programming to solve for their model so you can think of this as like some simplified model of the car rental um the p on process might not be accurate but there's some way in which the Dynamics are described okay and then what we do is we take our new policy so this was just like in the small grid world we just did one step and then in the small grid world that was already optimal this is a more interesting problem this is not the optimal policy so what we do is we evaluate this policy now we come up with a new Surface a new value function that looks something like this and we act um greedily with respect to that new value function so now again we do a one-step look ahead we say you know given that we think um customers are coming in at particular rate we do this look ahead we figure out which action is going to be the best we evaluate you know how much um money we think we're going to make by using this value function and doing a onstep look ahead on on our current estimate and that gives us a new policy then we evaluate this policy we come up with a new Surface new policy new Surface new policy and after one two three so I guess these five policies on our fifth policy or the fourth iteration of the algorithm we already have the optimal policy this has converged now um I think there's still a small change from here to here um but after this it doesn't change anymore the the problem stabilizes it finds the optimal solution and so I've already had quite a few questions where people ask well why why why is this actually stabilizing why does this work so I'll talk about that in the next two slides um but before we do that is people clear on the intuitive way this algorithm's working um like what what's going on this is a great time to ask questions don't just kind of feel oh what's going on I don't understand how can I read the policy off that grid okay how can you read the policy Off the Grid so um you read off the state so the state is given by any point here so if we consider a point here where we say got um you know 17 cars at this location and 10 cars at this location um we're just looking at this number here and it's one so that's telling us that we should move one car from location a to location B so these are like it's like a contour map this this the contour map of the policy sorry if that wasn't clear um so we're reading off for each region what the policy is within that contour map um what's on the vertical axis there 612 and 420 okay good question so what's on the vertical axis so this is the surface telling you um the dollars return that you're going to get like how much money are you going to actually make like this is the question these guys want to know it's like you know ultimately dollar value how much money am I going to make following this policy um and what this tells you is um if you're in any situation so if you're in a situation where you've got no cars in either location uh you you're still going to get money because eventually over time you will start getting cars coming into these locations um but it's less money than if you've got 20 cars at each location so you should expect some kind of slope upwards and what this is telling you is that the dollar total dollar lifetime Revenue that you're going to get under some discount factor which was probably on the last slide telling you how much you prefer to get dollars now than dollars tomorrow or the next day or the next day so this is your income this is the thing we care about so this is really intuitively we're building our policy by looking at the right thing we're looking at how much dollar income am I going to get by following one of these policies and then choose the the actions that maximize dollar income the total amount of lifetime value you're going to get by following this this whole policy there's no cost in moving a car from um there's no cost in moving a car in this example yeah question was how sensitive is that the convergence rate to Pi z um and we'll see later at least for some of the algorithms the the convergence rate is independent of Pi z um um but in practice that doesn't answer the question which is um that you might just your Pi Z your your initial value may be way off and your first policy may be way off and so clearly if you start with the optimal policy you'll get to the optimal policy faster um so so it does matter but the convergence rate is is independent of the initial value and policy okay so let's just do this a little bit more formally so people understand get more comfort with this idea that this really does work okay so let's let's say what it means a bit more precisely to do a policy Improvement so we're going to start off with some deterministic policy for now so as someone pointed out in the audience um we um after one step of acting greedily we we we're stuck with deterministic policies anyway so we may as well just consider um when we want to do our our step of this algorithm and understand convergence let's just imagine we started with a deterministic policy because after just one iteration we're going to have a deterministic policy anyway acting greedily always makes you deterministic um and so so how do we come up with our new policy and the way we come up with a new policy is by acting greedily well now let's define what that means to act greedily so acting greedily means that we basically we look at the value of being in a state and taking a particular action and then following your after that that's the action value that's precisely the action value so what we want to do is pick actions in a way that gives us the maximum action

Original Description

#Reinforcement Learning Course by David Silver# Lecture 3: Planning by Dynamic Programming #Slides and more info about the course: http://goo.gl/vUiyjq

Watch on YouTube ↗ (saves to browser)

Sign in to unlock AI tutor explanation · ⚡30

Playlist

Uploads from Google DeepMind · Google DeepMind · 7 of 60

← Previous Next →

RL Course by David Silver - Lecture 8: Integrating Learning and Planning

RL Course by David Silver - Lecture 8: Integrating Learning and Planning

Google DeepMind

RL Course by David Silver - Lecture 1: Introduction to Reinforcement Learning

RL Course by David Silver - Lecture 1: Introduction to Reinforcement Learning

Google DeepMind

RL Course by David Silver - Lecture 2: Markov Decision Process

RL Course by David Silver - Lecture 2: Markov Decision Process

Google DeepMind

RL Course by David Silver - Lecture 5: Model Free Control

RL Course by David Silver - Lecture 5: Model Free Control

Google DeepMind

RL Course by David Silver - Lecture 6: Value Function Approximation

RL Course by David Silver - Lecture 6: Value Function Approximation

Google DeepMind

RL Course by David Silver - Lecture 4: Model-Free Prediction

RL Course by David Silver - Lecture 4: Model-Free Prediction

Google DeepMind

RL Course by David Silver - Lecture 3: Planning by Dynamic Programming

RL Course by David Silver - Lecture 3: Planning by Dynamic Programming

Google DeepMind

RL Course by David Silver - Lecture 10: Classic Games

RL Course by David Silver - Lecture 10: Classic Games

Google DeepMind

RL Course by David Silver - Lecture 7: Policy Gradient Methods

RL Course by David Silver - Lecture 7: Policy Gradient Methods

Google DeepMind

Google DeepMind: Ground-breaking AlphaGo masters the game of Go

Google DeepMind: Ground-breaking AlphaGo masters the game of Go

Google DeepMind

Match 1 - Google DeepMind Challenge Match: Lee Sedol vs AlphaGo

Match 1 - Google DeepMind Challenge Match: Lee Sedol vs AlphaGo

Google DeepMind

Match 2 - Google DeepMind Challenge Match: Lee Sedol vs AlphaGo

Match 2 - Google DeepMind Challenge Match: Lee Sedol vs AlphaGo

Google DeepMind

Match 1 15 min Summary - Google DeepMind Challenge Match

Match 1 15 min Summary - Google DeepMind Challenge Match

Google DeepMind

Match 3 - Google DeepMind Challenge Match: Lee Sedol vs AlphaGo

Match 3 - Google DeepMind Challenge Match: Lee Sedol vs AlphaGo

Google DeepMind

Match 2 15 Minute Summary - Google DeepMind Challenge Match 2016

Match 2 15 Minute Summary - Google DeepMind Challenge Match 2016

Google DeepMind

Match 3 15 Minute Summary - Google DeepMind Challenge Match 2016

Match 3 15 Minute Summary - Google DeepMind Challenge Match 2016

Google DeepMind

Match 4 - Google DeepMind Challenge Match: Lee Sedol vs AlphaGo

Match 4 - Google DeepMind Challenge Match: Lee Sedol vs AlphaGo

Google DeepMind

Match 4 15 Minute Summary - Google DeepMind Challenge Match 2016

Match 4 15 Minute Summary - Google DeepMind Challenge Match 2016

Google DeepMind

Match 5 - Google DeepMind Challenge Match: Lee Sedol vs AlphaGo

Match 5 - Google DeepMind Challenge Match: Lee Sedol vs AlphaGo

Google DeepMind

Match 5 15 Minute Summary - Google DeepMind Challenge Match 2016

Match 5 15 Minute Summary - Google DeepMind Challenge Match 2016

Google DeepMind

DQN SPACE INVADERS

DQN SPACE INVADERS

Google DeepMind

Google DeepMind

Asynchronous Methods for Deep Reinforcement Learning: Labyrinth

Asynchronous Methods for Deep Reinforcement Learning: Labyrinth

Google DeepMind

Asynchronous Methods for Deep Reinforcement Learning: MuJoCo

Asynchronous Methods for Deep Reinforcement Learning: MuJoCo

Google DeepMind

Asynchronous Methods for Deep Reinforcement Learning: TORCS

Asynchronous Methods for Deep Reinforcement Learning: TORCS

Google DeepMind

Differentiable neural computer family tree inference task

Differentiable neural computer family tree inference task

Google DeepMind

StarCraft II DeepMind feature layer API

StarCraft II DeepMind feature layer API

Google DeepMind

DeepMind Health – Partnership with the Royal Free London NHS Foundation Trust

DeepMind Health – Partnership with the Royal Free London NHS Foundation Trust

Google DeepMind

DeepMind Health – Michael Wise – a patient's journey

DeepMind Health – Michael Wise – a patient's journey

Google DeepMind

Streams – a platform for a digital NHS

Streams – a platform for a digital NHS

Google DeepMind

DeepMind Lab - Nav Maze Level 1

DeepMind Lab - Nav Maze Level 1

Google DeepMind

DeepMind Lab - Stairway to Melon Level

DeepMind Lab - Stairway to Melon Level

Google DeepMind

DeepMind Lab - Laser Tag Space Bounce Level (Hard)

DeepMind Lab - Laser Tag Space Bounce Level (Hard)

Google DeepMind

Exploring the mysteries of Go with AlphaGo and China's top players

Exploring the mysteries of Go with AlphaGo and China's top players

Google DeepMind

Demis Hassabis on AlphaGo: its legacy and the 'Future of Go Summit' in Wuzhen, China

Demis Hassabis on AlphaGo: its legacy and the 'Future of Go Summit' in Wuzhen, China

Google DeepMind

The Future of Go Summit: AlphaGo & Ke Jie match 1 moves analysis

The Future of Go Summit: AlphaGo & Ke Jie match 1 moves analysis

Google DeepMind

The Future of Go Summit: AlphaGo & Ke Jie match 2 moves analysis

The Future of Go Summit: AlphaGo & Ke Jie match 2 moves analysis

Google DeepMind

The Future of Go Summit: Pair Go moves analysis

The Future of Go Summit: Pair Go moves analysis

Google DeepMind

The Future of Go Summit: AlphaGo & Ke Jie match 3 moves analysis

The Future of Go Summit: AlphaGo & Ke Jie match 3 moves analysis

Google DeepMind

Emergence of Locomotion Behaviours in Rich Environments

Emergence of Locomotion Behaviours in Rich Environments

Google DeepMind

StarCraft II 'mini games' for AI research

StarCraft II 'mini games' for AI research

Google DeepMind

Trained and untrained agents play StarCraft II full 1vs1 game

Trained and untrained agents play StarCraft II full 1vs1 game

Google DeepMind

DeepMind open source PySC2 toolset for Starcraft II

DeepMind open source PySC2 toolset for Starcraft II

Google DeepMind

ICML 2017: Test of Time Award (Sylvain Gelly & David Silver)

ICML 2017: Test of Time Award (Sylvain Gelly & David Silver)

Google DeepMind

Ke Jie and DeepMind's Go Ambassador Fan Hui review the 3rd AlphaGo vs Ke Jie game

Ke Jie and DeepMind's Go Ambassador Fan Hui review the 3rd AlphaGo vs Ke Jie game

Google DeepMind

Ke Jie and DeepMind's Go Ambassador Fan Hui review the 1st AlphaGo vs Ke Jie game

Ke Jie and DeepMind's Go Ambassador Fan Hui review the 1st AlphaGo vs Ke Jie game

Google DeepMind

Ke Jie and DeepMind's Go Ambassador Fan Hui review the 2nd AlphaGo vs Ke Jie game

Ke Jie and DeepMind's Go Ambassador Fan Hui review the 2nd AlphaGo vs Ke Jie game

Google DeepMind

AlphaGo Zero: Discovering new knowledge

AlphaGo Zero: Discovering new knowledge

Google DeepMind

AlphaGo Zero: Starting from scratch

AlphaGo Zero: Starting from scratch

Google DeepMind

Defining principles for tech companies in the NHS: DeepMind Health's Collaborative Listening Summit

Defining principles for tech companies in the NHS: DeepMind Health's Collaborative Listening Summit

Google DeepMind

A systems neuroscience approach to building AGI - Demis Hassabis, Singularity Summit 2010

A systems neuroscience approach to building AGI - Demis Hassabis, Singularity Summit 2010

Google DeepMind

Retour de Rémi Munos en France et ouverture de DeepMind Paris

Retour de Rémi Munos en France et ouverture de DeepMind Paris

Google DeepMind

Grid cells - Caswell Barry, UCL

Grid cells - Caswell Barry, UCL

Google DeepMind

DeepMind Health Research and Moorfields Eye Hospital NHS Foundation Trust: What our research shows

DeepMind Health Research and Moorfields Eye Hospital NHS Foundation Trust: What our research shows

Google DeepMind

DeepMind Health Research and Moorfields Eye Hospital NHS Foundation Trust: A Patient's Story

DeepMind Health Research and Moorfields Eye Hospital NHS Foundation Trust: A Patient's Story

Google DeepMind

Deep Learning 3: Neural Networks Foundations

Deep Learning 3: Neural Networks Foundations

Google DeepMind

Deep Learning 5: Optimization for Machine Learning

Deep Learning 5: Optimization for Machine Learning

Google DeepMind

Deep Learning 8: Unsupervised learning and generative models

Deep Learning 8: Unsupervised learning and generative models

Google DeepMind

Reinforcement Learning 1: Introduction to Reinforcement Learning

Reinforcement Learning 1: Introduction to Reinforcement Learning

Google DeepMind

Deep Learning 2: Introduction to TensorFlow

Deep Learning 2: Introduction to TensorFlow

Google DeepMind

This video lecture covers planning by dynamic programming in reinforcement learning, including policy evaluation, policy iteration, and value iteration. It provides a comprehensive introduction to solving Markov Decision Processes (MDPs) using dynamic programming.

Key Takeaways

Evaluate a policy using the Bellman expectation equation
Improve a policy by acting greedily with respect to the value function
Converge to the optimal policy and value function
Apply policy iteration to find the optimal policy
Use dynamic programming to solve MDPs

💡 Dynamic programming can be used to solve MDPs and find the optimal policy by iteratively evaluating and improving the policy.

🔒 Pro feature: Ask AI to explain this lesson →

More on: LLM Foundations

View skill →

Getting Started with Vertex AI Gemini 1.5 Flash

I TRAINED AN AI TO SOLVE 2+2 (w/ Live Coding)

I TRAINED AN AI TO SOLVE 2+2 (w/ Live Coding)

How to use the ChatGPT API with Python!!

How to use the ChatGPT API with Python!!

Nicholas Renotte

Gemini 2.5: Create an interactive plot of economic data

Gemini 2.5: Create an interactive plot of economic data

Google DeepMind

LangChain Chatbots: Building a Personalized AI Assistant

LangChain Chatbots: Building a Personalized AI Assistant

Analytics Vidhya

Auto-generating meeting notes with Python

Auto-generating meeting notes with Python

Related Reads

Snapshot Once, Rollout a Thousand Times: A Practical RL Setup for Coding Agents

Learn to optimize Reinforcement Learning (RL) by snapshotting the environment and forking it into multiple rollouts, reducing the bottleneck in RL training

Dev.to · Sebastian Buzdugan

Trust Region Policy Optimisation — The Exact Method PPO Approximates

Learn the exact method that Proximal Policy Optimization (PPO) approximates, Trust Region Policy Optimisation, and its implementation in Python for advanced reinforcement learning

Medium · Python

Hierarchical Reinforcement Learning in StarCraft Micromanagement with Influence Maps and Cluster-based Scripts

Learn how to apply hierarchical reinforcement learning to StarCraft micromanagement using influence maps and cluster-based scripts to improve AI decision-making in real-time strategy games

Proximal Policy Optimisation — The Clip That Made Policy Gradients Reliable

Learn how Proximal Policy Optimisation (PPO) makes policy gradients reliable in reinforcement learning

Medium · Machine Learning

Middle Management Meritocracy: Shockingly Naive