RL Course by David Silver - Lecture 6: Value Function Approximation
Key Takeaways
The video lecture by David Silver covers value function approximation in reinforcement learning, including incremental and batch methods, differentiable function approximators, and convergence to optimal solutions. It also discusses various algorithms such as Monte Carlo learning, TD learning, and TD Lambda, as well as the use of neural networks and experience replay for stabilizing the learning process.
Full Transcript
uh so welcome back uh today we're going to start getting serious learn how to scale up reinforcement learning to real practical problems uh by the end of today's class hopefully you'll be able to go off into the world and actually program interesting reinforcement learning agents that can solve uh problems in fact we'll see you know state-of-the-art results and um you'll really get to the point where you're there where you can see the whole algorithms do um policy evaluation do control um using function approximators to represent the the value function that's the key idea for today um the outline is going to be um fairly simple today we start off just understanding the the basic space we're working in um and then we're going to split things up into two different approaches first of all we're going to start with incremental methods really this this approach is to say you take some function approximator like a neural network and incrementally every single step you see some new piece of data that comes in and immediately online um you take some step um to update U your value function online um and then we'll start looking at more data efficient methods that kind of look at the whole history of everything you've seen so far and start to um look at the whole set of data and start to exactly fit your value function to everything you've seen so far um so roughly we're calling those batch methods you could think of this also as online methods uh the split is a little bit gray as we'll see and there's certain methods which work very well here that build on some of the ideas that we use here um so it's not quite as clearcut but I think it's a helpful distinction um to understand the space of value function approximation methods um so why are we interested in value function approximation you know what's what's wrong with The Story So Far um well we'd like to use reinforcement learning to solve large real world interesting problems so how large do problems get you know let's just get a sense of scale first of all so if we think about some of the problems that people have solved I mean you know I've worked a lot in games and um you know in games you often see numbers like this where the game of back gam is considered a very small game for uh working in games that's got just 10 to the 20 States um if you work in something like go um there's 10 to 170 States um but you know if you work with robots in the in continuous State space there's an uncountably infinite number of states in this in this domain so something's got to give you can't just build a table anymore you know the approach we've seen so far of just building a table and having a separate value for each state in your state space it's just not practical it's not going to work it's not going to scale up to these interesting problems um so we need some method that scales up and regardless of the size of the state space we want to be able to get methods that work we'd like methods that work with you know infinitely large State spaces and you just happen to build a function approximator that estimates the value of just the parts of the space which you visit and know about and which generalizes across parts of those space so that you don't separately represent the value of being here from the value of 1 millimeter over to the right now intuitively the value of those States should be pretty similar and we want our value functions to understand that generalization without having to separately store a distinct value for each of these states so that's the goal of today's lecture is to understand how to achieve that generalization and to build up efficient methods for both representing and learning value functions in reinforcement learning in the next class we'll look at other approaches to using function approximation uh for policy space algorithms rather than um rather than value based we'll look at policy based algorithms um but today we're really going to focus on value functions how can we scale up value functions have been the central idea we've seen so far and so in particular we want to know how to do the core ideas that we've seen so far both prediction how to evaluate a policy and also how to do control I mean these are the main subjects of the last two lectures and we want to scale them both up we want to be able to achieve good results with both of these ideas both of these paradigms so here's the basic idea in a nutshell so so far we've really looked at this value function we've had some value function V of s and this V has represented some kind of lookup table so for every state or for every state action pair um there's been a single value stored so for V there's been like a single value per state and we built this big table that said you know for every distinct s there's going to be a value V of s and we were then able to build algorithms that manipulated those V of s's to to figure out the problem and pick the right actions and solve the MVP um when we wanted to do control if you remember with control we need to be able to pick our actions without using a model if we want to do model free control so we need to use an action value function q that lets us consider for each state and for each action separately you know what's the value of taking that action a and that state s um and this Q this um action value function was sufficient to do control because if we have this thing we can immediately pick our actions by maximizing over our a in each state s but again so far we were just restricting ourselves to this giant table that has for every State and also for every action now so think of this as like a two-dimensional table for all states in your state space and all actions in your action space we had a separate entry in the table okay and so let's be specific about the problem now um there are just two many states and or actions to store in memory that's the first problem but also even if you can store it in memory if you've got such a vast data space like let's say you fill up your whole memory you've got you know uh a few gigabytes of of memory maybe you can build just a really really large table and represent your mdp but even if you do that it's going to be too slow to learn about them all in practice you know we don't want to see trajectories that estimate the value of each of those States and actions separately it's just going to be too slow you need too much data um and so what's the solution the solution is value function approximation and we can basically represent the idea very simply in this following way what we're going to do is we're going to consider the True Value function V Pi of s as just some function m from s to the true value of V of Pi what we're going to do is try and build some function that estimates this thing everywhere so if you feed it in n s um it will give you some estimate of this V Pi of s and the way we do that we're going to consider parametric function approximators that have some parameter Vector W so w is just some Vector of Weights um think of this as the parameters of your neural network or um if you have a linear combination of features it's going to be the weights of those features I'm going to estimate value function by building some function approximator that fits this thing for across the whole state space so we're going to try and fit this using some compact representation that's going to have a smaller number of Weights typically um than the number of states in the state space so using a small number of parameters we can now fit this thing everywhere even we don't have to have a single entry for each for each state in the state space and alternatively we can do the same idea for the action value function where we can try and F the true action value function Q Pi which tells you how good is it to take action a from State s um and we're going to just consider that as a this big function for any s and any a we should be able to query this function and we want to fit that function for all S's and all A's we want to just fit this thing so that now if we feed in our s and our a we're just going to have some Vector of weights again this could be the weights in your neural network we're going to try and fit some function approximator that estimates this thing everywhere across the whole state space and the whole action space I'm just going to think of that as a big function and try and fit that function and what this is going to enable us to do is not just reduce the memory um so we can reduce the memory if there's a smaller number of Weights than we had States um it's not just going to reduce the memory it's also going to allow us to generalize and the reason it allows us to generalize is because we can fit our function um to states that we've seen and we can generalize to states that we haven't seen by querying our function approximator at points that we've never actually seen so far um and the way we're going to do this then so we'll get into more detail so we don't have to understand everything from this slide but the way we're going to do this is basically we're going to think of this parameter Vector the weights of our neural network we're can update these things um using the methods that we've seen in the last couple of lectures Monte Carlo learning TD learning TD Lambda and so forth that's going to give us the targets um for adjusting for fitting our function approximator so that we can start to understand how to get the right value function but before we do that what we're going to do is before we understand how to do the Monti car or TD learning we're just going to focus on what it means to do function approximation um with a value function so this slide basically shows us three different types of function approximation we can use with value functions so these are like different architectures you could use um so I mean I tend to think of you know neural network as a canonical Black Box function approximator but you can use anything you like as this black box any function approximator at all they're all valid for reinforcement okay and the idea is um if we're just trying to do state value function approximation you've got this black box here you feed in your state here you've got something like a a neural network that's going to kind of represent this wiggly line is supposed to represent the fact that there's some internal function that this thing is trying to fit and represent using some internal parameter Vector W um and the way in which W affects this is just part of this black box we don't worry about that and what this thing is going to spit out is the value function at this query State s so we feed in s this thing estimates how good is it to be in s so we feed in I'm in this state over here and this thing says aha um you're going to get 7.6 units of reward um discounted over the rest of your trajectory now when we do action value function approximation we've got two choices um we sometimes call this action in and action out um action value function approximation so in the action in case what we're going to do is say okay I'm in this state here and I'm considering this action here how good would that be and there's this black box here which is fitting some function across all states and all actions using some parameter Vector W and it spits out its estimate of how good it is to take that particular action in that particular State sometimes it's more efficient to use a different form it's equivalent in its um the the actual data that it's representing is the same but it leads to a different training algorithm and sometimes a more efficient or less efficient result depending on the context um and and now we're just going to say herey the state comes in and we want our function approximator to tell us the value of all actions that I might take so just the advantage of this is in a single forward pass this thing can spit out the the values of all the different actions so we use this in Atari for example in one forward pass you can say how good is it to move the joystick left to move the joystick right up down press the fire button and just in one forward pass of your neural network you you get all of the information required to be able to make a decision you don't have to feed it back in with different actions again and again and again um so these are all mechanisms for Value function approximation and basic idea you should be trying to get your head around is that there's some representation of how good it is to be in a particular State and we're trying to generalize across the whole state space using this sort of tunable parameter Vector we're going to try and fit something true to the True Value function so we want the shape of this internal kind of curve to this surface we're building to be as close as possible to the True Value function so that now for any state that we query we get the right answer I'm going to train these things up to give us the right answer and the way we do that is how the the reinforcement L comes in is that clear so far so let's open up the Black Box a little bit now um and ask the question well you know what should it be you know what type of function approximator so hopefully um during your experience of the uh machine learning Masters or whatever else you're doing um you've come across a whole variety of different function approximators for supervised learning or perhaps unsupervised learning um some of the most common ones are listed here linear combinations of features neural networks decision Trees nearest neighbor for or W bases um there's a whole bunch of different things you might try um and you might ask well how can I navigate this C of possible different function approximators um so for reinforcement learning well at least for this class what we're going to do is focus on differentiable function approximators so these are function approximators where it's um relatively straightforward to adjust the parameter Vector because we know the gradient of our function approximator we know how if we change these W's um these things tell us if we change the W's how will that affect the the V that comes out so we know that gradient if we know that gradient there's a there's a set of tools that we can bring to bear um which allow us to do um a certain class of of algorithms that can be quite efficient incremental have nice properties um so for example we're going to focus today on linear combinations of features as the canonical case um we'll see a little bit more what that means um but also also allow for nonlinear function approximators like neural networks um and these are examples of differentiable function approximators where we know exactly how you know if you change the W on your neural net we know exactly how that affects the the outcome um because you do a backward pass on your neural network that gives you your gradient similarly with these linear combinations we know exactly how the graded these things so that's the representation we want a representation that's differentiable but what about training these things well what's special about reinforcement learning compared to supervised learning um is that in practice we end up with this non-stationary sequence of of value functions that we're trying to estimate um if we're trying to estimate V of Pi with this thing we want this thing to evaluate our current policy but our policy is changing our policy is going to improve like when we start to do control we're going to start off with say some random policy we want to evaluate that random policy fit a value function to it and then we're going to improve our policy and we want to fit that new better policy and then we're going to change the shape of this is going to sort of slightly adjust as we get better and better at navigating our domain sometimes it dramatically changes you know you figure out how to get into a new room that you've never seen before the shape of your value function suddenly changes because you discovered a big treasure in that room or a pit of snakes or whatever it is um so so we need to allow for non-stationarity in our um in our function approximator in the way we train them um and we need to allow for non IID data this is not supervised learning the data that as it arrives it arrives in trajectories we're following and you know where I am now is very highly correlated with where I am at the next step um and so IID training methods typically um not effective or at least we need to be smart about how we make use of them um so that's the challenge we're going to try and make use of these tools put them together in a way that lets us do RL at scale and the methods are relatively straightforward okay so we're going to start by looking at incremental ways to do this using stochastic gradient descent to achieve incremental value function approximation and then we're going to move on to batch methods such as Le squares methods okay so let's start off with gradient descent um so first of all show of hands so who is familiar with stochastic gradient descent good okay I think that was almost everyone um I'll go through it briefly um I think it's good just to make the notation clear make sure even if you're familiar that everyone's we're going to make heavy use of this over the next few slides um so we're going to consider some um function this is just gradient descent in general before we move to our value function approximation so let's just consider some differentiable function J with parameter Vector W and we're going to use this definition of gradient this sort of um U grad W notation um and what we're going to find the gradient Vector to be um is the gradient of partial derivatives with respect to each of our parameters in ter so this Vector basically tells us the direction of steepest descent or steepest ascent and we're going to follow this downhill so this thing tells us you know if I'm on some surface um which way is it to go down and I want to follow that down my Surface um in the steepest possible way and our algorithm is simply to say to find a minimum of this we're just going to adjust our parameters in the downhill Direction so we're adjust our parameters to move down and down and down um until we minimize this um this function which we haven't defined but this is our objective function for some objective that gives us a surface we want to move down um to find a minimum of that objective function and all we're going to do is adjust our parameters a little bit downwards that's the minus here the half is just to make everything neat in the mass some step size um in the direction of our gradient so that's gradient descent and so what we're going to do is plug this into value function approximation so for this slide let's imagine that someone tells us um V Pi let's imagine we're doing supervised learning for a minute let's imagine there's an oracle that says here's the correct value for V Pi so what would what would it look like what would our algorithm look like um so if we had this V Pi Oracle then we could just minimize the mean squared ER like some the Oracle says aha you should have done this um your value should for this state should have been 7.3 um but you actually got a value of seven so that gives you an error term of of three um and in expectation over all the states that you visit there's some um expected error and what we want to do is to adjust our parameters in the direction that minimizes that expected error the mean squared error um so we can do that using gradient descent um so we just again move a little bit in the downhill Direction adjust our parameters downhill um now our J that we're going to plug in is this mean squared error so our objective is to minimize the mean squared error between what we thought the value was going to be and what the the theorical tells us the value should be so we want to minimize that error um by gradient descent um so all we're going to do is plug that in so we've got this mean squared error here um so how do we minimize this thing well we're going to apply the chain rule to this so we've got this squared error here so if we apply the chain rule to this we end up with this error term multiplied by the gradient of this error term um but our Oracle is just some constant it doesn't depend on our parameters so the gradient of this thing is just the gradient of our function approximator we get this term here so to do gradient descent all we need to do is move a little bit so forget the expectation for a minute that's just going to be averaging over all the samples we see um going to move a little bit in the direction of the eror that we see in each state multiplied by the gradient um and then the way to deal with the expectation is by using stochastic gradient to set so instead of doing um full gradient instead of explicitly Computing this expectation what we're going to do is we're going to sample a state um so we're going to randomly sample a state by just seeing which state we visited we're going to look at what the Oracle says in that state V Pi we're going to look at our estimate this is what we thought the value was going to be we're going to look at the error term here and we say okay we'd like to adjust our parameters so as to correct for that error term so we've got a step size multiplied by the error and now multiplied by the gradient so the gradient essentially tells you how to adjust the parameters so as to um move you in a particular direction so you've got this error that you want to correct and the gradient tells you how to correct it and so in expectation you can see that if you visit all states um according to a distribution um under your policy Pi then in expectation we're going to arrive back at minimizing the mean squared error and the beautiful thing about stochastic gradient percent is that even if you change things online um um then you still arrive at minimizing this um particular ter you control for step sizes and so forth appropriately so literally every single step incrementally online all you do is you you take a step you look at where you've arrived you make a prediction of what your value is going to be the Oracle tells you what the value should have been you immediately adjust your weights and you move on to the next step and if you keep doing that repeatedly um then stochastic approximation Theory tells us that this really will um move you downhill and minimize the mean squ error between your function approximator and the Oracle you're fitting the oracle's predictions across your whole state space that you're visiting and we don't need to know or worry about this expectation anymore okay so that's is that clear so far apart from the fact that we've cheated so clearly this is cheating right we've used this Oracle um so um so we're going to start to deal with that ah before we do that let's try and get concrete about what this looks like in at least a couple of special cases um so this is probably the commonest case that you see you know if you sample a random paper from the reinforcement learning literature you would see an awful lot of um linear function approximation using features so what do we mean by features well let's define a feature Vector so a feature Vector is basically um just um each of these features is something which is just something which tells you something about your state space it can be anything you like it could be like a a landmark you know how far as my robot from each of these landmarks could be my my feature Vector this could say I'm you know I'm U 3 meters from the wall over there and I'm 2 m from the wall over there and that could be my representation of where I am in the state space and the feature of throws away a lot of information I'm no longer considering exactly what I'm seeing and I'm compressing all of my knowledge into this feature Vector but if these features are good it can make the learning problem much easier um it could be you know some Trends in the stock market or if we're playing a game it could be you know which configurations of pieces or porns are present on a chess board um these are all valid choices of a feature vector and each feature just is just one thing one one number um telling you um some piece of information about your state and the collection of these features is trying to summarize in some compact way what's going on in the state okay so let's assume for now that someone gives us a good feature Vector so we're not going to address the discovery problem of where these features come from if you want to do that you can you know use a nonlinear function approximator like in your network or all kinds of more sophisticated methods that we won't go into um and now how do we use these features well the simplest way to make use of the features is to make a linear combination of those features so we're going to estimate the value function by basically combining them together um with a weighted sub so this basically says I'm now going to estimate the value function of being here um by saying okay so I'm um I'm going to have some weight that tells me how good I think it is to be a certain distance from this wall and how good I think it is to be a certain distance from this wall and I'm just going to take a weighted some of those things so you know maybe I think I get one more unit of reward um for each each time I step away from this wall um and minus half a unit of reward each time I step away from this wall and I would sum those together so that for any position I can work out um exactly my estimate of how much future reward I'm going to get based just on those two features and so you know clearly this is not going to be a perfect representation unless our features are extremely sophisticated um but it can give a very Compact and easy to work with representation of of the state space give us a very nice way to represent our value function um so so basically what this looks like then is we say for each of our features we're going to take a particular weight this is the jth element of our feature Vector um and we're going to take the jth element of our weight Vector now and we're just going to kind of sum these things together or equivalently we take a DOT product of our feature Vector with our weight vector and that estimates how much feuture reward we're going to get so the nice thing one of the nice things about U this sort of linear value function approximation is the objective function this mean squ error that we were looking at that we want to optimize like the our measure of fit between U the True Value function and our our estimated value is now going to be um convex it's going to be a quadratic um you can see it's a quadratic U because if we consider our parameters W we've got this squared error here and so if we consider our mean squ error uh we just ignore all the other terms you can see that this thing is quadratic in w okay so that means there's some Bowl um or other quadratic shape turns out to be a bowl um representing uh basically the the um our mean squared error the objective function and that's a very easy shape to optimize using standard optimization methods in particular if we follow gradient descent and we just move down this mean squ error we always take a downhill step we will get to the bottom of our B we will find the optimum mean squ error so the nice thing about using linear combinations of features is that when we optimize them you never get stuck in some local Optimum you always get the the best correct answer if you follow your gradient for long enough and the update rule is particularly simple so we find the global Optimum using this very simple update rule um and the reason it's so simple is because the gradient of this thing here um is just the feature Vector the gradient of this with the respect all weights is just the features we're just everything is linear in our in our um in our weights so we just need to multiply by our features and so the update rule becomes very simple which is to say the change to W the change to our our weights of our linear combination again if we've given an oracle here we're going to look at what the Oracle said the value of this state would be so you know I'm over here I look at my features I see how much I estimate I'm going to come up with some estimate and say okay I think I'm going to get you know six units of reward see my error with what the Oracle said and now I'm going to move a little bit stepsize in a direction of how to correct this error and what we're going to do is we're going to basically look at each of the features and say you know whichever one of these features is on or or or highest is going to take the most credit for that error which I see so we basically move things in the direction where to adjust things by basically changing the weights which have the most active features so if a feature is off like imagine you're playing chess and you've got a feature to say you know have I got a queen and now you see some error um in a position where you don't have a queen um and you actually end up thinking now you're going to lose the game before you thought you were going to win the game you wouldn't update that feature because you were in a position where you didn't even have a queen at all so you only adjust things where where the features are essentially on and active you don't make any change to to situations where your features are off or or inactive okay so linear combination of features very simple very straightforward it basically boils down to this very simple update you move your your weights a little bit and the direction of the error multiplied by your feature vector and the feature Vector tells you how to adjust things um so as to correct in the direction of where where things were really on like if I'm if I'm really far away from this wall and I'm estimating that the further away out from this wall the more reward I'm going to get um then that wall wall that feature from that wall is going to be you know giant that's going to be the thing which should dominate This this term here so I really adjust that weight more okay clear I want to connect what we've just talked about with the previous classes so I you know I really don't want you to have the feeling that oh now we're talking about value function approximation and in the last class we were talking about table lookup and these are two totally different classes of they're the same like table lookup is a special case of linear function approximation um and we can see that by basically defining some special features and we can basically build up we can build up this this special feature Vector that says am I in state one if I'm in state one have a a feature value of one if I'm not in state one have a feature value of zero we can do the same thing for state two same thing all the way down to the final state so now we're can to have this enormous feature Vector with one feature for each state we're in um and all of them are going to be zero except for the Fe the state which we're in right now just going to say aha I'm actually in state s 734 um and that one would have a one and everything else would have a zero if we use that representation we use this x table as our feature Vector um then when we take our DOT product between our feature vector and our weights we see that we're just picking out out one of those weights so we've basically got a weight for every single um entry of our table we've got a weight for every single state in our state space And depending on which state we're in we're just picking out one entry from our table so this thing just lets us select the right entry from our table and now we're back to having a table lookup representation this is our table okay we've got one entry for each state we pick out the entry of that state according to the feature the state the state we're actually in time step okay take away from this slide table look at representations are a special case of value function approximation we don't need special Machinery to deal with them everything we've seen seen so far is a special case of what we'll talk about today um it's just a you know particularly convenient um mechanism to to use or to think about okay so we cheated we've really cheated we've imagined the existence of this Oracle um that's totally out of the spirit of RL I mean this just doesn't happen in practice you know the whole point of reinforcement learning is there's no supervisor to tell us hey the right answer was 7.3 um we need to figure that out directly from experience so how do we do that what's the um what's the method we use well we're going to use the same fundamental methods that we've seen in the last two lectures we're going to use um Monte Carlo and temporal difference learning methods um to give us a Target to use for our function approximator um instead of the Oracle so so far we basically assumed that we were given this V Pi of s but we know that we've seen in the previous lectures that that we can get um some targets that estimate V Pi of s directly from experience how do we do that well if we use Monte Carlo learning then this target will just be the return so we use the return as an unbiased estimator of B pi and so the idea is that if we substitute in now we're just going to do this by substitution so think of this as like uh um you know copy and paste exercise and that we've developed an algorithm for supervised learning and now what we're going to do is wherever we saw V Pi before we're going to copy and paste some other Target instead that we arrive at through a reinforcement learning method so the first thing which we're going to paste in is the is the return so we're going to basically we take the return here and what this algorithm says is you know the change to our feature ventor or to our weights sorry the change to our weights is we're going to move a little bit um in the direction of the error between the actual return so you know now going to say I'm going to start here I'm going to estimate using my function approximator I'm going to come up with some estimate that says I'm going to get 10 units of reward and then I'm going to see what happens run my entire trajectory out discover that I actually get 12 units of reward generate an error of two units of reward um and update my weights a little bit in the direction of that error multiplied by the gradient of my function approximator that tells me how to move in the direction that changes that um so change my values um so that's how to plug in Monte Carlo we basically take our our supervised targets our V pies and we replace them with a return so now we're just updating things we're fitting our value function towards the returns the returns become the supervisor so we're kind of doing supervisor learning on the returns is that clear if we use TD learning we're going to use a different Target um as our estimate of VP we can use our bootstrap estimate we can use the TD Target so now we're basically going to say um instead of waiting all the way until the end of the episode and getting some estimate of the return I'm going to use the TD Target I'm going to you know start in some state I'm going to estimate that I'm going to get 10 units of reward on this trajectory then I'm going to take a step see that I get say one unit of reward and then estimate that I'm going to get eight more units of reward um and so I'm going to say okay well my TD Target will be the the one plus eight so I think I'm going to get nine more units of reward and now we're just going to do just like supervised learning again we're going to treat that as our Target we adjust our function approximator so as to fit these values that we're predicting so now we're going to use that as a training sample we're going to try and fit our function approximator towards that example where we think we're going to get nine units of reward so we do that again by moving a little bit and the direction of the error between our nine units of reward and our 10 that we predicted before multiplied by the gradient that helps us achieve that to realize that that change we can do the same thing with TD Lambda um I won't go into the details of TD Lambda today you can look at the previous lectures same idea we use the Lambda return so this is the thing which sort of interp between td0 and multic Carlo learning whatever our Lambda return is we treat that as the Target and we fit those Lambda returns that we see we get use those as the targets for our function approximator so what's this look like um so we can think of this as a process that looks a lot like supervised learning so when we do Monte Carlo learning we're basically going to use the return and we're going to build some training data think of this as building training data but we're doing this kind of incrementally that we first of all see State S1 and we run a trajectory from it and we see that we got a return of G1 and then we visited state S2 and from S2 we got a return of G2 all the way up to our final State our trajectory or our wherever the agent got to and and it's final return here so that's the data that the agent seeing it's seeing this data it's seeing all of these states and from each of those States we've got some estimate of the the return which we're using from our Monte Carlo learning so our Monte Carlo we're just rolling out and seeing how much reward we actually got from each of those States think of that as like your training data and so now what we're going to do is basically just like supervised learning we treat this as a data set and we're basically going to adjust our function approximator so it's to fit the G's and estimate what the G's will be from any one of these s's so this is just like supervisor and the simplest case is using linear Monte Carlo policy evaluation so this the same as the last slide but now all we're going to do is substitute in our linear function approximation so the gradient just becomes the feature vector and all we do is we plug in the return so adjust the weights of our linear combination a little bit in the direction that corrects our estimated value towards the return um and we know that this has to work because we're just doing stochastic gradient descent castic gradient descent converges to a um an Optimum and we know that Monte Carlo learning gives us an unbias estimator of the of the true Target so essentially it's as close as you can get to an oracle it's just a noisy Oracle um and so this will work it just may be a little bit slow so even if we use nonlinear value function approximation Monte Carlo will converge to some local Optimum or in the linear case it will find the global Optimum we can do the same idea with TD learning um but now we've got this bias sample so we don't have the the true Oracle anymore we don't even have a noisy Oracle we just have some biased estimator of the target so you know what I think will happen after one step and then invoking my own function approximator again so you know I get the reward rt+ one and then I have to query my own function approximator to estimate how much reward I'm going to get for for the rest of the trajectory that's how we get the TD Target so this thing is biased because we're having to go through our own neural network or our own linear function approximation to figure out what these values will be um and so this gives us a bias sample of the True Value um but we can still once we've accepted that we've introduced a bit of bias we can do the same idea we can create this data set what's the data set look like now well it's like for each state um we have a TD Target it's like I was in this state and I saw one step of reward and then a value function estimating how much reward I was going to get from that point onwards um then I was in another state and I got one step of reward plus the value function after that um and I can build up a whole data set that basically uses these TD targets as like training data we start to fit our value function towards these um TD targets so let's do the simplest case again let's consider linear function approximation so now what we're going to do is plug in our TD Target so we're going to basically say we're going to adjust our weights a little bit in the direction of the error between what I thought the value was going to be um and my one step estimate of the the value the TD Target like you know after one step how much do I think I'm going to get treat that as the target we're going to see that there's some error like if you're playing a game this could be the thing that says you know before I played the move I thought I was winning and then I played my move it's like oh no I actually just saw that I blundered and I'm going to you know lose my queen um and so that generates an error between what you thought was going to happen and what actually happened we use that error signal to correct where where you started now you think oh I was actually losing all along I just hadn't realized it until I took that step and now this gradient tells you how to adjust your function approximator to make it more like you thought you were losing in the first place um and so we can compactly write that in the following way um we update our Vector step size multipli by TD eror multipli by feature vector or the gradient in that more General case feature Vector in the linear case radient and the mon linear case you had question I've been quite suspicious that you're all so quiet it's like you're normally like asking questions every five seconds good um so you're saying that generate this data set and super you do this after step or is it after great question so so I'm just actually trying to give a flavor of how this relates to supervised learning so you can think of what it means to do train this function approximator in practice we're doing everything incrementally that this is the first half of today's class incremental online update so every single step um we're going to take a step we're going to see um we're going to get our generate our TD Target so like I thought I was so I'm in this state I thought I was going to get some value I take my step I get some new estimate of the value I gener a TD error I update my weights immediately and then I move on to the next step so the data set idea we'll come back to later when we do batch methods but it's just to see that that there's a relationship between this and supervised learning that we're doing something where we're we we're associating for each state we're associating each state to a to some Target and we're just playing with what that Target is we're playing with whether that Target's the you know an oracle whether it's the Monte Carlo Target whether it's the TD Target or whether it's like a TD Lambda a Lambda return okay and even though this is biased it's really important to note that there's a you know famous result by uh titicus in Van Roy that shows that linear temporal difference learning still converges if you use linear function approximation um despite the fact that we've introduce this bias it still converges close to the global Optimum where close depends on things like your discount Factor and then finally I won't dwell on this we can do the same idea with with TD Lambda and now the the data set if you like the data that we're generating again we're still doing this online um but the the association that we're making is each state we're associating with a um with some Lambda return now so the Lambda return is this mixture of all your endep returns um and so we're going to try and basically learn to make our value function fit it to these Lambda returns we want to take some weighted combination Lambda weighted combination of what I think is going to happen after one step after two steps all the way up to infinite steps and make our weighted combination of all of those estimates call that our Lambda return and then fit our value function at each of these states towards those targets and we're going to do this incrementally step by step but conceptually there is this data set that we're kind of moving through um and so we can do that either using the forward view so the forward view has this flavor where we move a little bit in the direction of the the Lambda return the error between the Lambda return and we thought the value was going to be multiplied by the gradient in the nonlinear case or the feature Vector in the linear case um and there's an equivalence again just as we saw in previous classes there's an equivalence to that this backward view where you can use eligibility traces and you can make something which is now truly we again we want to achieve online updates where you take a step you update and you can do that using eligibility traces so briefly um what do the eligibility traces look like I think someone asked in a previous class you know what's the size of these eligibility traces and and when you use function approximation the eligibility traces are eligibilities now like on the features or on the weights of your your parameter Vector they say they're they're the size of your parameters of your function approximator they're not the size of the state space so the eligibility basically says look each time you take a step we're going to Decay our eligibility a little bit but then increase it um in the direction of all the features that you've seen so this is kind of remembering all the features that you've seen so far according to their magnitude so so the eligibility Trace is kind of accumulating credit for the things that happen most and the things that happen recently and then it decays over time again so kind of each step we add on hey you know this feature was on um you know I had a queen at this step um and I had um seven pawns so those are the things which can contribute at this step maybe at the next step I only have seven Pawns those would be the features that contribute there you kind of take the sum of all the things which are on and active um or the large features and they kind of get um the more something occurs in your in your eligibility the higher your eligibility is pushed and that feature will then get updated the most when you actually see an error and then the TD arrow is generated in the usual way and we update our weights now a little bit in the direction of the TD error multiplied by this eligibility vector we update our weights proportional to the TD error and how much each of those weights is BL is to blame for the error that we've seen and without going into details again there's an equivalence where if you do this at the end of the episode if you were to just gather up all your changes until the end of the episode you would find that you make the same updates forward and backward these things are equivalent right so just before we do control is everyone good so far this is sort of this is like the main the main point of this class is these last few slides and if you if you're lost you should ask now because it's just going to get you know we're going to build on it and um do more things yeah yes okay great question so the question was why does the gradient only depend on our predicted value function let's go back to the simplest case with td0 and because we can still ask that question already here so the question is um when we take the gradient why don't we take the gradient of both the Target and our function approximation why is the gradient only of um of our function approximator um the answer is it's a slightly tricky answer okay uh one way to answer this is to say um we're doing TD TD means that you kind of um you're always pushing things towards what happens later because you trust the thing which happens later you don't want to reverse the flow of time you don't want to update um the estimate of where you end up end up later towards what happened before like if you have the gradient of both of these things it's like you kind of got a spring and you're you're pulling your value function um of what happened afterwards towards the value function of what happened before you're pulling them all together to try and reduce the error and you can show there's a very simple example where you can show even in a table lookup um situation with a stochastic mdp if you do that you will get the wrong answer you will not find the the correct value function um and the reason is that this um Trying to minimize for this error in this way doesn't um you need the flow of information to be grounded you need to ground things and the way you ground things is to go later in time like as you go later and later in time you see more and more real rewards and the more real rewards you see the more you can trust what you've seen in terms of what's actually happened you don't want to reverse the flow of time and and update something where you saw a real reward um towards something where you hadn't seen that reward yet you will kind of get the wrong answer if you do that and so putting the gradient on both terms is doing a little bit of that time reversal um it's doing both the forward and the backward Direction um and there are variants of that which are um not quite so bad as the way I've made it seem the the family of methods is called residual gradient methods and so it's a well-known idea people do this in practice but it's also well known that if you just naively plug in the gradient here of this term as well so if you just had the gradient of um of the error between these guys you would get the wrong answer um you have to be a bit clever about taking expectations and then that algorithm isn't actually practical and you can't do it in um in incremental settings you two samples and this version works well in practice um and you just need to be careful it's it's intuitively it's appealing um when you first think about it to do what you suggest but um I would recommend against doing that unless you really know what you're doing um but there are cases where where there's where it's a reasonable thing okay I'll come back to a related question in minute but thank you that was a great question okay let's move on to control so we're going to use the same idea for control that we saw in the last lecture where we're going to still build on our idea of generalized policy iteration um but now we're going to use approximate policy evaluation so what we're going to do is we're going to start off with some parameter Vector now and we're going to say okay that parameter Vector that's going to Define um some value function what we're going to do is basically act greedily with a little bit of Epsilon exploration for example um with respect to that U value function that we've defined so this might be our neural network we act greedy with respect to our neural network that gives us a new policy and now we want to evaluate that new policy so we run some more data we update the parameters of our neural network that gives us a new value function um and now that new value function we can act gely with respect to that and so forth um and again just like in the last lecture we're not going to go all the way to the top here we're not going to waste millions of samples of experience trying to exactly fit our function approximator indeed we know that if we're using a function approximator we're not going to precisely be able to achieve equality here ever anyway so what we're going to do instead is take some steps towards it um and then immediately update our policy and in the most incremental case where we just do um TD for one step we're going to do this every single step every single step we're going to um start evaluating our our policy we're going to update our neural network once slightly adjust the shape of our of our curve and immediately act with respect to that latest neural network to pick our actions so there's no reason to wait until you've really adjusted your neural network an awful lot before you start picking actions again you know why not always use the freshest most interesting information you've seen um and so that's the idea of this um approach here um and we'll see some variants later but um you can do policy evaluation where you do approximate policy evaluation um so again we're trying to estimate the action value function now so so let's just back up like last class we saw that there were two ideas in order to do control we need to take generalized policy iteration generalized um policy iteration and add two new ingredients where we need to use the action value function Q so that we could be model free and still be able to pick the greedy action and we need to have some kind of um exploration uh for example Epsilon greedy those were the two ingredients that we introduced last week um so we're going to use those same two ingredients but now we're going to use our function approximator to get an approximate Q value so think of this now like a neural network representing Q we're just going to figure out the parameters of that neural network as best we can then act greedy with respect to it throw in some noise to get some exploration continue our algorithm so what happens does this really find the right answer does this get to qar well of course not because it might not even be possible to represent qar anymore we we've used some approximation so we should expect in the you know the best possible case is that we will um have some controlled algorithm which gets closer and closer and closer to qar um but it turns out even that is too much to hope for in the control case we typically end up with algorithms which sort of oscillate around um and sometimes they get a little bit further before they come back in again and there's some ball within which they oscillate and they can kind of come in and out but in practice they tend to get very close close to the right answer and so all as well roughly so the first ingredient is we need to do all the same things using Q instead of V okay we need to think of how do we do this with an action value function so let's just back up we'll do the same steps again using Q just to make sure it's all embedded nicely um so we're going to approximate the action value function so now for any state and any action we're going to build a function that predicts um with these parameters W that predicts how much reward we expect to get from that state in action onwards and then we're going to minimize the mean squared error now between our our Q values and the True Q values for our policy and so we build up our mean squared error this is the expectation under our policy so um with respect to the states that we'll see if we just follow our policy and take the actions that we see we look at the squared error an expectation and we do stochastic gradient descent again again um so just the chain roll we have our error multiply by the gradient um and so now what we're doing is we're moving a little bit in the direction of the error between the Q value that I estimated before and the Q value that my Oracle gives me if we have an oracle multiplied by the gradient um and now what does our function approximator look like um well you know let's consider the simplest case again using linear function approximation we can now build features of both the state and action so this might be you know we might have one feature saying how far am I away from from this wall if I'm moving forwards that might be one feature and I might have another feature that says how far am I away from this wall if I'm moving backwards these are like functions of both the state and action which I take and there just going to be some number telling us each of these cases and maybe it's just zero if I'm not doing the action chain was interested in or maybe we generalize AC cross actions um and so that's going to give us features a feature Vector that tells us about the whole state action space and so we can build up a picture of what this whole combined State and action space looks like and what's the value of all of these things and the way we do that the simplest way is by building a linear combination of features you could also do something more sophisticated like in neur OWN Network and then just follow the gradient so a gradient update now collapses to change the weights the direction the step size multipli by the error multipli by just the features um whichever features are on so this would basically in the example I gave if I actually go forwards then we would only update the the feature corresponding to my forward action because the other one would have a zero value okay we can plug in the same idea so this is just the same Machinery again I'm not going to dwell on it for too long where we can basically do exactly the same things with Q um we can use Monte Carlo learning so now if there's no Oracle what do we do well again we can use um the return as a noisy um unbiased estimate of of the Oracle um we can use the one step TD Target or we can use the the Lambda return or we can do the backward view TD Lambda to use eligibility traces so all of these are just as before but now we're using Q rather than V and the reason to use Q rather than V is so we can do control once we have q we can just pick the max over our actions uh we don't need a model we can figure out what the next action we should take is um and continue to do our policy Improvement steps okay let's do an example there's been a lot of slides without getting concrete so this is probably the most widely used example in all of reinforcement learning so if you were to pick up like a random paper on reinforcement loan there's a very good chance it would have this mounting car problem in it um I'm kind of sick of seeing it now so um um but you know still I think it's a good problem and it's interesting to think about um the idea is you've got this car here it's stuck in this dip and you can see it's kind kind of steep here so how you going to get out of this dip the assumption is that your car isn't powerful enough to just drive straight to its goal here so what it has to do is has to kind of um just let go and roll backwards roll Drive forwards roll backwards Drive forwards again and build up enough um momentum to actually be able to reach the goal there um and then the question is well how do you figure out that control strategy how do you just do this um not telling us not saying anything about the mdp about the Dynamics you know do this all model free we want to solve this um and and now we've got we want to do this using function approximation so what's the state space in this problem well the state space is the the position of this car in this continuous State space um and it's also got some velocity um and so you can think of this as a two-dimensional State space position in one axis and velocity in the other axis and so the value function looks something like this you've got surface saying for each situation this car might be in like if it's here um and going downhill at some particular velocity um you might find yourself you know at some point on this diagram you want to know well how good is that how much how much reward will you get and of course like usual how much reward you'll get depends on the policy you're following so to begin with if you're just acting randomly then the value function looks something like this um and so over time what we see is that as the policy starts to improve prove we start to see more and more shape emerging um out of this um function approximator this is doing our generalized policy Improvement using the Sasa algorithm okay so Sasa if you remember it's literally um the extreme case of of this diagram here it's extreme case of this diagram where every single step we update our value function in this case using um a linear function approximated to estimate Q every single step we update q and every single step we act greedily with respect to Q to pick our next action um but we also flip a coin to see if we do something random to make sure we explore that's Sasa that's all it does um and the way that we update Q um is using the onestep TD returns um so these guys here so what's happening is we're on on our Moun car um where you know in some situation um we have some pred of what Q is going to be for um the action so there we take some action like trying to apply some acceleration um we've got some prediction of how much value we're going to get then we see that we actually end up falling backwards back down the hill a bit we see how much um Q we now predict in our new state and the new action we will take from there and we adjust our function approximator towards that new Q value um so we think that where I was before and applying an acceleration that the value should be adjust adjusted the value of this curve should be adjusted a little bit in the direction of those Q values um and what that does is it's effectively pulling up or down this function approximator in the direction of what we now think the value should be so every single step you see you kind of pulling up or pushing down this function approximator the shape of this curve this shape of this surface um and you know if you took a step and ended up in some situation where you think you're actually doing much better now you suddenly reach the goal you know that will pull up whole shape of this function approximator and this particular function approximator that's used for this example is what's called a course code which is like basically lots of overlapping tiles of different shapes and sizes um and they all overlap and anything which is actually active which is on will be pushed up um when a particular um state is visited so you visit a state you think aha the features which are on are all the ones nearby this they all get pulled up none of the other ones get affected and you get generalization around the region of the tiles which are sort of active at that point in time okay so that very simple function function approximator is sufficient to express quite a complicated um shape in in the value function and you see that there's this these curves and and complicated Surface starts to emerge here and you get these spiral patterns appearing until eventually you end up with something which looks like this so This is actually using a different type of function approximator rad radial basis function approximator you end up with a this kind of spiral pattern um in the position of velocity space um it's a really old figure you can tell it's got handwriting on it w um and and you get this shape appearing and this is an approximation of the optimal value function for Moun car um and so nothing is exact like we we're generalizing no one we don't precisely know what happens for some you know some point in between these things this this is a smooth function approximator that's interpolating within its tiles and and estimating some of these values but you can see that it works very well at getting this this space and and this also scales up to much more complicated problems as well okay does that give a bit more of a flavor what's going on good question in this example good question sorry I in this example there is a um continuous State space which is the position of velocity the action space um you can do it in different ways but um in this particular example the action space is either uh is is discrete so you basically get discrete choices of acceleration so you either cannot accelerate or accelerate um and the amount the accelerate is given um and for some class of interesting of problems like this often there's a class of problems called bang bang control problems where actually it's not interesting to consider the space between um maximum acceleration and and minimum acceleration because you would always choose to maximize to one of those extremes um and so yeah that's but you can also do this problem with continuous control but that's a different that would that would require so Sasa would still work the only question there is how you compute the max if you're if you've got continuous action space Computing the max is tricky because you've got to do an optimization just to even find the max Yeah question so the question is couldn't we use a different representation of State space to get a um a simpler shaped um value function um so that's a great question um so so I think the right way to think of this is to say that the state is what's given as part of the description of the mdp and this is the true shape of the value function now to learn this effectively you can put in any features you want So if you think that there's some features like the features that you provide can indeed transform the shape of the value function to something much simpler that's one of the reasons to use features so if you know good features you can transform this complicated spiral into something flat or something simple to learn um and that's putting knowledge in to simplify the shape of the problem but this is the value function this is V Pi you know V Pi looks something like this and we're just trying to approximate V Pi um and so yes indeed you can put knowledge in to help um figure out this shape of of VP more effective i l okay um so so we talked a little bit about using Lambda there um so I just want to throw out this slide which is like an empirical slide it's an old one but I think the the lesson is has been borne out time and time again since then so I think it's still useful to see um so forget the accumulating and replacing traces that's just two slightly different variants in how you use elabel traces the story is roughly that across a bunch of different environments um when you use function approximation and and something like Sasa the question is should you use eligibility traces and you almost always um see some kind of picture like this um this is a cart pole example this is Mountain car and you almost always have this picture where if you go all the way up to Lambda equals 1 and use Monte Carlo returns you know performance is really bad it takes an awful lot of samples to actually get the right U result the variance is too high and you're killed by variance if you go all the way to td0 you do better than Monte Carlo bootstrapping that gives gains you a lot of efficiency but there's normally some kind of sweet spot in between um and the shape of that sweet spot depends on the problem and all kinds of things but but typically you know having some lamp bootstrapping helps um and and Lambda can find you something better than td0 but the main point of this slide is to say bootstrapping typically helps and so it's worth finding algorithms that are effective when we bootstrap um and so the natural question is well is it sound to boot strap like we know that following Monte Carlo is just like this noisy Oracle we're doing something just like supervised learning and it must converge but is that true for U for TD methods do they always converge um and let me put up one situation where they don't so I'm just going to flash this at you without explaining the example if you were to do something like this simple mdp um and you were to update instead of instead of using Sasa and actually running on policy if you were to update this by um by doing sweeps over the whole state space and doing your updates on on all states you end up with um weights blowing up basically this is the number of iterations you get and the weights can blow up so TD isn't guaranteed to be a stable algorithm there are situations where you can apply TD where it can blow up and so it's important to know when it's safe to use TD and when it's not um so this is like a one slide summary that roughly tells us um when it's okay to use TD when we're guaranteed for TD to converge to something um and when there's a possibility that it might diverge so the crosses mean there's a possibility of Divergence tick mean guaranteed convergence to something close to the right answer um and so now we can consider if we're running on policy so on policy remember means you know we're actually learning from the behavior we're we're running so we run our trajectory and while we're running that trajectory we're learning about that policy that we're following um so Monte Carlo great okay it converges for take a look up for L for nonlinear this is all for just policy evaluation so far we're not talking about control on this slide but for TD algorithms if you run on policy and use nonlinear function approximation no guarantees it can blow up like the bootstrapping process can cause Divergence in in theory in practice it tends to work okay lots of success stories um but you can certainly construct counter examples where it blows up of policy even using linear function approximation um it's possible to construct counter examples like the one we just saw where TD blows up um so off policy learning um can be a little bit problematic um and I just want to mention some more recent methods so methods such as gradient TD also in the last few months there's another method called emphatic TD U these are methods which fix the issues um that the TD algorithm has when it bootstraps um and we can see that for example gradient TD this is a true gradient descent approach so instead of just doing this gradient descent and then substitute in the TD Target and hope that it's still a gradient um this is something which really follows the gradient of the projected Bellman error like that's particular objective function follows a true gradient I don't want to get into the details now but you can look it up and so it is possible just with a small correction term that you basically end up with the same update with one more correction term and it fixes the problems that we see with um with TD learning so you can get convergence but you have to um just be a little more careful with TD what about when we do control um control is surprisingly problematic like this whole framework of generalized policy iteration it's very appealing it gets very efficient algorithms you got these very aggressive steps where you take the greedy policy Improvement step and you really try and make your your policy as as good as it can be at every time step but essentially we very rarely have guarantees um in fact almost always we get chattering um where which basically means that each time you improve your policy you might actually make U the the uh there's no guarantee once you use function approximation that that your improvement step is really improving the policy um you might take a greedy an Epson greedy step but we don't have this guarantee anymore that that's actually making progress at every step and you can kind of make progress make progress and then take a step away and make progress make progress take a step away and end up chattering around the optimal um policy or the optimal value instead of ever converging to it um so cross here means catastrophic Divergence is possible this means this means that you chatter like you're always getting closer but occasionally stepping away but you never just kind of shoot off to Infinity right um okay questions before before we start running out of time good yeah yeah I just want to know when is the weights that that going to be fixed at certain so what does convergence mean convergence means that the parameters of your function approximator um um converge to some fix Point um and a secondary issue is that fix point the right fix point that we want um and so what this really means is that we converge to parameter Vector that's close to the best parameter Vector in the space for our function approximator okay yeah you said you can get catastrophic diverence in Theory are there any of those xes where you would often get catastrophic diverence in practice as a as a common observation um so yes we'll we'll talk about that in just a minute so um so one of the so so using neural networks if you just naively throw in say Q learning or or or Sasa or something like this and and and hope that it will work um it very often blows up with Within neur nwor and that's because you know if you use some fun non-linear function approximator with like these steep um sides you can really dramatically alter your your policy and you can your gradient can be very large you can be kind of shoot off to some new situation you can end up with these uh quite complex um Cycles where where you think you're making progress but but now you change your policy and that the value function changes in some way that changes the shape of this thing and you can end up spiraling outwards it's it's possible and yes it happens um with linear function approximation I would say in practice even if you're running off policy um where we technically have um a cross there mostly it's okay you sort of have to work a bit harder to to break the algorithm and I would say in practice you don't necessarily observe um Divergence um and I'm going to talk about how to address the neural network issues in the next part okay good so so I think we'll have time for talking about some of the batch methods but I'll leave some to to um a reading exercise afterwards okay so let's begin with the motivation so so far we've talked about these gradient descent methods they're very simple the updates are really simple you just move a little bit in the downhill Direction look at your error a little bit that way but they're not sample efficient you know we we see our experience once we take our an update we we adjust our function approximator in the direction of that experience and then we throw that experience away and we move on to the next one that's not data efficient we haven't made the maximum use of that data to to find the best possible fit to our value function and hence the best possible policy um so the idea of batch methods is to try and find in some sense the best fitting value function to all of the data that we've seen in our batch um and the batch in this case is the agent's experience that's its training data it's its life like the you know life is one big training set for for an agent and um we're trying to basically now fit everything the agent seen find the best value function that that kind of explains all of the the rewards that it's experienced so far um so what does it mean to find the best fit well one definition would be something like a least squares fit um so now you know if we want to fit our function approximator to U the True Value function for our policy um we can again Define some data set this is just we did before we can say we've got some training set which is our St in our our targets like let's say there's an oracle again so let's say in state S1 the Oracle tells us the value should have been this thing so this thing is just whatever our Target is we're going to call it V1 Pi that's um the Oracle consulted for State S1 we consult the Oracle for State S2 it gives us another value that gives us our training set for all of the data that we've experienced we remember that whole data set and now the question is um you know what's the best fitting value function for this whole data set well we could do something like these squares which would find the parameter Vector that minimizes the Su squ error between the value function um and the Oracle across the whole data set so if we consider this whole data set and look at the squared error between what we thought would happen here in the Oracle and the squared error between what we thought would happen here in the Oracle and we want to just minimize the sum of all those squared errors across the whole data set so that's what the Le Square solution does so sum over all time steps U between the Oracle value and what we thought would happen all of those squ squ eror summed um and we can write that as an expectation over our empirical data set so this Ed means like an empirical distribution expectation over this Su basically over everything we've seen so far so we're trying to minimize some kind of mean squared error and the sum squared error over all the data we've seen so far so that's the least Square solution and it turns out there's a really easy way to find the least Square solution um and it uses all the methods we've seen so far and it's just called experience replay um and all it means is that now we make this data set a literal thing we actually store this data set so instead of throwing away our data we cach it we keep it in some big um we make our training set an explicitly stalled um object we have some experience replay memory and now all we do is every time step we're going to sample a state and a value from our experience so we were in state s um we're going to sample uh from this experience we're going to say okay in that state the Oracle said I should do this um sample that from our experience um and then do one stochastic gradient update towards that Target so this is just like supervisor learning again we've made a data set and now we're just going to randomly sample from that data set and update in the direction of our random samples um that standard supervised learning if you get a data set you always want to randomize over it and kind of um you don't just want to do the latest things randomize to avoid remove correlations um so this is taking our non IID data our whole trajectory and it's randomizing it's breaking up into all the little pieces so we're not presenting things in the order they arrive which are strongly correlated we're we're decorrelating things and and presenting them in some random order and just applying stochastic radient percent again and again and again and again until where do we get to we get to the least Square solution okay if we just keep doing this um this thing will find the least Square solution it will go over and over this data again doing gradient descent um and if you optimize for gradient descent for example using a linear function approximator you find the global minimum which is the least Square solution so if you go over your data you get more out of your data that's the idea it's very simple you remember everything you just keep going over your data and squeeze more from that data rather than just throwing it away after one step and moving on um so let's come back to the motivating example we saw in lecture one which is um this recent work we did with um in Atari um so in the Atari work we we we're now in a position basically to understand exactly what we did actually I mean it's the method's fairly simple um and uh and we can see that now we' we've basically got all the pieces in place okay and it just uses exactly what we've seen so far so we're going to use basically um experience replay and and basically Q learning um so it's off policy which is a a Nuance from what we've seen so far but the idea is really simple that we just remember um all of the transitions that we've seen so far so remember all of our states actions rewards next states that we see we remember them in some big replay memory um every step we take in action according to an Epsilon greedy policy with respect to our function approximator we've got a big function approximator representing Q we'll see the structure of that neural network in a minute it's just a big neural network that estimates all the Q values we use that to pick our actions by acting greedy with respect to values and then every step what we do is we sample some mini batches from our data set so mini batches not everything but just a few like say 16 of these um I think actually 64 we do at a time um you take you know 64 random samples from our replay memory so we remember say you know a million transitions sample 64 random um and follow the gradient with respect to those 64 things and we optimize the mean squ error between um what our Q network is predicting our action value function and our targets and the targets are just our something we plug in instead of the Oracle and the Oracle the thing which we plug in is like a q learning Target so it's um um it's just like the Sasa targets but with a Max there okay so we're just looking at the error between thought would happen according to our Q learning according to our Q Network so this is our functional approximator um prob in the error between that and um the reward plus the maximization over the actions that we might do next so this is just like in Q we look at um the best thing which I might do at the next step we want to optimize we want to say you know the value now we want to update it towards um the max over all the things that I might do next so maximize over all the things I might do at the next step use that as a a backup so that I now think the value at this point is as good as the best thing I might have done at the next step so we make things better and better and better um and then we optimize this using stochastic gradient descent um now I said earlier that um that methods like Sasa and TD methods can blow up with neural networks however um this method is stable with neural networks that's one of the contributions of why this worked so well and previous methods just didn't and blew up um and so there are two tricks that make this stable compared to say using naive Q learning which are these highlighted points here so the first point we've already seen which is experience replay experience replay stabilizes these neural network methods because it decorrelates the trajectories so instead of seeing these highly correlated parts of the trajectory one after the other after the other we randomize the order in which they arrive and now you kind of break the correlations and you get much more stable updates and the second idea is to do basically um have a second Network um so we actually keep two different Q networks um and with two different parameter vectors we basically freeze the old Network so we remember our old Network we freeze that for a while and we use the targets basically we we bootstrap towards our Frozen targets we don't bootstrap towards the latest freshest targets we bootstrap towards our targets from some steps ago and that also stabilizes the process it basically makes it just like supervised learning for the course of one iteration it's like for a while we move towards the Q values with respect to our old Network then after we spent you know a few thousand steps updating towards that old Network we switch it around and we we make our old Network equal to our new network and we and reiterate that procedure so we never move we never bootstrap directly towards the thing which we're updating at that moment because that can be unstable there's a lot of correlation between your targets and your Q values at that moment and with neural networks that again can cause things to spiral out of control okay um I'm sure there'll be questions right now so someone ask good sorry I just don't good okay I I realized I was rushing that so let me just spend a moment to say it more clearly so we have two parameter vectors we have um any iteration I okay what this is saying is we're going to optimize the loss function for that iteration um and what we're going to do is we're going to update our our latest parameters wi um what we're going to do is we're going to remember some old parameters so we're going to use some old parameters from say a thousand steps ago the old parameters going to be used to generate our Target values so we're going to bootstrap the Oracle now is basically going to the thing that we substitute for the Oracle um basically makes use of some old parameters we don't use our freshest most recent parameters um in our in our Target that we're plugging in in place of the Oracle we use some fixed old values and the reason is that we want to fix this because it gives us a more stable update we don't want the um the danger is that um with TD learning is that every time you update um your Q values you're also updating your target values like if you if this thing Network and this network are the same every time you update your parameters a little bit you're changing the targets to which you're you're moving them um and with nonlinear function approximation that can actually spiral out of control so all we do is we we stabilize that process by freezing this guy to be some old value and we just update a little bit towards the old Frozen values so what stage do you go back like what is theose um so it's a tune parameter where we basically have one fixed schedule for across all games where you know every few thousand um steps we move on to another iteration of and and adjust these things so you don't the frequencies you you can also D update them smoothly but um all the experiments I'll talk about would would were were like a a switching they were just kept Frozen you essentially are then doing supervised learning towards these targets it's like your targets are frozen for a while this is also called fitted Q iteration this idea you freeze your targets you update towards those targets and then you switch in your um your targets to your old targets become your new targets your old you swap your Q Network so you wait to the you don't completely wait you know you never get all the way there this is some enormous neural network but you just wait some period of time and then you swap and you find what works best in practice I wouldn't claim this is the most aesthetically satisfying idea but in practice it stabilizes everything and the experience replay is really important as well um so just to say this is the architecture that we use so it's a big convolutional neural network you don't know what a convolutional neural network is it's just something where you take your typically an image and you have some neural network with shared weights that's apply to the entire image region by region um and at the end of it we output Q values for all of our different actions of what the joystick might do everywhere um and we use the same neural network with the same hyper parameters same step size same schedule same rate at which we update things across 50 different Atari games um and these were the results so um this is from the nature paper which came out last week um and you see that um basically this is ordered by how well we do compare to a a human tester that we had in house like an expert human games player all of this point here everything to this side of it we're basically doing better than our expert human in-house gam player everything to the right we're doing worse there are some games which we do horribly at like monus's Revenge just turns out to be really hard um a lot of games where we do you know way better than human like in video pinball things which are very Twitchy um so so the algorithm is robust you know this indicates the robustness how well do those two ideas help to stabilize it um so we see experience replay makes a huge difference to the stability of the algorithm this is the performance on say five different games with and without replay and with and without this kind of idea of switching the Q networks and this shows that both of those ideas Help U and the combination of two of them makes it the most stable and effective okay so all of this was just an example of one simple idea which is stochastic gradient sent with experienced replay as a way to get squeeze more out of the data that you've seen so far I just wanted to make that um practical um but what you know are there methods where we can jump directly to the the least Square solution and the answer is yes for the special case of linear function approximation so in the remaining few minutes I just want to give you a flavor of what that looks like okay um so it's um so why should we care about this you know we have this method of experience replay It's very effective it's incremental it's anytime you kind of do as much replay as you want and you get closer and closer to the solution but sometimes it can take many iterations to really get to that least Square solution um and if we really are using linear function approximation there's um at least a you know with with more computation per iteration like you can do a little bit more in one St step and jump all the way to the correct V Square solution closed form okay um so it's like you can do your policy evaluation step um in one jump um and so it looks a bit like this so I'm just going to do it with one case using um using the Oracle again and then you'll kind of see how to plug in all of those different estimators whether it's the so we're going to consider some Oracle again V Pi T this is a Time step T what what's the Oracle say the value was and and and then you can plug in again your Monte Caro estimate or your TD Target or your Lambda return um you can plug in anything you want there um but the idea is really simple the idea is just to say you know at the minimum at the least Square solution what does it mean to be at the least Square solution well it means that we're not going to change our weights anymore like that we wouldn't choose to update our weights in any direction anymore because we've reached the the the minimum like if we were to keep going over our data again and again and again we wouldn't make any more changes because we we're happy we've reached the best solution we can get to okay so another way to say that is if we take the expectation over our data set like this sum squared error um of our updates that we would make we would want the sum squared error of all of our updates over a whole data set to be zero we don't want there to be any pressure to change this thing if this was not zero you know if this was um this was positive you know we would still want to be moving we' still choose to change our weights so that would mean we hadn't reached the the least Square solution yet um and then the rest of this is just expanding okay so this is unwrapping the sum squared um update so this is sorry I said the Su squar what I meant was the expected update expected update and the update depends on so if we unwrap the expected update and look at the update that we're making over a whole data set like how are we adjusting the weights over a whole data set we're basically saying for each sample um we've got um this is our update we' got like for each of these for each time step we multip moving a little bit in the direction of Step size times the error between um our linear function approximator here and the Oracle multiplied by the feature Vector that was our update that was the linear go back up and check if you don't believe me step size times error times the credit the gradient Vector which is just the features in the Min fation Cas and we want the sum of all of these things to be zero we want the the average gradient you know the average radient of our objective function to be zero and so we just collect these terms now we've got one term here um and another term with this x times this x here um that's just collecting terms and then we see that only one of these depends on W and so we've just got this big Matrix here which is the outer product of your features um so we can invert that to get the solution for w so with a little bit of linear algebra you can start off with this idea that you want your update across your whole data set to be zero a little bit within your algebra you can see that you can bang just by inverting a matrix you can get the right solution um now inverting a matrix we've been avoiding doing this because it's expensive um but the cost of this doesn't depend anymore on the number of states like this is depends only on the size of our feature Vector so if we've got a small number of features this is reasonable um so inverting it takes order and Cubed and better still you can use something called Morrison woodrey um trick for incremental Matrix inverse and you can do this in n s time so is possible to get more efficient update as got quadratic cost in the number of features you have so the point is you can go bam straight to the solution of your leas squares um in other words squeeze the maximum juice out of your data with your linear function approximator find the best weights um and sometimes this is better than experience replay like if you've got small number of features um but some complicated um you know history of a lot of experience this might be cheaper to do if you've got a lot of features it might actually be better to do experience replay you might get closer to the solution by doing some amount of experience replay you know maybe uh I've had a lot of experiments where you know just after five or six repeats over your experience again you already get close to the you extract most of the juice out of the data you've seen um and so you can really get an awful lot with experience replay but this is also because bam in one go you get right there but you know specific to linear function approximation okay so as I said we can plug in multical returns we can plug in TD targets you can plug in the Lambda return um and you end up exactly the same Updates this is just plugging in the same idea of making the the update equal to zero overall and then solving for that update you end up with algorithms called Le squares Monte Carlo Le squares temporal difference learning and lstd Lambda um and so how do these fair in terms of convergence well they actually have better convergence properties than the incremental ones so u in particular um so they can't be applied with nonlinear function approximators but um they do converge both on and off policy to the right answer so they have less issues compared to TD with this off policy case they bam get you the right answer doesn't matter if you're learning off policy um and then finally I just want to say that how do you actually use this in practice well you apply an algorithm called lspi lease squares policy iteration and again we're back to our familiar approach to doing control like if you really want to solve the control so far we've just talked about how to do policy evaluation we solve for the lease squares fit to our current policy but what if you want to use that to do control well now we're just going to plug this same idea in again we're going to replace our policy evaluation step by a lease squares um approach to learning the action value function so we want to find a Le squares fit to Q if we replace our with Q um and now we can basically come back to this approach here we're going to start off with some some weights for our linear combination of features act gly with respect to our q and then do our least squares fit to all of the data we've seen so far um generate um a new Q value act regly do leas squares fit to all of the data we've seen so far and iterate and so forth um if you do this in the right way it converges to the optimal value function and the optimal policy um so there's a few more details in the slides which roughly just outline a little more detail on this algorithm it's called Le squares policy iteration um and show that this also has you still have some issues I mean when you move to control this still can chatter so it's not like this solves all the convergence issues um and then I just want to show one final example which is like this random walk type example with some more complicated Dynamics so this is a problem with like 50 states um and they're just um 50 States and it's like a replica of this diagram these different Dynamics over 50 states and now we just say there's two good states that we want to get to so there's a reward of plus one if you reach State 10 or state 41 otherwise no reward um and we're going to use function approximation with some some kind of um radial basis functions some Gan spaced um around kind of generalizing across some of these states so we kind of um we don't have a separate value for each of these states we're going to generalize across them um and now we're going to apply lspi to this chain Mar and you see in just a small number of iterations so this is only seven iterations it's finding the um the optimal value function and hence the optimal policy so here you see this is basically showing the state space along this axis and so you see here here's the good state state 10 is the good State um State um 41 or whatever it was um and it very quickly picks out the shape of this so the blue and the red um represent the two different actions of going um left or going right um and one um one and so the other parameter of Dash versus solid is whether it's the True Value function or the approximate value function and you see that it very quickly figures out the perfect shape to this thing um and it's approximate it's not like a sharp Peak because the value spreads out a bit and also because we're using function approximation it very quickly gets the structure you know evaluates its starting policy ends up with a shape like this acts greedy with respect to it um ends up evaluating that new policy gets a shape that looks something like this improves its policy with respect to that um ends up with some um you know new uh value function and so forth and after just a small number of iterations it arrives at the the right answer um and hence the right policy this shows you whether it chooses to go left or right and you see that by iteration seven it's already making the right decision even earlier impacted gets the right decisions in this space where red is right and blue is left in this case um so that's it all done um next time we'll talk about policy based methods for scaling up using function approximation um and see you next time
Original Description
#Reinforcement Learning Course by David Silver# Lecture 6: Value Function Approximation
#Slides and more info about the course: http://goo.gl/vUiyjq
Watch on YouTube ↗
(saves to browser)
Sign in to unlock AI tutor explanation · ⚡30
Playlist
Uploads from Google DeepMind · Google DeepMind · 5 of 60
1
2
3
4
▶
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
RL Course by David Silver - Lecture 8: Integrating Learning and Planning
Google DeepMind
RL Course by David Silver - Lecture 1: Introduction to Reinforcement Learning
Google DeepMind
RL Course by David Silver - Lecture 2: Markov Decision Process
Google DeepMind
RL Course by David Silver - Lecture 5: Model Free Control
Google DeepMind
RL Course by David Silver - Lecture 6: Value Function Approximation
Google DeepMind
RL Course by David Silver - Lecture 4: Model-Free Prediction
Google DeepMind
RL Course by David Silver - Lecture 3: Planning by Dynamic Programming
Google DeepMind
RL Course by David Silver - Lecture 10: Classic Games
Google DeepMind
RL Course by David Silver - Lecture 7: Policy Gradient Methods
Google DeepMind
Google DeepMind: Ground-breaking AlphaGo masters the game of Go
Google DeepMind
Match 1 - Google DeepMind Challenge Match: Lee Sedol vs AlphaGo
Google DeepMind
Match 2 - Google DeepMind Challenge Match: Lee Sedol vs AlphaGo
Google DeepMind
Match 1 15 min Summary - Google DeepMind Challenge Match
Google DeepMind
Match 3 - Google DeepMind Challenge Match: Lee Sedol vs AlphaGo
Google DeepMind
Match 2 15 Minute Summary - Google DeepMind Challenge Match 2016
Google DeepMind
Match 3 15 Minute Summary - Google DeepMind Challenge Match 2016
Google DeepMind
Match 4 - Google DeepMind Challenge Match: Lee Sedol vs AlphaGo
Google DeepMind
Match 4 15 Minute Summary - Google DeepMind Challenge Match 2016
Google DeepMind
Match 5 - Google DeepMind Challenge Match: Lee Sedol vs AlphaGo
Google DeepMind
Match 5 15 Minute Summary - Google DeepMind Challenge Match 2016
Google DeepMind
DQN SPACE INVADERS
Google DeepMind
DQN Breakout
Google DeepMind
Asynchronous Methods for Deep Reinforcement Learning: Labyrinth
Google DeepMind
Asynchronous Methods for Deep Reinforcement Learning: MuJoCo
Google DeepMind
Asynchronous Methods for Deep Reinforcement Learning: TORCS
Google DeepMind
Differentiable neural computer family tree inference task
Google DeepMind
StarCraft II DeepMind feature layer API
Google DeepMind
DeepMind Health – Partnership with the Royal Free London NHS Foundation Trust
Google DeepMind
DeepMind Health – Michael Wise – a patient's journey
Google DeepMind
Streams – a platform for a digital NHS
Google DeepMind
DeepMind Lab - Nav Maze Level 1
Google DeepMind
DeepMind Lab - Stairway to Melon Level
Google DeepMind
DeepMind Lab - Laser Tag Space Bounce Level (Hard)
Google DeepMind
Exploring the mysteries of Go with AlphaGo and China's top players
Google DeepMind
Demis Hassabis on AlphaGo: its legacy and the 'Future of Go Summit' in Wuzhen, China
Google DeepMind
The Future of Go Summit: AlphaGo & Ke Jie match 1 moves analysis
Google DeepMind
The Future of Go Summit: AlphaGo & Ke Jie match 2 moves analysis
Google DeepMind
The Future of Go Summit: Pair Go moves analysis
Google DeepMind
The Future of Go Summit: AlphaGo & Ke Jie match 3 moves analysis
Google DeepMind
Emergence of Locomotion Behaviours in Rich Environments
Google DeepMind
StarCraft II 'mini games' for AI research
Google DeepMind
Trained and untrained agents play StarCraft II full 1vs1 game
Google DeepMind
DeepMind open source PySC2 toolset for Starcraft II
Google DeepMind
ICML 2017: Test of Time Award (Sylvain Gelly & David Silver)
Google DeepMind
Ke Jie and DeepMind's Go Ambassador Fan Hui review the 3rd AlphaGo vs Ke Jie game
Google DeepMind
Ke Jie and DeepMind's Go Ambassador Fan Hui review the 1st AlphaGo vs Ke Jie game
Google DeepMind
Ke Jie and DeepMind's Go Ambassador Fan Hui review the 2nd AlphaGo vs Ke Jie game
Google DeepMind
AlphaGo Zero: Discovering new knowledge
Google DeepMind
AlphaGo Zero: Starting from scratch
Google DeepMind
Defining principles for tech companies in the NHS: DeepMind Health's Collaborative Listening Summit
Google DeepMind
A systems neuroscience approach to building AGI - Demis Hassabis, Singularity Summit 2010
Google DeepMind
Retour de Rémi Munos en France et ouverture de DeepMind Paris
Google DeepMind
Grid cells - Caswell Barry, UCL
Google DeepMind
DeepMind Health Research and Moorfields Eye Hospital NHS Foundation Trust: What our research shows
Google DeepMind
DeepMind Health Research and Moorfields Eye Hospital NHS Foundation Trust: A Patient's Story
Google DeepMind
Deep Learning 3: Neural Networks Foundations
Google DeepMind
Deep Learning 5: Optimization for Machine Learning
Google DeepMind
Deep Learning 8: Unsupervised learning and generative models
Google DeepMind
Reinforcement Learning 1: Introduction to Reinforcement Learning
Google DeepMind
Deep Learning 2: Introduction to TensorFlow
Google DeepMind
More on: LLM Foundations
View skill →Related AI Lessons
⚡
⚡
⚡
⚡
Sub-10ms AI Workflows: Accelerating sim.ai with On-Device Semantic Search using Moss
Medium · Machine Learning
Stop Guessing: Guaranteed Structured Output from LLMs in Node.js
Dev.to · Hardik Mehta
Spring AI Tutorial — Your First REST Endpoint with OpenAI (2026)
Dev.to AI
Notes: Memory, Context, and Large Language Models (LLMs)
Dev.to · Vladimir Panov
🎓
Tutor Explanation
DeepCamp AI