Dynamic Programming - Reinforcement Learning Chapter 4
Key Takeaways
The video series covers Reinforcement Learning, specifically Dynamic Programming and Generalized Policy Iteration, using techniques such as In-place Dynamic Programming, Asynchronous Dynamic Programming, and Bellman Optimality Equations, with examples from the grid world and car-rental problem.
Full Transcript
[Music] thanks for watching Henry AI Labs this video will cover dynamic programming chapter 4 in an introduction to reinforcement learning by Richard Sutton and Andrew Bartow this video is a part of a series going through this book chapter by chapter explaining some of the key concepts and ideas a free pdf version of this book is linked in the description as well as a print version if you want to buy it although chapter four is titled dynamic programming the key takeaway is pictured here generalized policy iteration compared to chapter three where we used bellman optimality equations to derive a system of equations to solve for the optimal value functions the generalized policy iteration is a technique of iteratively approximating and getting closer and closer to the optimal value functions and this is much more computationally tractable especially when you have large state spaces so we'll also talk about components of dynamic programming like in-place dynamic programming to save you memory in your value function state table and then asynchronous dynamic programming where we update just a subset of the state space compared to doing an entire state space suite of the value functions which is totally impossible for things like backgammon with 10 to 20 States or chess or many of these other reinforce learning problems were interested in throughout the chapter will motivate these examples with a concrete resource allocation example imagining that we have a car rental business and we are making decisions or actions to move cars between different locations and a number of cars that each rental location represents the states of the environment before we get into the generalized policy iteration let's quickly recap how we solve for the optimal state values in the previous chapter chapter 3 so what we did in this case is we use the bellman optimality equations to look ahead to the future states and then use the reward that we achieve at the current state and then add it to the discount factor times the expectation of the next states that we can achieve so from hi we have the option of either searching and receiving this reward for searching and then we have the discount factor times the value estimate of the next state as well as if we wait and return to this state as well so in this case we're able to derive this system of equations which we can solve relatively easily because we only have two states and a maximum of three actions at the low state but when we have large state spaces like in our car rental allocation problem or chess or backgammon or many many of these reinforcement learning problems it's really not possible to solve this system of equations explicitly generalized palsy iteration is an iterative approximation algorithm which is one of the most popular classes of algorithms in computer science and machine learning so the way that this algorithm works is we randomly initialize our value function estimates of every state and start with a random policy then we proceed by evaluating the values of every state with respect to this random policy and then we update our policy by making greedy action choices respect to the value functions of this policy so the way the greedy action selection works is basically you just choose the action which takes you to the state that has the highest value estimate for that state so overall this generalized policy iteration algorithm loops itself until it converges to the optimal value function and the optimal policy and this leads to a sequence of policies and n value estimates with respect to that policy such as this having an evaluation of the policy than an improvement of the policy with respect to the value estimate and then evaluate improvement until it converges to the optimal policy and value function the generalized policy iteration algorithm is probably best understood with a concrete grid world example so in this case we have this grid world where we receive a -1 reward for every state except for these gray shaded boxes which is where we receive 0 reward so we initialize our policy to be making random up right down left decisions all with the same probability and then we initialize our value estimates by placing minus 1 everywhere but 0 on the terminal States so already we have this value function here and we can see here how we make the policy greedy with respect to the value function estimates so to see how in this state we make the decision to move left because we're moving from this minus 1 state we can either move to another minus 1 State this right is minus one state or we can go left which is the 0 state so this is what it means to make the policy greedy with respect to the current value estimates of the states so now that we have this new policy where this state is is leading you into a 0 reward we can now estimate it with the bellman equations which leaves this to being minus 1.7 rather than minus 2 of its neighbors and we repeat this until we reach the optimal policy and the optimal way of moving towards one of these shaded boxes given whichever state you're currently in now that we've seen the grid world example let's formally walk through the pseudocode of the generalized policy iteration algorithm so we initially initialized our value estimates of every state and then we have some random policy and then we loop through this evaluating the policy and then improving the policy so the way that we evaluate the policy is with our bellman equation lookup where we have the MDP dynamics of the transition probabilities and the rewards and then we use the discount factor e times the estimate of the next state that we're leading to with the action that we're taking with and the policy the policy parameterize by s here is giving you the action that the policy is making you take is mapping you from this state to this action given PI parameterize by s so then what we do is we have this Delta to stop our policy evaluation because we see here that there's an inner loop within our policy evaluation so even in here we're looping through the V of s to make it converge to some value function estimates of the states given our current policy so if this Delta meaning that the update to the value each state isn't that large then we decide that we've successfully evaluated the policy and then we return now to improving the policy by making it greedy and the way that we make it greedy is by using this Arg max a meaning that select the action which takes you to the S prime state that has the highest reward with respect to the value function estimates that were just made in step two when we're improving the policy we're selecting the action that is going to take us to the next state with the highest value estimate according to the value function estimate that we just did in step two of the generalized policy improvement algorithm so this proof and this derivation is showing that if you update your policy to PI prime for this current state then it's going to be better than the original policy PI for all states because it is the same policy in every other way and taking this greedy decision is always going to make it better and not works and this is an important property of the convergence of the GPI algorithm next we'll motivate our dynamic programming and our generalized policy iteration in the context of assigning resources and the car-rental problem so the states in this example represent a number of cars at each of the locations and the actions are these decisions to move cars from location a to location B and then there's some random variables in the environment that determine things like whether the cars are returned to the location at the end of the rental and then what the demand of the customers is for cars at say time interval like a day so this progression shows how the policy is improving in the GPI so originally the policy is just making random decisions the zero denoting that it doesn't move cars from location to location at all and then it starts to evaluate the values of you know which distribution of cars at location a and location B you have so you can see in the converged policy in the case when you have 20 cars in each location this is optimal you don't move any cars anywhere but if you're in this case where you have 20 cars in the first location and then zero at the second you would move five from the first a second which you know it only makes sense so we see this progression of the policy and I think that's the most important thing to take away from visualizing this example of seeing how this iterative policy evaluation algorithm leads to this convergent policy on the resource allocation problem so one of the interesting characteristics of this car rental problem compared to the recycling robot MDP that we've seen in the previous chapter is that we have a large state space we can either have zero cars and a zero cars and B all the way up to 20 cars in a and 20 cars of B and then we have this contour map depicting our policy decisions sort of like granular eyes in each of these steps is would be a very like tedious diagram so rather we have this contour map that shows a similar policy for different states another interesting detail of the car rental MVP problem and the example that we're using for resource allocation is that we have a different method of stochastic transitions and rewards so we saw in the recycling robot how we have the Alpha and one minus Alpha and the beta 1 minus beta of transitioning from state to state given a certain action but in this case not only do we have this stochastic transition it's parameterised in a more complex way so the way that the returns are the cars and then the demand of like how many cars are rented from each location are given by these posts on random variable distribution so it's a little more complex than just having an alpha 1 minus alpha sort of probability distribution with respect to how the Markov decision process transit transitions from state to state given a certain action additionally we have a stochastic reward variable now as well because the demand is a random variable that isn't going to be the same every single time in every single state so this is just contrasting to our recycling robot in which the transitions are much more straightforward and the reward is deterministic we have the same reward for every action in this and aside like toy problem to motivate MVPs and understand this framework will conclude discussing the generalized policy iteration algorithm with a couple of interesting characteristics about it the first thing is this interesting tug-of-war between the evaluation and improvement because every time we evaluate the policy and update our value functions then it makes the policy no longer greedy and then every time we make the policy greedy with respect to that value function the value function estimate is no longer correct another interesting characteristic is our policy evaluation loop so there's an interesting distinction in the book between policy iteration where we loop through several times until we hit this Delta threshold and then exit out at the evaluation loop compared to something called value iteration where you only make one loop through this value update despite whether it is completely converged to be correct state estimates but the book gives you some convergence guarantees for why value iteration will still work in the GPI framework now we'll transition our presentation of the chapter from the generalized policy iteration algorithm to some characteristics of dynamic programming so dynamic programming is a technique in computer science frequently used for things like genetic sequence alignment or like spell checking basically what you're doing is you're bootstrapping the computation computation by using your estimate of V sub k plus 1 to estimate the state function with V sub K so the in-place dynamic programming basically refers to this idea of not using a separate array that holds the previous value function estimates at time step K so what we would do rather is we update these state values in a single sweep with our bellman equations and this might cause a little bit of noise in the updates because you know obviously as you update this state to is it might have like a higher magnitude of updating state R you can imagine what the grid world example how you traverse the grid and then how you update the values of every state is totally dependent on the previous updates another interesting extension to dynamic programming that makes it applicable to you know interesting reinforcement learning problems like chess and backgammon is this idea of a synchronous dynamic programming so in backgammon we have 10 to the 20 states so the book mode it's this problem by saying if you were to do a complete state space sweep and update the value pushed estimates of all 10 to 20 states that would take 1,000 years at a speed of 1 million State updates per second so asynchronous programming dynamic programming refers this idea of updating the value estimates of a subset of States rather than the entire set of states at every policy evaluation iteration so asynchronous dynamic programming can also be distributed across machines so we can imagine sending say this top left half of the states and the value estimates to Machine 1 this top right half order to machine - and then you know distributing state spaces in this way and then you might have sort of a noisy value function update but the book gives a more concrete example for how this will converge the chapter also presents another example of generalized policy iteration in this case you are betting your your capital on a coin flip with a probability of P sub H equals 0.4 and then you're making this policy where you're deciding how much you want to bet given each amount of capital so in this top chart you see how the value estimate of the states are converging with respect to the iterative sweeps through the GPI algorithm and then you see sort of an interesting policy where if you're at 50 you bet all of your money and if you're at like 51 you would just bet a little bit see what the the policy is the mapping from the capital state to the action being how much you bet on each coin flip so this is an interesting little policy to see and then this chart here is interesting to see a visualization of the convergence of the GPI algorithm thanks for watching this explanation of dynamic programming chapter 4 in the book an introduction to reinforce and learning by Richard Sutton and Andrew Bartow hopefully from this video you took away the idea of generalized policy iteration and characteristics of dynamic programming like in place memory tables and asynchronous updates please stay tuned for chapters 5 through 17 in this series on an introduction to reinforcement learning and please subscribe to Henry AI labs for more deep learning and artificial intelligence video
Original Description
Free PDF: http://incompleteideas.net/book/RLbook2018.pdf
Print Version: https://www.amazon.com/Reinforcement-Learning-Introduction-Adaptive-Computation/dp/0262039249/ref=dp_ob_title_bk
Thanks for watching this series going through the Introduction to Reinforcement Learning book! I think this is the best book for learning RL and hopefully these videos can help shed light on some of the topics as you read through it yourself!
Thanks for watching! Please Subscribe!
Watch on YouTube ↗
(saves to browser)
Sign in to unlock AI tutor explanation · ⚡30
Playlist
Uploads from Connor Shorten · Connor Shorten · 0 of 60
← Previous
Next →
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
DenseNets
Connor Shorten
DeepWalk Explained
Connor Shorten
Inception Network Explained
Connor Shorten
StackGAN
Connor Shorten
StyleGAN
Connor Shorten
Progressive Growing of GANs Explained
Connor Shorten
Improved Techniques for Training GANs
Connor Shorten
Word2Vec Explained
Connor Shorten
Must Read Papers on GANs
Connor Shorten
Unsupervised Feature Learning
Connor Shorten
Self-Supervised GANs
Connor Shorten
Embedding Graphs with Deep Learning
Connor Shorten
Transfer Learning in GANs
Connor Shorten
ReLU Activation Function
Connor Shorten
AC-GAN Explained
Connor Shorten
SimGAN Explained
Connor Shorten
DC-GAN Explained!
Connor Shorten
ResNet Explained!
Connor Shorten
Graph Convolutional Networks
Connor Shorten
Neural Architecture Search
Connor Shorten
Henry AI Labs
Connor Shorten
Video Classification with Deep Learning
Connor Shorten
BigGANs in Data Augmentation
Connor Shorten
Introduction to Deep Learning
Connor Shorten
EfficientNet Explained!
Connor Shorten
Self-Attention GAN
Connor Shorten
Curriculum Learning in Deep Neural Networks
Connor Shorten
Deep Learning Podcast #1 | Edward Dixon | Stochastic Weight Averaging
Connor Shorten
Deep Compression
Connor Shorten
Skin Cancer Classification with Deep Learning
Connor Shorten
Deep Learning Podcast #2 | Edward Peake | Deep Learning in Medical Imaging
Connor Shorten
The Lottery Ticket Hypothesis Explained!
Connor Shorten
SqueezeNet
Connor Shorten
GauGAN Explained!
Connor Shorten
AutoML with Hyperband
Connor Shorten
DL Podcast #3 | Yannic Kilcher | Population-Based Search
Connor Shorten
Weakly Supervised Pretraining
Connor Shorten
Image Data Augmentation for Deep Learning
Connor Shorten
Unsupervised Data Augmentation
Connor Shorten
Wide ResNet Explained!
Connor Shorten
RevNet: Backpropagation without Storing Activations
Connor Shorten
GANs with Fewer Labels
Connor Shorten
BigBiGAN Unsupervised Learning!
Connor Shorten
Self-Supervised Learning
Connor Shorten
Multi-Task Self-Supervised Learning
Connor Shorten
Self-Supervised GANs
Connor Shorten
Population Based Training
Connor Shorten
Show, Attend and Tell
Connor Shorten
Siamese Neural Networks
Connor Shorten
WaveGAN Explained!
Connor Shorten
VAE-GAN Explained!
Connor Shorten
Evolution in Neural Architecture Search!
Connor Shorten
AI Research Weekly Update August 18th, 2019
Connor Shorten
Weight Agnostic Neural Networks Explained!
Connor Shorten
AI Research Weekly Update August 25th, 2019
Connor Shorten
Neuroevolution of Augmenting Topologies (NEAT)
Connor Shorten
CoDeepNEAT
Connor Shorten
AI Research Weekly Update September 1st, 2019
Connor Shorten
Randomly Wired Neural Networks
Connor Shorten
Genetic CNN
Connor Shorten
More on: LLM Foundations
View skill →Related AI Lessons
⚡
⚡
⚡
⚡
Debugging Benchmark: DeepSeek V4 Pro vs MiMo V2.5 Pro
Dev.to · Stanislav
How I'm re-discovering computer science with LLM revolution
Dev.to · popiol
I Asked ChatGPT to Fix My Life. It Couldn’t — Until I Changed One Thing
Medium · AI
I Asked ChatGPT to Fix My Life. It Couldn’t — Until I Changed One Thing
Medium · ChatGPT
🎓
Tutor Explanation
DeepCamp AI