Dynamic Programming - Reinforcement Learning Chapter 4

Connor Shorten · Beginner ·🧠 Large Language Models ·6y ago

Key Takeaways

The video series covers Reinforcement Learning, specifically Dynamic Programming and Generalized Policy Iteration, using techniques such as In-place Dynamic Programming, Asynchronous Dynamic Programming, and Bellman Optimality Equations, with examples from the grid world and car-rental problem.

Full Transcript

[Music] thanks for watching Henry AI Labs this video will cover dynamic programming chapter 4 in an introduction to reinforcement learning by Richard Sutton and Andrew Bartow this video is a part of a series going through this book chapter by chapter explaining some of the key concepts and ideas a free pdf version of this book is linked in the description as well as a print version if you want to buy it although chapter four is titled dynamic programming the key takeaway is pictured here generalized policy iteration compared to chapter three where we used bellman optimality equations to derive a system of equations to solve for the optimal value functions the generalized policy iteration is a technique of iteratively approximating and getting closer and closer to the optimal value functions and this is much more computationally tractable especially when you have large state spaces so we'll also talk about components of dynamic programming like in-place dynamic programming to save you memory in your value function state table and then asynchronous dynamic programming where we update just a subset of the state space compared to doing an entire state space suite of the value functions which is totally impossible for things like backgammon with 10 to 20 States or chess or many of these other reinforce learning problems were interested in throughout the chapter will motivate these examples with a concrete resource allocation example imagining that we have a car rental business and we are making decisions or actions to move cars between different locations and a number of cars that each rental location represents the states of the environment before we get into the generalized policy iteration let's quickly recap how we solve for the optimal state values in the previous chapter chapter 3 so what we did in this case is we use the bellman optimality equations to look ahead to the future states and then use the reward that we achieve at the current state and then add it to the discount factor times the expectation of the next states that we can achieve so from hi we have the option of either searching and receiving this reward for searching and then we have the discount factor times the value estimate of the next state as well as if we wait and return to this state as well so in this case we're able to derive this system of equations which we can solve relatively easily because we only have two states and a maximum of three actions at the low state but when we have large state spaces like in our car rental allocation problem or chess or backgammon or many many of these reinforcement learning problems it's really not possible to solve this system of equations explicitly generalized palsy iteration is an iterative approximation algorithm which is one of the most popular classes of algorithms in computer science and machine learning so the way that this algorithm works is we randomly initialize our value function estimates of every state and start with a random policy then we proceed by evaluating the values of every state with respect to this random policy and then we update our policy by making greedy action choices respect to the value functions of this policy so the way the greedy action selection works is basically you just choose the action which takes you to the state that has the highest value estimate for that state so overall this generalized policy iteration algorithm loops itself until it converges to the optimal value function and the optimal policy and this leads to a sequence of policies and n value estimates with respect to that policy such as this having an evaluation of the policy than an improvement of the policy with respect to the value estimate and then evaluate improvement until it converges to the optimal policy and value function the generalized policy iteration algorithm is probably best understood with a concrete grid world example so in this case we have this grid world where we receive a -1 reward for every state except for these gray shaded boxes which is where we receive 0 reward so we initialize our policy to be making random up right down left decisions all with the same probability and then we initialize our value estimates by placing minus 1 everywhere but 0 on the terminal States so already we have this value function here and we can see here how we make the policy greedy with respect to the value function estimates so to see how in this state we make the decision to move left because we're moving from this minus 1 state we can either move to another minus 1 State this right is minus one state or we can go left which is the 0 state so this is what it means to make the policy greedy with respect to the current value estimates of the states so now that we have this new policy where this state is is leading you into a 0 reward we can now estimate it with the bellman equations which leaves this to being minus 1.7 rather than minus 2 of its neighbors and we repeat this until we reach the optimal policy and the optimal way of moving towards one of these shaded boxes given whichever state you're currently in now that we've seen the grid world example let's formally walk through the pseudocode of the generalized policy iteration algorithm so we initially initialized our value estimates of every state and then we have some random policy and then we loop through this evaluating the policy and then improving the policy so the way that we evaluate the policy is with our bellman equation lookup where we have the MDP dynamics of the transition probabilities and the rewards and then we use the discount factor e times the estimate of the next state that we're leading to with the action that we're taking with and the policy the policy parameterize by s here is giving you the action that the policy is making you take is mapping you from this state to this action given PI parameterize by s so then what we do is we have this Delta to stop our policy evaluation because we see here that there's an inner loop within our policy evaluation so even in here we're looping through the V of s to make it converge to some value function estimates of the states given our current policy so if this Delta meaning that the update to the value each state isn't that large then we decide that we've successfully evaluated the policy and then we return now to improving the policy by making it greedy and the way that we make it greedy is by using this Arg max a meaning that select the action which takes you to the S prime state that has the highest reward with respect to the value function estimates that were just made in step two when we're improving the policy we're selecting the action that is going to take us to the next state with the highest value estimate according to the value function estimate that we just did in step two of the generalized policy improvement algorithm so this proof and this derivation is showing that if you update your policy to PI prime for this current state then it's going to be better than the original policy PI for all states because it is the same policy in every other way and taking this greedy decision is always going to make it better and not works and this is an important property of the convergence of the GPI algorithm next we'll motivate our dynamic programming and our generalized policy iteration in the context of assigning resources and the car-rental problem so the states in this example represent a number of cars at each of the locations and the actions are these decisions to move cars from location a to location B and then there's some random variables in the environment that determine things like whether the cars are returned to the location at the end of the rental and then what the demand of the customers is for cars at say time interval like a day so this progression shows how the policy is improving in the GPI so originally the policy is just making random decisions the zero denoting that it doesn't move cars from location to location at all and then it starts to evaluate the values of you know which distribution of cars at location a and location B you have so you can see in the converged policy in the case when you have 20 cars in each location this is optimal you don't move any cars anywhere but if you're in this case where you have 20 cars in the first location and then zero at the second you would move five from the first a second which you know it only makes sense so we see this progression of the policy and I think that's the most important thing to take away from visualizing this example of seeing how this iterative policy evaluation algorithm leads to this convergent policy on the resource allocation problem so one of the interesting characteristics of this car rental problem compared to the recycling robot MDP that we've seen in the previous chapter is that we have a large state space we can either have zero cars and a zero cars and B all the way up to 20 cars in a and 20 cars of B and then we have this contour map depicting our policy decisions sort of like granular eyes in each of these steps is would be a very like tedious diagram so rather we have this contour map that shows a similar policy for different states another interesting detail of the car rental MVP problem and the example that we're using for resource allocation is that we have a different method of stochastic transitions and rewards so we saw in the recycling robot how we have the Alpha and one minus Alpha and the beta 1 minus beta of transitioning from state to state given a certain action but in this case not only do we have this stochastic transition it's parameterised in a more complex way so the way that the returns are the cars and then the demand of like how many cars are rented from each location are given by these posts on random variable distribution so it's a little more complex than just having an alpha 1 minus alpha sort of probability distribution with respect to how the Markov decision process transit transitions from state to state given a certain action additionally we have a stochastic reward variable now as well because the demand is a random variable that isn't going to be the same every single time in every single state so this is just contrasting to our recycling robot in which the transitions are much more straightforward and the reward is deterministic we have the same reward for every action in this and aside like toy problem to motivate MVPs and understand this framework will conclude discussing the generalized policy iteration algorithm with a couple of interesting characteristics about it the first thing is this interesting tug-of-war between the evaluation and improvement because every time we evaluate the policy and update our value functions then it makes the policy no longer greedy and then every time we make the policy greedy with respect to that value function the value function estimate is no longer correct another interesting characteristic is our policy evaluation loop so there's an interesting distinction in the book between policy iteration where we loop through several times until we hit this Delta threshold and then exit out at the evaluation loop compared to something called value iteration where you only make one loop through this value update despite whether it is completely converged to be correct state estimates but the book gives you some convergence guarantees for why value iteration will still work in the GPI framework now we'll transition our presentation of the chapter from the generalized policy iteration algorithm to some characteristics of dynamic programming so dynamic programming is a technique in computer science frequently used for things like genetic sequence alignment or like spell checking basically what you're doing is you're bootstrapping the computation computation by using your estimate of V sub k plus 1 to estimate the state function with V sub K so the in-place dynamic programming basically refers to this idea of not using a separate array that holds the previous value function estimates at time step K so what we would do rather is we update these state values in a single sweep with our bellman equations and this might cause a little bit of noise in the updates because you know obviously as you update this state to is it might have like a higher magnitude of updating state R you can imagine what the grid world example how you traverse the grid and then how you update the values of every state is totally dependent on the previous updates another interesting extension to dynamic programming that makes it applicable to you know interesting reinforcement learning problems like chess and backgammon is this idea of a synchronous dynamic programming so in backgammon we have 10 to the 20 states so the book mode it's this problem by saying if you were to do a complete state space sweep and update the value pushed estimates of all 10 to 20 states that would take 1,000 years at a speed of 1 million State updates per second so asynchronous programming dynamic programming refers this idea of updating the value estimates of a subset of States rather than the entire set of states at every policy evaluation iteration so asynchronous dynamic programming can also be distributed across machines so we can imagine sending say this top left half of the states and the value estimates to Machine 1 this top right half order to machine - and then you know distributing state spaces in this way and then you might have sort of a noisy value function update but the book gives a more concrete example for how this will converge the chapter also presents another example of generalized policy iteration in this case you are betting your your capital on a coin flip with a probability of P sub H equals 0.4 and then you're making this policy where you're deciding how much you want to bet given each amount of capital so in this top chart you see how the value estimate of the states are converging with respect to the iterative sweeps through the GPI algorithm and then you see sort of an interesting policy where if you're at 50 you bet all of your money and if you're at like 51 you would just bet a little bit see what the the policy is the mapping from the capital state to the action being how much you bet on each coin flip so this is an interesting little policy to see and then this chart here is interesting to see a visualization of the convergence of the GPI algorithm thanks for watching this explanation of dynamic programming chapter 4 in the book an introduction to reinforce and learning by Richard Sutton and Andrew Bartow hopefully from this video you took away the idea of generalized policy iteration and characteristics of dynamic programming like in place memory tables and asynchronous updates please stay tuned for chapters 5 through 17 in this series on an introduction to reinforcement learning and please subscribe to Henry AI labs for more deep learning and artificial intelligence video

Original Description

Free PDF: http://incompleteideas.net/book/RLbook2018.pdf Print Version: https://www.amazon.com/Reinforcement-Learning-Introduction-Adaptive-Computation/dp/0262039249/ref=dp_ob_title_bk Thanks for watching this series going through the Introduction to Reinforcement Learning book! I think this is the best book for learning RL and hopefully these videos can help shed light on some of the topics as you read through it yourself! Thanks for watching! Please Subscribe!
Watch on YouTube ↗ (saves to browser)
Sign in to unlock AI tutor explanation · ⚡30

Playlist

Uploads from Connor Shorten · Connor Shorten · 0 of 60

← Previous Next →
1 DenseNets
DenseNets
Connor Shorten
2 DeepWalk Explained
DeepWalk Explained
Connor Shorten
3 Inception Network Explained
Inception Network Explained
Connor Shorten
4 StackGAN
StackGAN
Connor Shorten
5 StyleGAN
StyleGAN
Connor Shorten
6 Progressive Growing of GANs Explained
Progressive Growing of GANs Explained
Connor Shorten
7 Improved Techniques for Training GANs
Improved Techniques for Training GANs
Connor Shorten
8 Word2Vec Explained
Word2Vec Explained
Connor Shorten
9 Must Read Papers on GANs
Must Read Papers on GANs
Connor Shorten
10 Unsupervised Feature Learning
Unsupervised Feature Learning
Connor Shorten
11 Self-Supervised GANs
Self-Supervised GANs
Connor Shorten
12 Embedding Graphs with Deep Learning
Embedding Graphs with Deep Learning
Connor Shorten
13 Transfer Learning in GANs
Transfer Learning in GANs
Connor Shorten
14 ReLU Activation Function
ReLU Activation Function
Connor Shorten
15 AC-GAN Explained
AC-GAN Explained
Connor Shorten
16 SimGAN Explained
SimGAN Explained
Connor Shorten
17 DC-GAN Explained!
DC-GAN Explained!
Connor Shorten
18 ResNet Explained!
ResNet Explained!
Connor Shorten
19 Graph Convolutional Networks
Graph Convolutional Networks
Connor Shorten
20 Neural Architecture Search
Neural Architecture Search
Connor Shorten
21 Henry AI Labs
Henry AI Labs
Connor Shorten
22 Video Classification with Deep Learning
Video Classification with Deep Learning
Connor Shorten
23 BigGANs in Data Augmentation
BigGANs in Data Augmentation
Connor Shorten
24 Introduction to Deep Learning
Introduction to Deep Learning
Connor Shorten
25 EfficientNet Explained!
EfficientNet Explained!
Connor Shorten
26 Self-Attention GAN
Self-Attention GAN
Connor Shorten
27 Curriculum Learning in Deep Neural Networks
Curriculum Learning in Deep Neural Networks
Connor Shorten
28 Deep Learning Podcast #1 | Edward Dixon | Stochastic Weight Averaging
Deep Learning Podcast #1 | Edward Dixon | Stochastic Weight Averaging
Connor Shorten
29 Deep Compression
Deep Compression
Connor Shorten
30 Skin Cancer Classification with Deep Learning
Skin Cancer Classification with Deep Learning
Connor Shorten
31 Deep Learning Podcast #2 | Edward Peake | Deep Learning in Medical Imaging
Deep Learning Podcast #2 | Edward Peake | Deep Learning in Medical Imaging
Connor Shorten
32 The Lottery Ticket Hypothesis Explained!
The Lottery Ticket Hypothesis Explained!
Connor Shorten
33 SqueezeNet
SqueezeNet
Connor Shorten
34 GauGAN Explained!
GauGAN Explained!
Connor Shorten
35 AutoML with Hyperband
AutoML with Hyperband
Connor Shorten
36 DL Podcast #3 | Yannic Kilcher | Population-Based Search
DL Podcast #3 | Yannic Kilcher | Population-Based Search
Connor Shorten
37 Weakly Supervised Pretraining
Weakly Supervised Pretraining
Connor Shorten
38 Image Data Augmentation for Deep Learning
Image Data Augmentation for Deep Learning
Connor Shorten
39 Unsupervised Data Augmentation
Unsupervised Data Augmentation
Connor Shorten
40 Wide ResNet Explained!
Wide ResNet Explained!
Connor Shorten
41 RevNet: Backpropagation without Storing Activations
RevNet: Backpropagation without Storing Activations
Connor Shorten
42 GANs with Fewer Labels
GANs with Fewer Labels
Connor Shorten
43 BigBiGAN Unsupervised Learning!
BigBiGAN Unsupervised Learning!
Connor Shorten
44 Self-Supervised Learning
Self-Supervised Learning
Connor Shorten
45 Multi-Task Self-Supervised Learning
Multi-Task Self-Supervised Learning
Connor Shorten
46 Self-Supervised GANs
Self-Supervised GANs
Connor Shorten
47 Population Based Training
Population Based Training
Connor Shorten
48 Show, Attend and Tell
Show, Attend and Tell
Connor Shorten
49 Siamese Neural Networks
Siamese Neural Networks
Connor Shorten
50 WaveGAN Explained!
WaveGAN Explained!
Connor Shorten
51 VAE-GAN Explained!
VAE-GAN Explained!
Connor Shorten
52 Evolution in Neural Architecture Search!
Evolution in Neural Architecture Search!
Connor Shorten
53 AI Research Weekly Update August 18th, 2019
AI Research Weekly Update August 18th, 2019
Connor Shorten
54 Weight Agnostic Neural Networks Explained!
Weight Agnostic Neural Networks Explained!
Connor Shorten
55 AI Research Weekly Update August 25th, 2019
AI Research Weekly Update August 25th, 2019
Connor Shorten
56 Neuroevolution of Augmenting Topologies (NEAT)
Neuroevolution of Augmenting Topologies (NEAT)
Connor Shorten
57 CoDeepNEAT
CoDeepNEAT
Connor Shorten
58 AI Research Weekly Update September 1st, 2019
AI Research Weekly Update September 1st, 2019
Connor Shorten
59 Randomly Wired Neural Networks
Randomly Wired Neural Networks
Connor Shorten
60 Genetic CNN
Genetic CNN
Connor Shorten

This video series covers the basics of Reinforcement Learning, specifically Dynamic Programming and Generalized Policy Iteration, with examples and applications to real-world problems. The series provides a comprehensive introduction to the topic, including the use of In-place Dynamic Programming, Asynchronous Dynamic Programming, and Bellman Optimality Equations.

Key Takeaways
  1. Initialize policy and value estimates for the grid world
  2. Update policy and value estimates using the Bellman equation
  3. Converge to the optimal policy and value function
  4. Motivate dynamic programming and generalized policy iteration in the context of assigning resources and the car-rental problem
  5. Apply Dynamic Programming to Reinforcement Learning problems
  6. Implement Generalized Policy Iteration algorithm
💡 The Generalized Policy Iteration algorithm converges to an optimal policy in the car-rental problem, and Dynamic Programming can be used to solve complex Reinforcement Learning problems.

Related AI Lessons

Up next
5 Levels of AI Agents - From Simple LLM Calls to Multi-Agent Systems
Dave Ebbelaar (LLM Eng)
Watch →