DQN - Playing Atari with Deep Reinforcement Learning | RL Paper Explained

Aleksa Gordić - The AI Epiphany · Beginner ·📄 Research Papers Explained ·5y ago

Key Takeaways

The video explains the paper 'Playing Atari with Deep Reinforcement Learning' which introduced the DQN or deep Q-network, a model-free reinforcement learning algorithm that uses a CNN to learn control policies directly from high-dimensional sensory input using RL. The algorithm applies Q-learning to estimate future rewards and outperforms previous approaches on six of the seven Atari games, surpassing human experts on three of them.

Full Transcript

what's up folks in this video i'm going to cover the paper that caused the imagenet movement for the reinforcement learning field the dqn paper also known as the playing atari with the reinforcement learning paper uh by vladimir and his collaborator at deepmind and i hopefully pronounce his name right um so let's let's dig into it uh we present the first deep learning model to successfully learn control policies uh directly from high dimensional sensory input i images in this case using rl the model is a cnn train with a variant of q learning and we'll see what that is in a moment whose input is raw pixels and the output is a value function actually the action value function or the q function estimating the future rewards we apply our method to seven atari games and they don't adjust the architecture or the learning algorithm which makes this algorithm much more general than all of the previous ones we find that it outperforms all previous approaches on six of the games and even surprises surpasses a human expert on three of them so uh before i start digging into the paper i'll just give you a rough sketch so that you have a like a like a high level understanding of all of the components and then we'll start going into a lot of details as i usually do in my paper overviews so okay let's let's start so the first thing is first the the most general uh like framework of how rl problem is constructed is the following so we have an agent and you'll see this all over the place and so we when we have environment and basically the agent uh sends some action to the environment and the environment gives back some observation let's call it oh and also some reward and now the whole goal of this agent in this environment is to maximize the future expected reward and that's pretty much it now this can mean any like every single thing you can imagine can be constructed like this pretty much uh many things maybe not everything so let me just concretize this for the specific problem we are we're coping with in this paper and that's atari so in our case uh what will happen is the following so agent is actually a cnn so we have a deep convolutional neural network not so deep you'll see soon it was actually only two conv layers and um so what we do is the following so uh we interact with the environment in the following way so the environment is the atari emulator so let's just draw it like like a sunscreen and what we do is we in the case so just just a small digression uh here are the games in atari so basically um here are five examples we have pong we have something called breakout space invaders seaquest and beam rider and hopefully you're you're at least familiar with some of these uh maybe pong is the easiest example so you basically have two pedals and you have a ball here and you're trying to not to miss the ball you're trying to to to be the opponent by just uh shooting the balls of pass through the opponent and once you do that you receive the reward one uh if you if the opponent wins uh he gets reward of one and you get just zero so as you can see the problem is that the reward is very sparse you have to play for some time in order to get one and that's it we don't have as in supervised learning in every single step maybe when you do when you're doing image classification for every single image you have a label and that's how you train so here the reward is sparse it's delayed and you're not sure whether the award will come at all if you're having if you have in the beginning especially when you have a random policy you'll be getting zero rewards for a long long time so uh having that context in mind so this is the image we get in pong this is the rewards zeros or ones in pong let me get back to the thing i was trying to explain what if i can find it okay okay so basically what happens is the following in in the pawn game we'll have three actions we'll have up we'll have uh no up basically just uh don't move and we have the down action okay just three simple actions and the state will be the following thing the state will be not the single image because unfortunately the single image is ambiguous because for example if you have an image and you have a ball here you don't know where the ball is going to go in this direction where it's going to go in this direction maybe even in the opposite direction here so you need to have at least two frames to have the temporal information so that you can solve this task so because of that what they actually did is they represented the state after doing some pre-processing on images we'll see that later they just lump four frames together so like this one two three four whatever and that's a state this is a state and this state goes into the cnn and that's our agent and it outputs the actions so once we put the state into the cnn we get some numbers here and these are the action values and basically maybe if the up value was eight and this one was two and this one was maybe minus one with the best thing to do would be to pick eight because that promises the biggest expected reward in the future uh if we're following the greedy policy and so we pick eight and that's up we send that to the emulator the emulator does it thing it just update renders new screen and sends it back so it sends sends us back the new frame and some reward or maybe zero reward if nothing happens if the game still hasn't finished and what we do is the following so we treat this as a circular buffer so basically what we do is we just uh take the the last frame in this state we just uh get rid of it and we we append uh to the front of the of this queue the newest frame and that's our new state so once the model is trained and i'll skip the training part for a second let's see what happens once the model is fully trained so the following thing happens um we have a cnn we we get the first observation the first frame and we start stacking the frames until we get four frames once we have four frames we have the full state we do the uh forward prop we get the actions we send the action maybe up we send it to the emulator the emulator sends us back the reward and the observed state and we just do that in circle and we're actually following something called epsilon greedy policy and we'll see what it is in a minute so epsilon greedy um which basically means the following so instead of just taking the the max action every single time so eight in this particular case will sometimes uh with epsilon probability which is usually let's say 0.1 will take a random action so either up either down or no up with equal probability or with zero nine probability we'll be following this greedy policy where where we're actually taking the action that has the highest action value associated with it so that's the the rough uh sketch of how this looks like and this is just going to run in cycles blah blah blah until we receive the don flag from the emulator and once the down flag is over that's it we stop the game and one important thing to note here is that during this game we have something called a memory replay experience replay buffer and it's nothing else but collection of the experiences we collected during the gameplay so what happens is the following so we have uh let's let let me let me call this state s let me call the reward r uh let me call the new state so after we receive this frame and we update the all state we'll have s prime and let me call the action that we send to the uh emulator a so we'll be collecting these tuples s a r s sars okay we'll just be plugging these in into this replay buffer as we're playing so we have a state we send some action depending on that state we received some reward we received a frame we updated our state we got s prime and we just uh just kind of store this in this uh replay buffer and this one is huge like they use one million frames uh one million slots here so just keep that in mind once we have this we'll be later using these tuples to actually train the agent so so far you just saw how the agent is actually acting in the environment once it's fully trained now we can use these sars tuples to actually train it and i'll just gonna give you like a really high level overview and then we'll jump into the details so what we actually do is the following so we have a state s and we perform some action we got into state s prime and we received some reward for doing this and this is a tuple from our uh experience replay buffer okay so what happens the following so the the the q network here the cnm uh will basically um have some output for this state so we pass in the s and it will give us some key of s so some value uh for this particular action maybe this is the action like uh whatever we were using up so we get some value maybe eight and that was received some reward here and we get some uh new value here uh q s prime and the important thing to note is this is an important detail uh is this q network here is going to be some old frozen q network from some previous iteration and that's going to be used here and not the current network once we have the qs prime and r we're going to try and do a simple uh l2 loss minimization i.e so we're going to try to do we're going to try and make qs as close as possible to the reward we got plus this q in this new state s prime so that's that's the the whole goal of this training we're going to take mini batches of these sars tuples they're using 32 of these tuples and we're just going to do a simple gradient descent and do l2 uh minimization that's it so basically uh msc loss on on these uh fake fake label data the the tuples we were collecting so having said that now let's jump into all of the details and see how this actually actually works okay let's go step by step uh first things first so usually deep learning applications uh they require a large amount of hand labeled training data if you're familiar with computer vision with nlp with graph ml without whatever other subfield of machine learning or deep learning you know that we need a lot of hand labeled data on the other hand aria algorithms have they have to learn from a scalar reward as we saw like that's the score maybe in punk that's one or zero maybe in breakout game it could be for or in space invaders for every single like spaceship you kill maybe get three points whatever so it's a scalar reward and it's sparse you only get it at the end of the episode potentially it's noisy and it's too late obviously because it's sparse i guess these two are kind of correlated um okay that's the first thing another issue is that most algorithms assume that the data is independent you'll see the iid thing thrown away thrown around every single every now and then so that's basically means we have independent identically distributed data points from our data set but here in rl uh you typically encounter a sequence of highly correlated states so why is that well it's pretty simple when you have a pawn game and the ball is here in the next frame you can't expect the ball to be anywhere it won't be here or here it's going to be somewhere in this like circle and that means it's the next state is highly correlated with the last one and that's the problem in rl that's that we don't have in the computer vision usually we don't have it so furthermore the data distribution changes as the algorithm learns new behaviors which can be problematic for deploying methods that assume a fixed underlying distribution so what it means when you have when you're training your classifier on images in computer vision you basically have a data set of images and they form some static distribution which you're trying to figure out so that you can predict the class successfully here in rl when you're learning new policy you'll be because of your new behavior your new policy as we call it in rl you'll be visiting different states and that means you'll be seeing different data every now and then and that means that the distribution shifts and that's problematic and the way they solve this in this dqm paper is by using this replay buffer because it's going to have these source tuples from various various policies because in the meanwhile we're going to update the network and we're going to be storing new tuples with the updated network so we'll have data from different distributions stored in this uh replay buffer and that's one of the way ways they couple they they kind of fixed this this thing these instabilities caused by these uh assumptions being violated okay let's let's continue i already explained the games in atari we have uh just for your um kind of mental picture the the the imagery is really small it's 210 by a hundred 160 at 60 hertz so that's definitely not your 4k video so just keep that in mind the problem is was really small like otherwise the problem of learning these agents would be really hard back then in 2013 when the paper was originally published okay so important things to note notice here is the following so the network was not provided with any game specific information or hand designed visual features so basically you don't know the only thing you get for every single one of these games is the observation i the frame and you get the reward the scaler that's it so nothing else non-game specific information you don't get the hand design visual features you're using cnn to learn the future representations so no hand designing no hand crafting here and we don't have the internal state of the emulator so that means we're dealing with a partially observed mdp so that may be a mouthful so basically mdp is this formalism in rl which i won't get too deep into now so just a mark of decision process and basically it's partially observed because you don't know the state behind behind the emulator you just get the frame you don't know all the processes that are happening to actually render and prepare the frame so that means we're partially observing the system and that's one more difficulty that we have in in atari here and it's a common thing in the real world applications but yeah okay so it learned nothing uh it just learns from the video input from the reward and that's it uh and the only thing that's actually specific to every single of these games is the number of actions that's it so you have a cnn here seeing an agent and because this is punk you'll have three outputs probably so do not know up the the down and the up action in breakboard it's gonna be similar maybe in space invaders you can shoot aside from moving the pedal so maybe have i don't know like four actions so this is the only thing that's game specific the number of actions and that's kind of meaningful because even humans will have a different joystick so that that makes sense um okay so that's the only thing that's kind of game specific and the architecture so the cnn and the hyper params were actually kept constant for every single one of these atari games so that's that's that's a good step towards the uh artificial general intelligence although we are very far off because one thing you need to know here is that for every single one of these games they had a specific network trained only for that game so that means if you have seven games or 57 games you have 57 differently trained uh cnns so you can train one cnn and then deploy it as an agent in all of these environments you have to train a single cnn on a single game and use it only there and fun thing about rl like i just recently read this quote like rl is the only field the only subfield of deep learning where it's socially acceptable to train on your test data and yeah that's pretty much true because you're you're training on punk and then you're actually deploying the agent in the very same environment so just keep that in mind okay that was the initial things i wanted to explain here and now let's see let's let's go further now we're going to see some mathematical formalism of mdps and stuff so yeah let's start first things first um the task is partially observed and many emulator states are perceptually aliased as i already mentioned that the example with the punk ball they can go in any direction so it's impossible to fully understand the current situation from only the current frame and that's the reason they're stacking four frames together we need the temporal information in order to play the game okay let's see the formalism this is the quantity we want to actually maximize is it's the future discounted because of this gamma factor reward and gamma is just a mathematical formalism which uh makes you when t goes to infinity when the the number of steps here goes to infinity if we have gamma that's usually between that's always between 0 and 1 and it's usually actually 0.99 um when you have that the the this this array is going to converge to a finite number and that's desirable when you're dealing you don't want to deal with infinities right but usually you deal with uh finite episodes so this gamma doesn't matter so much you can sometimes even set it to one and nothing will break that's the first quantity the second one is the action value function and aside from this one you'll be seeing something called state value function but for now let's just focus on this one and um in optimal case uh you want to find the policy p this is what this equation is telling us so we want to find the policy p that will maximize this expression here and this is just the reward the expected reward uh from this particular state and this and taking this particular action so that's what the q value is telling you it's telling you the following if i'm in this state and i perform this specific action what's the expected reward i'll get i can expect from from this action in this state and once you get to the actual action value it's going to be something called bellman equation and you're going to see this a lot especially in the older rl algorithms where they used to use table lookups instead of neural networks so basically this is uh what the bellman equation looks like you have um for for the specific case of using something called td or temporal difference uh learning and basically the the the q will be equal the optimal value the q star will be equal to r the reward plus gamma discounted maximum q star across all the actions in the future state s prime so this is just something to wrap your head around if you've never seen this before it's going to be kind of uh look really strange but it's actually much easier when you see the implementation because this this tells us the following thing we take the cnn and we input the current state s and basically then we take another cnn and we put the state s prime and now we're going to find the max whatever the max here is we're going to sum it up with r and once we have the true action value function which we're trying to approximate during this training it's going to satisfy this equation that's what it says and don't get scared because of the expectation of operator it just tells us that we are basically uh sampling this next state from the distribution of that the emulator is providing us that's the this epsilon here but for the all uh like uh purposes you can just ignore it because this is how it's actually gonna work when you try and code it up you're just going to do this and then an expectation because you're sampling different states as you're playing the game this thing is going to be sold for itself so there is this discrepancy between equations and how it actually works the second discrepancy you can maybe notice here is that q is fed with s and a but actually as we saw cnn is only taking s and it's outputting the actions so that's just the legacy formalism that most of these papers have and once you actually see the code it's believe me it's much easier i mean it's not easier it's just the discrepancy is what makes me like kind of have this uh dissonance in my brain so i guess most of you already experienced it um okay so now the theory goes like this so this is what the optimal the true q uh q value function will will look like it's going to obey this bellman equation but initially our neural network is random so it's not going to obey this and in order to actually uh to converge to this true q value function we're going to do this something called value iteration and they converge to the optimal action value uh function so basically by just doing this thing here by just uh iterating and we'll see the exact algorithm a bit later but just stick with me for now so but just doing this we're gonna slowly uh get to the true value and after we update the q value because our policy is coupled if you remember with this q value by the epsilon greedy algorithm we're going to improve the policy automatically and that there is this theory from rl that says if you do that if you're improving q and if you're improving your policy which we're automatically doing here because we're coupled by epsilon greedy we're gonna converge to the actual optimal q value the q star uh action value uh function uh so yeah uh and all of this theory actually was first invented let's call it invented for the table lookup example where you had something like this so you basically had a matrix and along this dimension you'd have states s and along this one you'd have actions and now you can see why this is totally impractical s is going to explode in any serious real world problem as well as actions they can be continuous in some cases when in robotics for example so this is going to be huge and impractical and the second thing aside from memory problems we have something even more serious and that's once you figure out the value for this state action point how are you going to generalize to every single other point in this uh state action space you won't unless you do some interpolation but that's again computationally expensive because you have to interpolate because because of that the neural networks are really cool because they can just automatically by just uh updating the parameters you get the close uh you you also update the states around the state uh for free let's call it like that so that's something i wanted to mention uh there and now this is the thing i initially explained how we actually train this network so we have the targets and we're just trying to minimize the l2 loss so you see here we just have the just the difference and then squared we're just trying to minimize the l2 loss uh between the target and between the q and the target is as i mentioned the frozen network so this theta i minus one let me zoom in a little bit so uh we're gonna freeze in some point the the parameters and we won't gonna change it and we're going to use that network for the target network so how how it goes is the following we have the sar stomp tuple remember so we have the sars so what we're going to do is we have the q network that's our current cnn okay we input s here we know what the action is so we'll focus on that q value and that's that that's this q okay so how do we get the target well we take that frozen cnn some old cnn old like this we take we input the s prime so from the tuple again we have that information we input the s prime and now we're gonna actually maximize find the biggest value here and we're gonna add the r which we also have from the tuple and that's gonna be our target so that's gonna be y of i and once we have the value we're just gonna try and tweak the parameters here so as to approach that r plus q the the new q the q prime let's call it whatever uh so that's the whole point that's how it works you're doing l2 you're doing the mean square mean squared error when you have the mini batch and that's how you train the the parameters so now you may be puzzled with this replay buffer you may be puzzled with this frozen network some of those are actually just heuristics and there is no theoretical guarantee that it's going to converge but it does in practice and was inspired by by the actual theory that came from these um exact uh cases where we had the table lookups and not the function approximations so yeah hopefully that that makes sense um and they said here the parameters are from the previous iteration theta i minus one are held fixed when optimizing the loss function and blah blah so we won't be back propagating through this one so this is just uh uh yeah just the the parameters are frozen okay i think i've covered everything here um they just explicitly uh derive the gradient for the loss this is something that your favorite deep learning framework of choice uh does for you so you don't we don't have to cover this one so either python tensorflow will just once you specify the loss like this so the simple l2 uh the framework will find the gradients automatically using the autodev engine okay so let's see these two things first this q learning algorithm is model free that that means we're not trying to explicitly models uh the transition from s to s prime from this state to the to the next state we don't know the probability of transiting into the next state we don't care about it we're just doing model-free rl interacting with environment and learning the action value functions directly without caring about the actual probabilities and modeling these so that's why this thing is model free why is it off policy well it's off policy because it's using the epsilon greedy to collect the data into the replay buffer if remember so we are we're collecting into the replay buffer using the epsilon greedy and then the target is actually using the max policy the greedy policy and that's what that's another policy and so we're trying to learn the the the the the max q value while exploring while using the behavior policy that's how they call it the epsilon greedy and the target policy is the greedy policy that's why it's off policy so intuitively what we're trying to do here is we want to have this queue or our cnn here be able to predict what will be the expected reward plus the whatever max here is so we're going to try and learn what the this particular state what's the max value we can expect so hopefully i didn't confuse you here let me know in the comments if this was clear enough i can i can try and explain a bit a bit better next time whatever okay let's continue uh and see what else is there um so furthermore it was shown that combining model-free reinforcement learning such as q-learning with non-linear function approximators uh and using off policy learning could cause the q network to diverge so that's the theory and that's pretty problematic because you want to have stable training and uh fortunately dq and paper they didn't have the problem because of the memory replay buffer and because of the target network being frozen actually they avoided the instabilities but this problem is something known as the deadly triad in the literature so this is just an excerpt from the uh rich sutton's uh book and an introduction to rl which is uh like a must read if you're if you're doing rl really seriously at one point of time you should probably check it out i still haven't i usually start by doing high level resources first and then slowly going down to papers and only later do i read the book so that's maybe counter-intuitive but that's the best approach for me uh and he said also here so when you combine the function approximation the bootstrapping so bootstrapping being td and uh uh other than td you could be using something called mc like a monte carlo or you could be using something in between like td or lambda and you don't have to know what all of these are but it's basically uh instead of doing just uh r plus the q value here and you're bootstrapping that way you just roll out until the end of the episodes and just use the actual rewards instead of trying to use the q value to bootstrap and figure out the actual q value now so that's the bootstrapping part and finally the off policy and that's why it's called the triad try for three so basically when you have these three elements combined together you're roughly you're probably going to have instability problems but yeah uh as as dqn uh folks showed they they didn't have it because of some of the hacks hacks they used the replay buffer and the target network are the main two components that made it work okay that's that's that's that part let's go further and let's actually see the the algorithm uh and i'll try to walk you through this one so first we initialize the memory the the the replay buffer to capacity n and they had one million slots here just for for being less abstract uh they initialized the q the action value function q with random weights so just take your cnn and use whatever your favorite init method is maybe gloria or xavier whatever your initialization method is and then we do the following we trade over the episode we have m episodes we initialize the sequence so we take the initial frame from the atari environment we just do some pre-processing so that's the four frames the cropping some details you'll see up in a bit and then for every single step in that particular episode we do the following we perform an action using epsilon greedy policy so that's this we probability epsilon we select a random action otherwise we select the max we we we max it out we do the greedy approach then we send the action to the environment so that's if you're using you're usually going to be using something like openai's gym so that's going to be roughly uh correspond to dot step and you're going to pass the action and then the environment is going to return the observation the reward and the dom flag which signals whether the episode is over or not once we have that we just pre-process that frame and we we store the transition the sars tuple where they're using this five to denote the processed state so this is basically us the state so we just store the sars tuple inside of the replay buffer and then uh we're gonna sample random mini batch from d now just for your understanding they're going to do this uh i think the actual number was four so every four actions you perform for four actions and only then do you sample the the mini batch because otherwise the you need to feel a feeling remember this pre-processing function expects four frames so we have to kind of fill it up before we start doing the inference through the q network so that's why they're gonna do i think that's the reason they are going to do it every four steps and again you can see the we we just formed the target uh and that's either reward if you're in the terminal state if you're at the last frame of the game otherwise we're going to bootstrap using cube the old q function and that's the target and then we're going to do gradient descent on the l2 of these two and that's it so i think the the the mini batch size was 32 that means we'll we'll take 32 samples randomly from this buffer so just 32 random sars tuples and we're gonna form this and we're just going to average it out and let the deep learning framework do the rest for us find the gradient automatically and minimize this loss and that's the whole algorithm hopefully now it's a bit more clear and um i think i mentioned this but let me just reiterate so learning directly from consecutive samples is inefficient due to strong correlations so i mentioned the example with the ball in punk and uh being correlated with the next frame so randomizing the samples by taking random samples from the replay buffer we are breaking these correlations and we're reducing the variance of the updates so that's the important trick to make this thing stable and to make it work um i just took a snippet from the so this paper is from 2013 and later they published a paper in nature and so i just took a snippet from that paper and they have additional line here so every c steps just swap the target that's the frozen q network i was talking about so you just take the car network and every c steps which is maybe every i don't know a thousand i think these 1k uh batch updates you just swap the networks and use that one as the frozen network to predict your target labels and that's the additional detail i wanted to mention okay with that out of the way uh let's continue and dive into a bit more details about the pre-processing so what they do is they they convert the rgb to grayscale they down sample it to uh this resolution and then the crop to 84 by 84 because that was the constraint that from the network they were using and they were using if you take a look at this reference 11 that's actually alex net the famous network that uh that caused the um the uh imagenet moment and uh it had the constraint of using only square inputs that's why they cropped to a square resolution um the second thing they do is as i already mentioned is they take the last four frames of history and they stack them together to produce the input to the q function and that was the state that we were using all along okay so those were some additional details and um now one more thing uh i i kind of highlighted this 8x8 because alexnet back in the days they were using much wider kernels uh compared to uh and shallower networks compared to today's networks which are uh have kernels that are usually three by three and they are much deeper especially with the arrival of resnet that allowed us to using those skip connections to have much deeper networks going to 150 more layers and yeah that's just kind of thought it's interesting one thing they mentioned here additionally is um so the main advantage of this type of architecture is to compute the q values for all possible actions in a given state with only a single forward pass so imagine if we if we were actually uh treating the formulas formulas like exactly so you'd have to do the following thing you'd have a cnn you'd input a state and you'd input an action and then you'd get a q value for that action but then you had to do a for loop you just had to kind of iterate this in a for loop to get all of the q values you need to be storing them somewhere in summary and um so you're using additional time you'd be using additional compute and you'd be using additional memory if they haven't made it the way they had and that's just input a single state and you get all of the actions you get the q val values for all of the actions and that's how it works so that's additional engineering detail that's kind of fun um let's jump to ex to experiments finally let's see how this thing performed i already mentioned it kind of surprised the humans in some games and many previous methods so let's see details about that um additionally because some games like breakout uh or space invader maybe when you kill a spaceship in space invader you get plus three reward where in pong you get plus one so because these scales differ between the games what it did and this is again something that's atari specific so they are not completely agnostic to the setup uh they are actually clipping it to plus one or minus one uh if if we get minus seven reward that would get clipped to minus one if we get plus three that will get clipped to plus one and that's additional trick they used uh to make it more stable and to be able to use the same set of hyper params like the learning rate for every single one of these games atari games um again one more detail the epsilon was actually the epsilon greedy the epsilon was annealed linearly from one to zero one over the first million frames at a uh and fixed to uh 1.0.1 afterwards so what that means that means the following so we have epsilon we have the steps in our in our rl environment and basically it starts with one and linearly annealed to zero one and this is one million frames here and that means we start off being completely random our policy is completely random so we'll be picking whatever comes whatever state comes into the queue network we pick the action without caring which is the max one we just take one action and that's how we behave that's how we explore the environment initially and that's an important concept in rl exploration versus exploitation trade-off and here you can see usually the rl algorithm start exploring maximally to let it explore the environment and learn which states are better by random and then start kind of being more greedy being more deterministic about your actions and less stochastic and that's you you usually see policies uh like the schedules like this one and then we'll end up having 0.1 epsilon throughout the rest of the training that's what they did they also additionally use this frame skipping technique which means that um just a simple again atari specific thing uh they just pick an action for and then they just repeat the action for the next three steps uh just they're just sending the same action to the emulator and only after the fourth frame do they actually pick the new action here they say that they use k4 so they they they actually pick the action every four frames from the that comes from the emulator uh except for the space invaders where they actually notice that the lasers are invisible if they take that period so they actually had to use three which is again i think this is the only game specific detail they actually uh used so just uh yeah it's a fun fact here i'm not sure whether once they are doing those dummy actions where they're just repeating the last intelligent action where are they actually feeding those frames into the state into the circular buffer that has four slots so i'm not sure if they're using these frames to fill that up or they're only using the frames that got received from the environment when they produce an intelligent action so that's just something i'm confused about maybe if you know uh write it down in the comments i haven't checked the original implementation i did check some other gists and implementations of dqm so i'm familiar how it works there is one thing i want to mention here and this nicely summarizes the difference between research and engineering and that's that for this particular atari game they noticed this thing so first to encode a single frame we take the maximum value for each pixel color value of the frame being encoded and the previous frame this was necessary to remove flickering that is present in games where some objects appear only in even frames while other objects appear only in odd frames an artifact caused by the limited number of sprites in atari 2600 can display at one can that can display it once so uh this is totally uh not important for this research paper and for you understanding this paper but it is important to understand that there are so many details that you actually understand when you start implementing the thing so reading research is really nice and everything but um until you start looking at the actual code and try to implement it you you probably first don't understand the actual theory and secondly you don't understand all of the pain points you actually make this work in a real wor in a real world product so just something i want to mention and also what i found useful is uh for example when i read this dqm paper the first time i just found some gifts some gists and i went through the code and that made it much less abstract and clear how it actually how it works so i i kind of suggest that you aside from reading papers maybe consult some code and that's a good nice to nice trade-off to understand uh the research better and that's somewhere like half the way between you actually implementing the project and you not seeing any code whatsoever so yeah that was a small digression hopefully it was useful for some of you let me know if this was useful in the comments uh things like this or whether this just distracts you okay let's continue let's see how the agent progresses during the training basically here you can see that the rewards so here we can see the training epochs so and here we can see the average reward that the agent is receiving and we can see on the breakout game and upward trend so that means that the agent is learning better policies and better action values or qqq values so basically on the sequest we can see the same trend it's there is a lot of noise but uh still the trend is kind of upwards so yeah a better metric they used is this the average q on breakout and uh the average q on whatever game there may be and you can see that that one is uh pretty much monotonically increasing and has a lot less less noise than than these ones here and let me just explain you how they actually uh calculate that one so we collect a fixed set of states by running a random policy before training even starts and they track the average of the maximum predicted q values for these states so that's uh that looks the following basically before the training even starts before the agent starts learning they do the following so we can represent the uh the environment and the game as an mdp and i didn't explain quite what mdp is but for now just uh you can imagine it as a bunch of states and you can transition and you can receive rewards and yeah that's the formalism but basically what they did is they took a couple of uh states and those are those four frame tuples remember so they took a couple of these and they just stored them into some memory and basically later on uh while the agent is actually learning they just pass these specific values so let's call this one let's call this two they'll just feed it through the cnn and they'll find the max values the max q value for this state max q value for this state and this is going to be accumulating these like summing them up and then averaging by the number of states i they are going to keep the average of the max q values for these states and that's the metric they were using here and it proved to be much more stable and better to keep track of the learning progress of the agent um okay that's the those are the curves uh here once you once you actually train the uh the agent uh you can see interesting behavior and here you can see the following so uh here you can see the that's our ship that we're controlling that's the agent the the algorithm is controlling the ship and once the enemy ship enters the screen that's the point a and you can see that the q value increases because we are expecting reward once we kill the enemy spaceship we're gonna see an increase in score and the q value correctly anticipates that score increase and then has a spike and then you can see once the torpedo is shot and we are really close to killing the enemy ship we're at point b where we expect a huge reward and once we actually kill the ship and we enter in this state here the the value drops and it drops because it only expects the future rewards so we just received a huge reward and that's why we dipped so you can see that it's really meaningful for this specific scenario and it makes a lot of sense um that was another interesting uh thing that i wanted to mention and here is one additional detail so in addition to seeing relatively smooth improvements to predicted q during training we did not experience any divergence issues in any of the experiments so despite lacking any theoretical convergence guarantees the method is able to train the network with rl algorithm signal and grading descent in a stable manner so again small discrepancy between the theory and the engineering or the practice uh sometimes things just work and later a couple of years down the road somebody figures out why this was working or yeah but yeah sometimes a lot of times uh the engineering and the technology is uh way in front of the actual research and theoretical understanding that's my honest opinion about this um you can disagree and i'd love to hear your comments on that yeah so results are here dkn kind of destroyed every single other baseline so this is the random policy if you just play it random if you play the game at random these are the scores you get uh sarsa that's an online algorithm similar to q learning i won't get into the details contingency again very similar and you can see uh usually q learning has better results and all the dqm has much better results and all of those baselines and it's actually even better than humans in some games like here on breakout enduro and pong it's actually better than the human and that's kind of interesting and the scores here how they uh got these are so they just took some human and that guy the poor guy or girl were just playing they were just playing the game for two hours and once the two hours are done they just take the scores that they achieved they sort them and they find the middle one that's the median or the 50 50th percentile and that's the score they publish uh here and you can see that yeah in some cases dqm achieved better results so one important thing to mention here is that uh in their subsequent paper that were published in nature in 2015 they actually played on all the 57 atari games and they achieved impressive results where they were actually better than humans on 29 29 out of 57 games and that's yeah quite impressive especially back in 2015 i know we have some improvement improvements like agent 57 and some other algorithms and i'll be covering those in the subsequent videos so yeah stay tuned uh but the the interesting thing you can notice here is that all of the games they require like uh fast reflexes like boxing like uh like pong uh are much easier for our agents to learn how to play than some games like montezuma's revenge is a really famous game it's a it's tough it's a tough one you need to play to understand why i'm saying this it requires a lot more strategy than pong or breakout or those other games and that's why you can see we have actually zero score here so the algorithm cannot dqn cannot learn in this vanilla form cannot learn to play this game at all so that was just some fun fact and as i mentioned so the this cuber sequest and space invaders so those are these three games you can see that results are really poor compared to humans and the reason is that they're challenging because they they require us to find a strategy that extends over long time scales and that's something that's if you're familiar with any other field like nlp uh when you're dealing with long sequences you're gonna end up having uh problems uh catastrophic forgetting whatever the the problems that rnns had and then lstms kind of solved and transformers so yeah that's the reason they uh that you can you can basically just by analyzing the game kind of get an intuition why the algorithm is failing on specific games and that's pretty much it these are just some references i think i've covered it pretty detailed uh hopefully if you stuck with me until the end of the video i'd love to hear your thoughts uh did you like this one better than the last ones and if you have any tips on how to improve this further feel free to comment down below i'll try to to answer asap so until next time you know the drill hit that bell icon subscribe and share the good word and see you in the next video until then keep learning [Music] deep

Original Description

❤️ Become The AI Epiphany Patreon ❤️ ► https://www.patreon.com/theaiepiphany ▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬ In this video, I cover the paper that started the hype for deep RL - Playing Atari with deep RL, which introduced the DQN or deep Q-network. You'll learn about: ✔️ All of the nitty-gritty details behind the paper ▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬ ✅ Paper: https://arxiv.org/abs/1312.5602 ✅ A decent gist: https://github.com/higgsfield/RL-Adventure/blob/master/1.dqn.ipynb Note: there is an error with the target Q-function he didn't freeze it but it's good enough - until I implement something myself :P ▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬ ⌚️ Timetable: 01:05 High-level overview of the paper 07:00 Experience replay buffer 10:35 Difficulties with RL (correlations, non-stationary distributions) 13:30 DQN is very general 17:00 MDP formalism and optimal Q function 22:37 Function approximators 23:50 The loss function explained 28:40 The deadly triad 30:55 Algorithm walk-through 35:00 Preprocessing and architecture details 37:20 Additional details - normalizing score, schedule, etc. 42:50 Agent training metrics 47:15 Results ▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬ 💰 BECOME A PATREON OF THE AI EPIPHANY ❤️ If these videos, GitHub projects, and blogs help you, consider helping me out by supporting me on Patreon! The AI Epiphany ► https://www.patreon.com/theaiepiphany One-time donation: https://www.paypal.com/paypalme/theaiepiphany Much love! ❤️ ▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬ 💡 The AI Epiphany is a channel dedicated to simplifying the field of AI using creative visualizations and in general, a stronger focus on geometrical and visual intuition, rather than the algebraic and numerical "intuition". ▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬ 👋 CONNECT WITH ME ON SOCIAL LinkedIn ► https://www.linkedin.com/in/aleksagordic/ Twitter ► https://twitter.com/gordic_aleksa Instagram ► https://www.instagram.com/aiepiphany/ Facebook ► https://www.facebook.com/aiepiphany/ 👨‍👩‍👧‍👦 JOIN OUR DISCORD COMMUNITY: Discord ► h
Watch on YouTube ↗ (saves to browser)
Sign in to unlock AI tutor explanation · ⚡30

Playlist

Uploads from Aleksa Gordić - The AI Epiphany · Aleksa Gordić - The AI Epiphany · 34 of 60

1 Intro | Neural Style Transfer #1
Intro | Neural Style Transfer #1
Aleksa Gordić - The AI Epiphany
2 Basic Theory | Neural Style Transfer #2
Basic Theory | Neural Style Transfer #2
Aleksa Gordić - The AI Epiphany
3 Optimization method | Neural Style Transfer #3
Optimization method | Neural Style Transfer #3
Aleksa Gordić - The AI Epiphany
4 Advanced Theory | Neural Style Transfer #4
Advanced Theory | Neural Style Transfer #4
Aleksa Gordić - The AI Epiphany
5 Anyone can make deepfakes now!
Anyone can make deepfakes now!
Aleksa Gordić - The AI Epiphany
6 What is Computer Vision? | The Art of Creating Seeing Machines
What is Computer Vision? | The Art of Creating Seeing Machines
Aleksa Gordić - The AI Epiphany
7 Feed-forward method | Neural Style Transfer #5
Feed-forward method | Neural Style Transfer #5
Aleksa Gordić - The AI Epiphany
8 Alan Turing | Computing Machinery and Intelligence
Alan Turing | Computing Machinery and Intelligence
Aleksa Gordić - The AI Epiphany
9 Feed-forward method (training) | Neural Style Transfer #6
Feed-forward method (training) | Neural Style Transfer #6
Aleksa Gordić - The AI Epiphany
10 What is Google Deep Dream? (Basic Theory) | Deep Dream Series #1
What is Google Deep Dream? (Basic Theory) | Deep Dream Series #1
Aleksa Gordić - The AI Epiphany
11 Semantic Segmentation in PyTorch | Neural Style Transfer #7
Semantic Segmentation in PyTorch | Neural Style Transfer #7
Aleksa Gordić - The AI Epiphany
12 How to get started with Machine Learning
How to get started with Machine Learning
Aleksa Gordić - The AI Epiphany
13 How to learn PyTorch? (3 easy steps) | 2021
How to learn PyTorch? (3 easy steps) | 2021
Aleksa Gordić - The AI Epiphany
14 PyTorch or TensorFlow?
PyTorch or TensorFlow?
Aleksa Gordić - The AI Epiphany
15 3 Machine Learning Projects For Beginners (Highly visual) | 2021
3 Machine Learning Projects For Beginners (Highly visual) | 2021
Aleksa Gordić - The AI Epiphany
16 Machine Learning Projects (Intermediate level) | 2021
Machine Learning Projects (Intermediate level) | 2021
Aleksa Gordić - The AI Epiphany
17 Cheapest (0$) Deep Learning Hardware Options | 2021
Cheapest (0$) Deep Learning Hardware Options | 2021
Aleksa Gordić - The AI Epiphany
18 How to learn deep learning? (Transformers Example)
How to learn deep learning? (Transformers Example)
Aleksa Gordić - The AI Epiphany
19 How do transformers work? (Attention is all you need)
How do transformers work? (Attention is all you need)
Aleksa Gordić - The AI Epiphany
20 Developing a deep learning project (case study on transformer)
Developing a deep learning project (case study on transformer)
Aleksa Gordić - The AI Epiphany
21 Vision Transformer (ViT) - An image is worth 16x16 words | Paper Explained
Vision Transformer (ViT) - An image is worth 16x16 words | Paper Explained
Aleksa Gordić - The AI Epiphany
22 GPT-3 - Language Models are Few-Shot Learners | Paper Explained
GPT-3 - Language Models are Few-Shot Learners | Paper Explained
Aleksa Gordić - The AI Epiphany
23 Google DeepMind's AlphaFold 2 explained! (Protein folding, AlphaFold 1, a glimpse into AlphaFold 2)
Google DeepMind's AlphaFold 2 explained! (Protein folding, AlphaFold 1, a glimpse into AlphaFold 2)
Aleksa Gordić - The AI Epiphany
24 Attention Is All You Need (Transformer) | Paper Explained
Attention Is All You Need (Transformer) | Paper Explained
Aleksa Gordić - The AI Epiphany
25 Graph Attention Networks (GAT) | GNN Paper Explained
Graph Attention Networks (GAT) | GNN Paper Explained
Aleksa Gordić - The AI Epiphany
26 Graph Convolutional Networks (GCN) | GNN Paper Explained
Graph Convolutional Networks (GCN) | GNN Paper Explained
Aleksa Gordić - The AI Epiphany
27 Graph SAGE - Inductive Representation Learning on Large Graphs | GNN Paper Explained
Graph SAGE - Inductive Representation Learning on Large Graphs | GNN Paper Explained
Aleksa Gordić - The AI Epiphany
28 PinSage - Graph Convolutional Neural Networks for Web-Scale Recommender Systems | Paper Explained
PinSage - Graph Convolutional Neural Networks for Web-Scale Recommender Systems | Paper Explained
Aleksa Gordić - The AI Epiphany
29 OpenAI CLIP - Connecting Text and Images | Paper Explained
OpenAI CLIP - Connecting Text and Images | Paper Explained
Aleksa Gordić - The AI Epiphany
30 Temporal Graph Networks (TGN) | GNN Paper Explained
Temporal Graph Networks (TGN) | GNN Paper Explained
Aleksa Gordić - The AI Epiphany
31 Graph Neural Network Project Update! (I'm coding GAT from scratch)
Graph Neural Network Project Update! (I'm coding GAT from scratch)
Aleksa Gordić - The AI Epiphany
32 Graph Attention Network Project Walkthrough
Graph Attention Network Project Walkthrough
Aleksa Gordić - The AI Epiphany
33 How to get started with Graph ML? (Blog walkthrough)
How to get started with Graph ML? (Blog walkthrough)
Aleksa Gordić - The AI Epiphany
DQN - Playing Atari with Deep Reinforcement Learning | RL Paper Explained
DQN - Playing Atari with Deep Reinforcement Learning | RL Paper Explained
Aleksa Gordić - The AI Epiphany
35 AlphaGo - Mastering the game of Go with deep neural networks and tree search | RL Paper Explained
AlphaGo - Mastering the game of Go with deep neural networks and tree search | RL Paper Explained
Aleksa Gordić - The AI Epiphany
36 DeepMind's AlphaGo Zero and AlphaZero | RL paper explained
DeepMind's AlphaGo Zero and AlphaZero | RL paper explained
Aleksa Gordić - The AI Epiphany
37 OpenAI - Solving Rubik's Cube with a Robot Hand | RL paper explained
OpenAI - Solving Rubik's Cube with a Robot Hand | RL paper explained
Aleksa Gordić - The AI Epiphany
38 MuZero - Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model | RL Paper explained
MuZero - Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model | RL Paper explained
Aleksa Gordić - The AI Epiphany
39 EfficientNetV2 - Smaller Models and Faster Training | Paper explained
EfficientNetV2 - Smaller Models and Faster Training | Paper explained
Aleksa Gordić - The AI Epiphany
40 Implementing DeepMind's DQN from scratch! | Project Update
Implementing DeepMind's DQN from scratch! | Project Update
Aleksa Gordić - The AI Epiphany
41 MLP-Mixer: An all-MLP Architecture for Vision | Paper explained
MLP-Mixer: An all-MLP Architecture for Vision | Paper explained
Aleksa Gordić - The AI Epiphany
42 DeepMind's Android RL Environment - AndroidEnv
DeepMind's Android RL Environment - AndroidEnv
Aleksa Gordić - The AI Epiphany
43 When Vision Transformers Outperform ResNets without Pretraining | Paper Explained
When Vision Transformers Outperform ResNets without Pretraining | Paper Explained
Aleksa Gordić - The AI Epiphany
44 Non-Parametric Transformers | Paper explained
Non-Parametric Transformers | Paper explained
Aleksa Gordić - The AI Epiphany
45 Chip Placement with Deep Reinforcement Learning | Paper Explained
Chip Placement with Deep Reinforcement Learning | Paper Explained
Aleksa Gordić - The AI Epiphany
46 Text Style Brush - Transfer of text aesthetics from a single example | Paper Explained
Text Style Brush - Transfer of text aesthetics from a single example | Paper Explained
Aleksa Gordić - The AI Epiphany
47 Graphormer - Do Transformers Really Perform Bad for Graph Representation? | Paper Explained
Graphormer - Do Transformers Really Perform Bad for Graph Representation? | Paper Explained
Aleksa Gordić - The AI Epiphany
48 GANs N' Roses: Stable, Controllable, Diverse Image to Image Translation | Paper Explained
GANs N' Roses: Stable, Controllable, Diverse Image to Image Translation | Paper Explained
Aleksa Gordić - The AI Epiphany
49 VQ-VAEs: Neural Discrete Representation Learning | Paper + PyTorch Code Explained
VQ-VAEs: Neural Discrete Representation Learning | Paper + PyTorch Code Explained
Aleksa Gordić - The AI Epiphany
50 VQ-GAN: Taming Transformers for High-Resolution Image Synthesis | Paper Explained
VQ-GAN: Taming Transformers for High-Resolution Image Synthesis | Paper Explained
Aleksa Gordić - The AI Epiphany
51 Multimodal Few-Shot Learning with Frozen Language Models | Paper Explained
Multimodal Few-Shot Learning with Frozen Language Models | Paper Explained
Aleksa Gordić - The AI Epiphany
52 Focal Transformer: Focal Self-attention for Local-Global Interactions in Vision Transformers
Focal Transformer: Focal Self-attention for Local-Global Interactions in Vision Transformers
Aleksa Gordić - The AI Epiphany
53 AudioCLIP: Extending CLIP to Image, Text and Audio | Paper Explained
AudioCLIP: Extending CLIP to Image, Text and Audio | Paper Explained
Aleksa Gordić - The AI Epiphany
54 RMA: Rapid Motor Adaptation for Legged Robots | Paper Explained
RMA: Rapid Motor Adaptation for Legged Robots | Paper Explained
Aleksa Gordić - The AI Epiphany
55 DALL-E: Zero-Shot Text-to-Image Generation | Paper Explained
DALL-E: Zero-Shot Text-to-Image Generation | Paper Explained
Aleksa Gordić - The AI Epiphany
56 DETR: End-to-End Object Detection with Transformers | Paper Explained
DETR: End-to-End Object Detection with Transformers | Paper Explained
Aleksa Gordić - The AI Epiphany
57 DINO: Emerging Properties in Self-Supervised Vision Transformers | Paper Explained!
DINO: Emerging Properties in Self-Supervised Vision Transformers | Paper Explained!
Aleksa Gordić - The AI Epiphany
58 DeepMind DetCon: Efficient Visual Pretraining with Contrastive Detection | Paper Explained
DeepMind DetCon: Efficient Visual Pretraining with Contrastive Detection | Paper Explained
Aleksa Gordić - The AI Epiphany
59 Do Vision Transformers See Like Convolutional Neural Networks? | Paper Explained
Do Vision Transformers See Like Convolutional Neural Networks? | Paper Explained
Aleksa Gordić - The AI Epiphany
60 Fastformer: Additive Attention Can Be All You Need | Paper Explained
Fastformer: Additive Attention Can Be All You Need | Paper Explained
Aleksa Gordić - The AI Epiphany

This video explains the DQN algorithm and its application to playing Atari games. It covers the basics of reinforcement learning, Q-learning, and deep learning, and provides a detailed explanation of the DQN paper. Viewers can learn how to implement model-free reinforcement learning and apply Q-learning to estimate future rewards.

Key Takeaways
  1. Stack four frames together to form a state
  2. Send the state to the CNN to get action values
  3. Select actions using epsilon-greedy policy
  4. Send actions to the emulator to get rewards and new states
  5. Update the experience replay buffer with experiences
  6. Train Q network with L2 loss minimization on mini-batches of sars tuples
💡 The DQN algorithm is a model-free reinforcement learning algorithm that uses a CNN to learn control policies directly from high-dimensional sensory input using RL, and applies Q-learning to estimate future rewards.

Related Reads

📰
Follow-up: The ArxivLens Protocol: Transforming Research Nois
Learn how to apply the ArxivLens Protocol to create dynamic grant-allocation pools that rebalance based on citation-impact signals, transforming research noise into actionable insights
Dev.to AI
📰
On July 1, 2026, arXiv will spin out from Cornell University, its home for the past 25 years, to become an independent nonprofit organization. Major funding support from Simons Foundation and Schmidt Sciences. Ditching the red for their website. [N]
arXiv is becoming an independent nonprofit organization after 25 years at Cornell University, backed by major funding, which will impact the future of research and academia
Reddit r/MachineLearning
📰
CS-NRRM™ Official Publications: Paper 1 and Paper 2 Are Now Available
Learn about the CS-NRRM's official publications on a 12-year longitudinal human observation archive and its significance in research and development
Medium · Data Science
📰
Found a potential mistake in an ICLR 2026 blogpost [D]
Verify a potential mistake in an ICLR 2026 blog post and learn how to effectively report errors in academic publications
Reddit r/MachineLearning

Chapters (13)

1:05 High-level overview of the paper
7:00 Experience replay buffer
10:35 Difficulties with RL (correlations, non-stationary distributions)
13:30 DQN is very general
17:00 MDP formalism and optimal Q function
22:37 Function approximators
23:50 The loss function explained
28:40 The deadly triad
30:55 Algorithm walk-through
35:00 Preprocessing and architecture details
37:20 Additional details - normalizing score, schedule, etc.
42:50 Agent training metrics
47:15 Results
Up next
How to get started With Drug Discovery using BioAI: Computational Biology ( 4K UHD Med Masterclass )
Sudarshan's Multiverse
Watch →