DeepMind's AlphaGo Zero and AlphaZero | RL paper explained

Aleksa Gordić - The AI Epiphany · Beginner ·📄 Research Papers Explained ·5y ago

Key Takeaways

The video explains AlphaGo Zero and AlphaZero, agents that learned to beat human players in Go, Chess, and Shogi through self-play and reinforcement learning, using a single neural network and Monte Carlo Tree Search.
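As a rough illustration of how the two ingredients named above (one network, one tree search) fit together, the Python sketch below shows three small pieces: a PUCT-style move selection inside the search, the conversion of root visit counts into search probabilities with a temperature, and the combined value, policy and L2 loss that pulls the network toward the search output. This is a minimal sketch and not DeepMind's implementation; the function names, the `c_puct` constant, the regularization weight `c` and the toy numbers are assumptions made for illustration.

```python
import numpy as np


def puct_select(q, n, prior, c_puct=1.0):
    """Return the index of the child maximizing Q + U.

    U = c_puct * P * sqrt(sum of all child visits) / (1 + N), so rarely
    visited moves with a high prior get an exploration bonus.
    """
    q, n, prior = (np.asarray(x, dtype=float) for x in (q, n, prior))
    u = c_puct * prior * np.sqrt(n.sum()) / (1.0 + n)
    return int(np.argmax(q + u))


def search_policy(visit_counts, temperature=1.0):
    """Turn root visit counts into move probabilities pi ~ N^(1/T).

    T = 1 samples in proportion to visits; T = 0 is treated as greedy
    (all probability on the most visited move).
    """
    counts = np.asarray(visit_counts, dtype=float)
    if temperature == 0:
        pi = np.zeros_like(counts)
        pi[np.argmax(counts)] = 1.0
        return pi
    scaled = counts ** (1.0 / temperature)
    return scaled / scaled.sum()


def zero_loss(p, v, pi, z, weights, c=1e-4):
    """Loss for one example: (z - v)^2 - pi . log(p) + c * ||weights||^2.

    p, v    : policy vector and value predicted by the network
    pi, z   : search probabilities and final game outcome (+1 or -1)
    weights : flat array standing in for the network parameters
    """
    p = np.asarray(p, dtype=float)
    pi = np.asarray(pi, dtype=float)
    value_term = (z - v) ** 2
    policy_term = -np.sum(pi * np.log(p + 1e-12))  # cross-entropy
    l2_term = c * np.sum(np.asarray(weights, dtype=float) ** 2)
    return float(value_term + policy_term + l2_term)
```

With visit counts `[10, 30, 60]`, `search_policy` returns `[0.1, 0.3, 0.6]` at temperature 1 and a one-hot vector on the most visited move at temperature 0. The cross-entropy term of `zero_loss` is smallest when the network policy `p` equals the search policy `pi`, which is why training pushes the raw network output toward what the search found.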

Full Transcript

What's up! In this video I'm going to cover the AlphaGo Zero paper, "Mastering the Game of Go without Human Knowledge", and I'll also cover the AlphaZero paper. I'll tell you which modifications are needed to get this thing to AlphaZero level. Just a hint: it's a really small modification, basically reapplying the same approach to chess and shogi as well.

So what's the trick with this paper? Hopefully you already watched my previous video on AlphaGo; if you haven't, I strongly recommend you go ahead and watch it, I'll link it somewhere here. The main thing is that they are not using human expert data anymore. They say here: "Here we introduce an algorithm based solely on reinforcement learning, without human data, guidance or domain knowledge beyond game rules." Starting tabula rasa ("blank slate" in Latin), "our new program AlphaGo Zero achieved superhuman performance, winning 100-0 against the previously published, champion-defeating AlphaGo."

A small digression here, if you're not familiar with this: AlphaGo is a lineage of agents. Even though they published these two papers (or actually three, and then later MuZero), there were a couple more iterations in between, so it's kind of a spectrum. Let me explain how that goes. The previous paper, the AlphaGo paper, was actually AlphaGo Fan, because that's the model that defeated Fan Hui, the European Go champion, back in 2015. Then they had AlphaGo Lee, which defeated Lee Sedol back in 2016; that's the famous match where Lee Sedol lost 4 to 1, and that was the first time that an algorithm could defeat a grandmaster in Go. Then they built something called AlphaGo Master, which won 60 to 0 against the best Go grandmasters, in 2017 I think. And finally we're here: AlphaGo Zero, which is the best iteration so far and the only version
that's not using human expert knowledge. So going from here to here it's kind of a spectrum, where the efficiency of the algorithm improved and the amount of hard-coded data and heuristics slowly went down. Finally, Zero doesn't use human data at all and is much more efficient. So that's the paper we'll be reviewing in this video.

Let me go through the differences between this paper and the AlphaGo paper. First and foremost, it is trained solely by self-play RL. If you remember from the previous video, that means we won't have these networks anymore: the SL policy is gone, because it was trained on human expert data, and the fast rollout policy is also gone. Those were the two networks that leveraged human expert data, and we're left with these two. But let's see what else is different.

Second, it uses only the black and white stones from the board as input features. That means we're not using any additional heuristics or handcrafted features; we just use the raw board data, and we stack a couple of board positions because we need some history in order to play the game optimally. That's it: no handcrafting, no domain knowledge integrated there.

Third, it uses a single neural network rather than separate policy and value networks, so we'll somehow be combining these two into a single network.

And finally, it uses a simpler tree search that relies upon this single neural network to evaluate positions and sample moves, without performing any Monte Carlo rollouts. Again, if you remember from the AlphaGo paper, when we were doing the Monte Carlo tree search and we got to a leaf position (so this is some state), we basically did two things: one was
to pass the state into a value network, and the second was to do a rollout, basically playing the game until the end with the fast rollout policy. We would get some value, and then we would back up those two pieces of information, both the value and the rollout outcome, up the tree. We don't have this anymore, and that's the new modification they introduced, so it's much simpler. If you take a look, this thing is so much simpler in every aspect: we don't have these, we merge these two together, the MCTS algorithm is simpler, and we also achieve much better results. That's quite impressive if you ask me.

Okay, hopefully you got some intuition for the differences between AlphaGo Zero and AlphaGo, so let's continue and see what the output from the network actually is. It's a tuple: we get the policy vector and the value scalar from the network. "The vector of move probabilities p represents the probability of selecting each move a", including the pass. If you take a look at the AlphaGo implementation, I think they didn't have the pass move; the policy output was just a flattened 19 by 19 vector, so basically a 361-dimensional vector, and here we have plus one for the pass move. Additionally, we have the value part of the network combined into the same model. So it looks something like this: this is some kind of a CNN, and then we have two streams coming off the top of it. This is the policy part, and this is the value part. So this is our new model; hopefully this is clear enough.

An additional detail is that they're using ResNets instead of plain CNNs. Previously, in AlphaGo, they were using plain old convolutional neural networks; now they're using ResNets, and hopefully you're all familiar with those. Basically what ResNets
did is add this residual connection, which adds the feature maps from the previous layer on top of the processed features. They showed that by adding this small trick they can train much deeper networks, and this was one of the major milestones in deep learning history. So they are now using ResNets instead of pure CNNs.

Okay, let's slowly start digging into how the actual algorithm works. I'll be stressing the differences from AlphaGo, since many things overlap. The main difference in the training procedure is that they integrated MCTS directly into training; they're not using it just during play. So this is how the network is trained. You have a self-play game: you take your own agent, that's the network that has both the policy and the value output, and you play the game out until the end state. Now this is the trick. In this state, you pass it into the network and it outputs the probability vector and the value scalar. But instead of picking the action directly from this output, they run MCTS from this state, which is depicted here, and they pick the action from this search probability distribution, which is a much better prediction than the raw probabilities coming directly from the network. After picking the action we get into a different state, we again run MCTS, and we repeat this until the end of the game, saving a tuple at each step. The data for each time step t is stored as the state, the policy (that's this vector here, the MCTS one), and the outcome, so the end result of the game: plus one if the player won, or minus one if it lost. So this is the main trick: doing MCTS during the actual self-play. Once we have that, what we do is, in
order to train the agent, we do the following. We take those tuples we stored, and we want to make sure that the output of the raw network, the policy vector that comes directly from the network, gets as close to the MCTS policy as possible. Secondly, we regress the value function so that it is really close to the final outcome of that particular game. Here is the equation. You can see we do a simple regression, mean squared error, for the value function, and a simple cross-entropy loss here. The p vector is the probability coming from the raw network, pi is the probability coming from MCTS, and we want the probability coming from the raw network to be as close as possible to pi; that's the minimum of this term. That's when we hit the entropy limit: before that we have a non-zero KL divergence, and once p becomes equal to pi we have zero KL and we're at the minimum. That's why this term pushes p to be close to pi. Finally, the third term is just L2 regularization, and that's it. It's pretty simple, much simpler than AlphaGo.

Let's see what I have here: "the neural network parameters theta are updated to maximize the similarity of the policy vector p to the search probabilities pi, and to minimize the error between the predicted winner v and the game winner z."

As for the Monte Carlo tree search, a short recap of how it works. They're again using the PUCT algorithm, a variant of the upper confidence bound for trees algorithm. You start from the root state and you pick the actions so as to maximize this PUCT score: the Q value, the action value function, plus this U term, which has the visit count and the prior inside. By picking the highest-scoring edges they get to a leaf node. And finally, in the case of AlphaGo Zero, they don't have
the expansion threshold, which was around 40 in AlphaGo; they just expand the node every single time. So once you get to a leaf you expand it, which means you pass that state into the network, it outputs the probability vector and the value, and you back up the value up the tree and use p as the priors for the next nodes in the tree. Next time one of the threads comes here, it will pick one of these children depending, again, on the PUCT score; maybe it comes here, you again expand this one, and that's how we build up the Monte Carlo search tree. Hopefully that was already clear from the previous video. Here it's just depicted that the value is backed up the tree and they update the statistics, both the visit counts and the Monte Carlo estimates. That's it.

They say here: "Over the course of training, 4.9 million games of self-play were generated, using 1,600 simulations for each MCTS, which corresponds to approximately 0.4 s thinking time per move." That means when you're building these trees during self-play, you do 1,600 of these simulations to build the tree, and then you pick the action that maximizes the visit count in the root node. They have a temperature coefficient there, but we'll get to that a bit later.

Okay, let's continue. After this training, surprisingly, AlphaGo Zero outperformed AlphaGo Lee after just 36 hours. You can see it here: after just 36 hours it's already better than the agent that beat Lee Sedol back in 2016, and that's impressive. And they say here that AlphaGo Lee was actually trained over several months, so a lot of compute went into
this thing, so keep that in mind. "AlphaGo Zero used a single machine with 4 TPUs, whereas AlphaGo Lee was distributed over many machines and used 48 TPUs. AlphaGo Zero defeated AlphaGo Lee by 100 games to 0." So that's some serious performance we got from this pure self-play reinforcement learning approach, and that's amazing.

Again, you can see the curves here. The dotted line is AlphaGo Lee. The purple one is the same architecture as AlphaGo Zero, but trained on human expert data, and you can see it can't reach an Elo score as high as AlphaGo Zero.

Looking at this curve here, we're trying to see how good these networks, both AlphaGo Zero and this supervised learning network, are at predicting the actual moves that human experts make. You can see the supervised method is better here, and that shouldn't surprise you, because the goal for AlphaGo Zero is not to be better at predicting human expert moves; it's to be the best player, and that doesn't necessarily correlate with human play. We'll see exactly that a bit later, where AlphaGo Zero discovers certain joseki (those are corner sequences in Go) that weren't discovered by humans prior to this agent. That's awesome.

And we can see it here, looking at the MSE on professional game outcomes: AlphaGo Zero is much better at predicting the final outcome of the games than the supervised approach. That means if you are in a particular state and you pass it through the network, it will be better at predicting the final outcome of those human-played games than the SL network. This chart is actually more important than the previous one, keep that in mind.

Okay, aside from that, they tried to decouple the contribution that came from the algorithm from what came from the architecture improvement, because they are using
ResNets, if you remember. Looking at these charts here, this leftmost network is the dual ResNet; "dual" because we have a single architecture that outputs both the policy and the value. And here we have sep-conv, that's separate networks for the policy and for the value, which is what AlphaGo used, and it uses a plain CNN instead of a ResNet. You can see that the Elo rating goes up; everything else is pretty much the same, and we get a huge boost just from the architecture. These are the other combinations: separate ResNets, and the dual architecture but with CNNs. Similarly, looking at the prediction accuracy on human moves, this one is actually a bit worse, but looking at the MSE on game outcomes, so just regressing the outcome, this one is the best. In the end this architecture gave them the best results, so that's what they're using.

One more detail. They say here: "This is partly due to improved computational efficiency, but more importantly the dual objective regularizes the network to a common representation that supports multiple use cases." The idea is the following. You have a CNN, and this is the policy vector here and this is the value scalar. When you do a gradient update to get closer to the MCTS probability vector, you're updating these shared features, but these features are subsequently used by the value stream as well. So whatever you do, you're improving both the value and the policy; you're figuring out features that are good for both, and that helps boost the performance.

Okay, let's see some of the knowledge this network learned during training. In this chart, what you can see in the top row are some of these joseki I mentioned, and I don't know
how to play Go well enough to really understand what's happening, but if you're familiar with Go maybe this will be even more interesting for you. Here you can see the joseki that humans play, and you can see that AlphaGo Zero actually rediscovered these at some point during training: this first joseki was discovered maybe around the 10-hour mark, the second one at maybe 15 hours, and the last one around the 36-hour mark. That's super fun, and we'll later see how the frequency changed: at one point in training it was heavily using one of these joseki, and then it just ditched it because it was actually worse; it found some better patterns and started using those.

In the second row, what we see are the joseki that were the most frequent at that point of training. So if you take a look at, I don't know, maybe this one: around the 48-hour mark it was heavily using this particular pattern.

And finally, the last row shows you self-play games. We take, say, the three-hour mark, take one particular self-play game and plot it here, and you can see how the play evolved. Here it was playing really greedily, clustering all of the moves, black and white, in the same region of the board. As training progressed we came to more subtle play. Again, I don't understand Go, but after reading the paper I understand this much: here it's fighting multiple battles distributed across the board, and it's a much better player after 70 hours than at the beginning. So that was some of the knowledge that AlphaGo Zero rediscovered, and that's pretty damn impressive if you ask me.

Okay, let's continue and see what's up here. They say here: ultimately, AlphaGo Zero
preferred new joseki variants that were previously unknown. I mentioned that already, and it's pretty fascinating. Surprisingly, ladder capture sequences, which are one of the first elements of Go knowledge learned by humans, were only understood by AlphaGo Zero much later in training. That's cool, because it suggests it's not constrained in any way by human priors and knowledge: it discovered something that is obvious to humans much later, but it also discovered some things that were not obvious to humans, and that's awesome.

And they say here: "Humankind has accumulated Go knowledge from millions of games played over thousands of years, collectively distilled into patterns, proverbs and books. In the space of a few days, starting tabula rasa," so a blank slate, totally random, "AlphaGo Zero was able to rediscover much of this Go knowledge, as well as novel strategies that provide new insights into the oldest of games." They're romanticizing a bit here, but it's pretty awesome.

Looking at the charts, you can see that AlphaGo Zero beats AlphaGo Lee after just a couple of days and then becomes better than the AlphaGo Master version after 30-something days. A similar chart here: this is the AlphaGo Zero performance, then the Master, the Lee and the Fan models, and then we have some commercial Go programs, which you can see are much worse than even AlphaGo Fan, and then this open-source Go program, which is even worse. Finally, this one is the raw network performance, so without MCTS, just playing with the probabilities coming from the policy output. You get a much, much lower Elo score than with MCTS, so search is really important for playing Go well.

Okay, that was the high-level overview I wanted to show you, and now we'll get into some details. But before that, let me just show you
those joseki I was mentioning. So this is a joseki that humans play a lot; it even has a name, the five-three point press, whatever, and you can see the frequency here, and similarly for the other joseki. For some patterns, like this knight's move pincer, the frequency peaked here and then slowly decreased towards 70 hours of training, so that means it was discovering and then ditching patterns. That's cool. The second chart shows you some of the joseki that it was playing really often at one point of training. Again, here, the three-three invasion had a peak: it was playing this particular joseki a lot at one point in time, and so on and so forth. Now let's dig into the details of the actual algorithm.

I've covered this part already: this is the spectrum I was mentioning. They went from Fan to Lee to Master and finally to AlphaGo Zero, each time improving the efficiency. This one used 176 GPUs, then the Lee version I think used something around 48 TPUs, then they had AlphaGo Master, which used even less hardware, and finally AlphaGo Zero uses only four TPUs, and it's much better, as you saw, than Lee and Master and every version so far.

They say here that the primary contribution is to demonstrate that superhuman performance can be achieved without human domain knowledge; we've reiterated that a couple of times already. "To clarify this contribution, we enumerate the domain knowledge that AlphaGo Zero uses, explicitly or implicitly, either in its training procedure or its Monte Carlo tree search; these are the items of knowledge that would need to be replaced for AlphaGo Zero to learn a different (alternating Markov) game." And this is the main point: after you modify a couple of these you get to AlphaZero, and you can play chess and shogi as well as Go. The most important part here: so basically
you will wanna ditch the symmetry part. The rules of Go are invariant under rotation and reflection, and this knowledge has been used in AlphaGo Zero both by augmenting the dataset during training to include rotations and reflections of each position and, if you remember from the AlphaGo video, during search: each time you want to evaluate a position you take a random element of the dihedral group, which has eight elements, because with two reflections and four rotations you get eight combinations. Using those makes your dataset bigger and, as they showed, also improves training. We'll have to drop that part, because other games such as chess don't have that symmetry as an inherent part of the game.

Some details here. The Monte Carlo tree search parameters: some Bayesian optimization was used to figure those out.

And this is a really important part: the AlphaGo Zero self-play training pipeline consists of three main components. First, the neural network parameters are continually optimized from recent self-play data. Second, AlphaGo Zero players are continually evaluated. And third, the best-performing player so far, this alpha theta star, is used to generate new self-play data.

Let's see how that works exactly. You basically have something like a replay buffer, I'll draw it like this. You have the current best network, I'll draw it as a box, and we have a bunch of self-play games happening in parallel. So the actual framework looks like this: we have this network, that's the best agent so far, and we're taking data from the replay buffer and updating the network. Next, every thousand updates we dump a checkpoint model, and we compare the checkpoint model with the previously stored best model. So we have this thing they call alpha theta star, so that's the best agent and
an interesting fact is that this agent (let me draw it as a rectangle) provides the weights used in all of these self-play threads. When one of these self-play threads finishes and wants to start over, it pulls the best parameters and uses those network weights to play the game, storing the tuples we mentioned: the state s, the policy pi and the outcome of the game. Those tuples are stored back in this storage. So that's how the dynamics of the whole pipeline work.

I haven't finished this part, though. Once we dump the checkpoint, we want to do an evaluation. What they did is play 400 games between this checkpoint and the best model, and whichever model wins, that's the model we'll be using for all of the next self-play games. So if this thread finishes and wants to start over, it pulls the newest best parameters and starts playing. That's roughly how this thing works.

Now let's see some details. They say here again: the optimization process produces a new checkpoint every thousand training steps. If the new player wins by a margin of over 55 percent, then it becomes the best player and is subsequently used for self-play generation, and it also becomes the baseline for subsequent comparisons. I mentioned that part already. We play 400 games between the checkpoint and the best model in order to figure out whether the checkpoint is statistically significantly better than the best model so far.

Now that we have the best player, let's see how we use those weights to play these self-play games. The best current player alpha theta star, as selected by the evaluator, is used to generate data. In each iteration, alpha theta star plays 25,000 games of self
play and uses 1,600 simulations of MCTS to select each move. Let me draw it here: this is the self-play thread, and it just fetches the newest weights from alpha theta star. And they say here: "For the first 30 moves of each game, the temperature is set to 1; this selects moves proportionally to their visit count in MCTS, and ensures a diverse set of positions are encountered." That means the following. If this is the initial state of the self-play game, the initial board state, then for the first 30 steps we play using proportional sampling. If I zoom into this state, state 0 in this case: once we do the MCTS and find the probabilities, we won't use a greedy policy, we'll sample proportionally. So if we have some distribution like this one, this action will be taken most often because it has the highest probability, but the others still get picked sometimes. That's for the first 30 moves, and because we're constantly using the same model, this ensures we have some diversity in the opening positions.

From that point on we start playing greedily, because the temperature goes to zero. That means this distribution gets transformed into this: all of these become zero, and this one becomes one, so we always pick the highest-probability move.

Then they say the following: additional exploration is achieved by adding Dirichlet noise to the prior probabilities in the root node s0. So if we take a certain state here and zoom in again: before we even start building the MCTS tree, they add noise to the root priors themselves. So once we get the initial priors, which we get by passing the state into
our model, into the network, we modify these priors using Dirichlet noise. That ensures there is some randomization in how we explore the MCTS tree, and it additionally gives us a more diverse set of tuples, which are fed back into the storage, the replay buffer, which again is used by the currently best network to update its weights. And that's pretty much it.

This is something you should know from the previous video: for each edge we store the visit counts, the cumulative Monte Carlo estimates, the action value function, and the priors.

Here's PUCT, we saw this one: the more you visit some edge, the smaller U becomes, which means you rely more on the action value function to decide whether to pick that edge or not.

An additional detail: "Positions in the queue are evaluated by the neural network using a mini-batch size of 8; the search thread is locked until evaluation completes." This is a bit different from the AlphaGo model, which used an asynchronous policy. Here, on the other hand, what this means is the following. On the GPU (so this is a GPU) you have the network, and you make a queue here with eight slots. You wait for eight threads to reach a leaf node and push it here; then, in a single batch, the network produces the priors as well as the values, and only then do those threads resume. So they are kind of synchronous, and that's a detail that's new to AlphaGo Zero.

They also continue using the virtual loss. If you remember, this is the root of the MCTS tree and this is the leaf. The virtual loss acts on the statistics along this path, where these are the states and the squiggly lines are the actions:
it will just apply a virtual loss of, say, three to every edge on the path: the visit counts are temporarily increased by three and the total value is decreased by three, as if those simulations had already been played and lost. That reduces both the Q part and the U part of the PUCT score, which means all of the other threads building up the MCTS tree will be less likely to take this exact route to the leaf node. That's the whole point: because we're already exploring this path, the virtual loss discourages other threads from exploring the same one.

Finally, we're using exponentiated visit counts. So these are the visit counts and this is the temperature coefficient we were mentioning: the first 30 moves are played proportionally, and then the temperature drops to zero, which makes sure we just pick the highest-probability action.

Okay, let's summarize. The principal differences are that AlphaGo Zero does not use any rollouts; it uses a single neural network instead of separate policy and value networks; leaf nodes are always expanded, so there is no threshold, we don't have to wait until a visit count of maybe 40 or so (which was dynamically computed so that the GPUs are not starving, some optimization stuff), we just expand as soon as we encounter a leaf node; each search thread simply waits for the neural network evaluation, so it's synchronous rather than performing evaluation and backup asynchronously; and there is no tree policy. If you remember, the tree policy was used to set up priors: when we got to a leaf node, the tree policy would set some fast priors, and we would send the leaf node to the GPU, where the SL policy network would do an inference and calculate the actual priors, which would then get swapped in. So it was basically a placeholder network, and we don't have it anymore. So far fewer details; it's actually an easier algorithm, we're just using reinforcement learning, and we got better results. Okay, hopefully you got
something out of this. Now I just want to make it clear what the differences are between this model and AlphaZero. That's just a small additional step, if you ask me; I don't think anything significant happened there.

Okay, here it is. There are just a couple of differences, and I don't quite understand the first one. They say AlphaZero instead estimates and optimizes the expected outcome, taking account of draws or potentially other outcomes. This part bugs me, because in AlphaGo Zero the value function estimates a value between plus one and minus one, where plus one means this player is going to win and minus one means we're certain we're going to lose. So I'm not quite sure what they mean by that; if you know, please leave a comment down there.

The second one is obvious: the rules of chess and shogi are asymmetric, so in general symmetries cannot be assumed. That's something I already mentioned: we want to get rid of picking an element from the dihedral group, one of those eight reflection and rotation combinations. So we just ditch that, and they showed that it won't hurt the performance at all. Well, a little bit, but we can compensate with compute.

The third thing: in contrast, AlphaZero simply maintains a single neural network that is updated continually, rather than waiting for an iteration to complete. Self-play games are generated by using the latest parameters for this neural network, omitting the evaluation step and the selection of the best player. That means this time we have this one network, and instead of doing periodic checkpointing and comparing for the best model, we continuously use the same weights and update them using the replay buffer. The threads are just filling that buffer, we are just taking the data, and we are continually updating this agent. And when a new thread resumes, so
this one finishes and wants to start over again, it will just fetch the newest, the current, the only set of weights, which we are continually updating. So that's the third difference. Not a lot there.

You can see the results here. On chess, they compared against Stockfish, which was the best engine at the time, and you can see that after some time it gets better than Stockfish. Then we have shogi: in a small amount of time the model gets better than Elmo at shogi, which is affectionately known as Japanese chess. And finally, in Go, it gets better than AlphaGo Zero after some training as well.

What's interesting on these three charts, if you ask me, is this one: you can see how small the Elo gap here is. That indicates the amount of effort, the decades of research, that went into chess. Chess was considered the drosophila of AI because so many researchers spent so much time on it, and so many good heuristics and so much handcrafting went into creating the Stockfish engine that it was kind of hard to get much better than that. So that's funny.

The only thing that's actually not so game-agnostic is this: they are still adding noise to the prior policy to ensure exploration, the Dirichlet noise I mentioned for AlphaGo Zero, and it's scaled in proportion to the typical number of legal moves for that game type. So that's something that's game-specific. Other than that, this is a pretty generic algorithm.

A thing worth noticing here is that you have to train a separate instance, a separate agent, for every specific game. That means this still doesn't generalize as much as we'd like it to. Optimally, you would train it for Go and then just fine-tune it for chess, and it would be really good, but that's not the case: you have to train it from scratch, pretty much, to the best of my knowledge. One thing
worth noticing as well is that AlphaZero searches just 80,000 positions per second in chess and 40,000 in shogi, compared to 70 million for Stockfish and 35 million for Elmo. That shows that this approach is much less brute force, and those two, Elmo and Stockfish, are much more similar to Deep Blue, which used a brute-force algorithm to beat Garry Kasparov back in '97. They say it's arguably a more human-like approach to search and to playing the games, and I agree.

So that's pretty much everything you need to know from the AlphaZero paper once you understand AlphaGo and AlphaGo Zero: ditch the symmetries, keep continually updating the agent, do some small adaptations for the specific rules of each game, and apply the same algorithm. As you can see, after some training time you can achieve state of the art on all three benchmarks.

That was it for this video. If you have any feedback whatsoever on things I could improve, please feel free to comment down in the comment section and I'll read those. You know the drill: hit that subscribe button, hit the bell icon to get notified, and until next time, keep learning deep.
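A minimal Python sketch of the PUCT selection rule and the root Dirichlet noise discussed in the transcript. It is an illustration under stated assumptions, not DeepMind's code: the `Edge` class, the function names `puct_select` and `add_root_noise`, and the default `c_puct=1.5` are hypothetical, while `epsilon=0.25` and `alpha=0.03` are the Go values as recalled from the AlphaGo Zero paper and should be checked against it.

```python
import math
import random
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Edge:
    """Statistics stored on one (state, action) edge of the search tree."""
    prior: float              # P(s, a), produced by the policy head
    visit_count: int = 0      # N(s, a)
    total_value: float = 0.0  # W(s, a), sum of backed-up values

    @property
    def mean_value(self) -> float:
        # Q(s, a) = W / N, taken as 0 for an edge that was never visited
        return self.total_value / self.visit_count if self.visit_count else 0.0


def puct_select(edges: Dict[str, Edge], c_puct: float = 1.5) -> str:
    """Return the action that maximizes Q(s, a) + U(s, a)."""
    total_visits = sum(e.visit_count for e in edges.values())

    def score(edge: Edge) -> float:
        # U shrinks as this edge's own visit count grows, so Q takes over.
        # With zero total visits every score is 0 and the first action wins.
        u = c_puct * edge.prior * math.sqrt(total_visits) / (1 + edge.visit_count)
        return edge.mean_value + u

    return max(edges, key=lambda action: score(edges[action]))


def add_root_noise(edges: Dict[str, Edge], alpha: float = 0.03,
                   epsilon: float = 0.25,
                   rng: Optional[random.Random] = None) -> None:
    """Mix Dirichlet(alpha) noise into root priors: (1 - eps) * p + eps * eta."""
    rng = rng or random.Random()
    # A Dirichlet sample is a set of independent Gamma(alpha, 1) draws, normalized.
    draws = [rng.gammavariate(alpha, 1.0) for _ in edges]
    total = sum(draws)
    if total == 0.0:  # guard against underflow for very small alpha
        return
    for edge, draw in zip(edges.values(), draws):
        edge.prior = (1 - epsilon) * edge.prior + epsilon * draw / total
```

With few visits the prior-weighted U term dominates, so an unvisited edge with a decent prior gets tried early; as visit counts grow, U shrinks and the choice is driven by Q, which is the behaviour described in the transcript.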

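The temperature rule from the transcript (sample in proportion to visit counts for the first 30 moves, then pick the most visited move) might look like this in Python. The function names and the `exploration_moves` parameter are made up for the example, and the sketch assumes the root has at least one visit.

```python
import random


def visit_count_policy(visit_counts, temperature):
    """Turn root visit counts N(a) into probabilities proportional to N(a)^(1/T)."""
    if temperature <= 1e-6:
        # T -> 0: all probability mass goes to the most visited move(s).
        best = max(visit_counts)
        winners = [1.0 if n == best else 0.0 for n in visit_counts]
        total = sum(winners)
        return [w / total for w in winners]
    powered = [n ** (1.0 / temperature) for n in visit_counts]
    total = sum(powered)
    return [p / total for p in powered]


def choose_move(visit_counts, move_number, exploration_moves=30, rng=random):
    """Sample proportionally early in the game, then play the top move."""
    temperature = 1.0 if move_number < exploration_moves else 0.0
    probs = visit_count_policy(visit_counts, temperature)
    # random.choices draws one index according to the given weights.
    return rng.choices(range(len(visit_counts)), weights=probs, k=1)[0]
```

Sampling early moves keeps the self-play games diverse, while the greedy choice later in the game keeps the recorded outcome close to what the search actually prefers.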
Original Description

❤️ Become The AI Epiphany Patreon ❤️ ► https://www.patreon.com/theaiepiphany
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
In this video I cover AlphaGo Zero (and AlphaZero), an agent that learned, through pure self-play and zero human knowledge, to beat all of the best human players and algorithms in Go, Chess, and Shogi.

You'll learn about:
✔️ AlphaGo Zero (Mastering the game of go without human knowledge)
✔️ AlphaZero (A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play)
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
✅ AlphaGo Zero paper: http://augmentingcognition.com/assets/Silver2017a.pdf
✅ AlphaZero paper: https://arxiv.org/abs/1712.01815
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
⌚️ Timetable:
00:00 - AlphaGo lineage of agents
02:35 - Comparing AlphaGo Zero with AlphaGo
06:50 - High-level explanation of AlphaGo Zero inner workings
10:20 - MCTS recap
12:00 - Training details and curves
15:10 - Architecture impact
17:30 - Knowledge acquired
20:55 - Results
22:05 - Discovering joseki
23:40 - Human domain knowledge in AlphaGo Zero
25:30 - Pipeline overview
28:40 - Self-play thread explained
31:55 - Further details (PUCT recap, etc.)
35:50 - AlphaZero (what's new?)
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
💰 BECOME A PATREON OF THE AI EPIPHANY ❤️
If these videos, GitHub projects, and blogs help you, consider helping me out by supporting me on Patreon!
The AI Epiphany ► https://www.patreon.com/theaiepiphany
One-time donation: https://www.paypal.com/paypalme/theaiepiphany
Much love! ❤️
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
💡 The AI Epiphany is a channel dedicated to simplifying the field of AI using creative visualizations and in general, a stronger focus on geometrical and visual intuition, rather than the algebraic and numerical "intuition".
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
👋 CONNECT WITH ME ON SOCIAL
LinkedIn ► https://www.linkedin.com/in/aleksagordic/
Twitter ► https://twitter.com/gordic_aleksa
Instagram ► https://www.instagram.com/aiepiphany/
Facebook ► https://www.facebook.com/aiepiphany/

The video explains how AlphaGo Zero and AlphaZero achieved state-of-the-art performance in Go, chess, and shogi using self-play and reinforcement learning, and how the same algorithm is retrained from scratch for each game.

Key Takeaways
  1. Train a neural network using self-play and reinforcement learning
  2. Implement Monte Carlo Tree Search with PUCT algorithm
  3. Use ResNets instead of plain CNNs
  4. Tune search hyperparameters using Bayesian optimization
  5. Evaluate players and generate new self-play data
💡 AlphaGo Zero and AlphaZero achieve state-of-the-art performance using a single neural network and self-play, without requiring human expertise or domain knowledge.
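The difference between AlphaGo Zero's gated checkpoints and AlphaZero's continually updated network (takeaway 5 above) can be sketched as follows. This is a toy illustration: the function names are hypothetical, and the 55% win-rate threshold is the figure recalled from the AlphaGo Zero paper rather than something stated in the transcript.

```python
def should_promote(candidate_wins: int, games_played: int,
                   threshold: float = 0.55) -> bool:
    """AlphaGo Zero-style gate: promote the new checkpoint only if its
    win rate against the current best player exceeds the threshold."""
    if games_played <= 0:
        raise ValueError("games_played must be positive")
    return candidate_wins / games_played > threshold


def self_play_weights(best_weights, latest_weights, gated: bool):
    """Pick the parameters that self-play threads should load.

    gated=True  -> AlphaGo Zero: threads use the best evaluated checkpoint.
    gated=False -> AlphaZero: no evaluation step, threads use the latest weights.
    """
    return best_weights if gated else latest_weights
```

Dropping the gate removes the evaluation games from the pipeline, so every self-play thread simply picks up whatever the trainer produced most recently.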
