AlphaGo Zero
Key Takeaways
AlphaGo Zero combines self-play training, Monte Carlo tree search, and a residual neural network to reach state-of-the-art results in Go, outperforming AlphaGo Lee and AlphaGo Master. The algorithm merges the policy and value networks into one neural network and trains it with a loss that combines two differences: between the Monte Carlo tree search distribution and the network's distribution over actions, and between the actual return and the predicted return.
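For reference, the combined loss summarized above can be written out as below. This follows the form given in the AlphaGo Zero paper; the symbols are the paper's, not notation defined on this page: s is the board state, f_θ the network with weights θ, p its move probabilities, v its predicted value, π the Monte Carlo tree search distribution, z the game outcome, and c a regularization coefficient.

```latex
l = (z - v)^2 - \boldsymbol{\pi}^{\top} \log \mathbf{p} + c \lVert \theta \rVert^2,
\qquad (\mathbf{p}, v) = f_\theta(s)
```

The first term is the value error, the second is the cross-entropy between the search distribution and the network's policy, and the third is weight regularization.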
Full Transcript
Thanks for watching this explanation of AlphaGo Zero. AlphaGo Zero is different from AlphaGo because it uses less prior encoded knowledge, whether it's in the form of supervised learning from human expert moves or in the input state representation to the neural network. AlphaGo Zero also uses a single residual neural network for the policy and value network, and uses a very interesting extension to the self-play training by emulating the Monte Carlo tree search.
This video is in the series going from AlphaGo to MuZero, inspired by the Kaggle Connect X competition. This video will explain the follow-up to AlphaGo, AlphaGo Zero. AlphaGo is seeded with supervised learning of human expert moves, so you train its policy network to exactly mimic the human experts, and that is how they start off the training of AlphaGo. AlphaGo Zero starts from scratch with random weights and trains itself purely through self-play.
The self-play in AlphaGo Zero is different from AlphaGo's because it uses the Monte Carlo tree search, which is a way of getting a better action by doing simulation, where you look ahead and see what the better actions would be. AlphaGo Zero trains its policy to imitate the Monte Carlo tree search: the Monte Carlo tree search produces a better distribution over actions than the policy network originally has, and they use this output of the Monte Carlo tree search in order to train the policy network.
Also interesting in AlphaGo Zero is that they combine the policy network and the value network into one neural network. In this case, rather than just a simple convolutional neural network, they're using a residual network, so these skip connections and then batch normalization. Also in this paper they're using an input representation that consists of just the board, rather than having these handcrafted features. So it's just the board state, with the last eight board states concatenated together so it has a little bit of a sense of history, because alone the
Go board doesn't obey the Markov property, where the state is all you need to predict the future. So you have to append some of the previous positions of the board in order to make it so the current state representation can predict the future.
This list from their paper provides a quick overview of the changes from AlphaGo to AlphaGo Zero. The first is that it is trained solely by self-play reinforcement learning, whereas the AlphaGo algorithm is originally seeded by doing supervised learning with a policy network, so that it maps states to actions exactly the same as the expert moves in the KGS dataset. For this reason it's called tabula rasa, or blank slate, meaning that it's a blank-slate algorithm compared to being biased with some prior encoded knowledge.
The next change is that it only uses the black and white stones from the board as input features. AlphaGo has this first 19 by 19 feature plane that is the board state, and then it has these 47 handcrafted feature planes as well as the board. So the AlphaGo algorithm uses a lot of handcrafted features compared to AlphaGo Zero.
The next change is that it's a single neural network. There isn't a separate policy and value network; rather, it's kind of like a multi-task network. They share the same feature extraction base, and then you have separate heads for policy and value, so you have like a three-layer policy head and like a two-layer value head. It shares the same feature extraction, and then it has one head that maps states to actions and a separate head that predicts the winning probability from that state.
The fourth is that it has a simpler tree search that relies upon the single neural net to evaluate positions and sample moves, without performing any Monte Carlo rollouts. In the previous AlphaGo we had this lightweight rollout policy: when we got to a leaf node in our Monte Carlo tree search, we'd send that rollout policy all the way
down to the end of the game, and then we'd get that z sub L, whether we won or lost under the rollout policy, and use that to update the values in the Monte Carlo tree search.
In this case, what we're doing instead is incorporating look-ahead search inside the training loop. What's going to happen is that the policy network and the value network are now the same neural network, so it is going to be used in the Monte Carlo tree search, and then the Monte Carlo tree search's distribution over actions is going to be used to update the policy network. So the Monte Carlo tree search is seen as a policy improvement algorithm that you stack on top of the policy network: it produces a better distribution over actions by doing that look-ahead search. You're going to use that look-ahead search and its results to form the new dataset that's going to be used to train the policy network, and we'll get into the details of that in the video.
The input representation in AlphaGo Zero is a bit more impressive than AlphaGo's because it doesn't use any of the handcrafted features. What it does is it has these 19 by 19 feature planes which represent the status of the board, and you're concatenating the previous eight states. So say this is the current board representation, and then maybe this feature plane would be the board position four moves ago. They explain that this is because you need this history in order to make legal moves, given the way that the game of Go works.
The next interesting detail from AlphaGo Zero compared to AlphaGo is that whereas AlphaGo used a simple convolutional neural network, AlphaGo Zero is going to use a residual neural network, formulated in the way described in their paper. So it uses things like the skip connections, the batch normalization layers, and the ReLUs. They use the same shared feature extractor, which is then separated out into these two separate heads. So the policy
head has this architecture where it's mapping the state to the action, and the value head has this architecture where it's mapping the state into a scalar, the prediction of the reward given the current state. So AlphaGo Zero starts this trend that's now continued in things like the hide-and-seek agent from OpenAI, Dota 2, and all that stuff, where they're using these much more complex neural networks to map states to actions and do value prediction in these kinds of game-playing AIs.
This plot shows an ablation that assesses the contribution of the architecture, comparing the new residual neural network used for playing Go to the convolutional neural network from AlphaGo. The purple bar shows the dual ResNet. Dual versus separate in these plots means that you have one neural network that does both policy and value estimation, compared to having a separate policy network and a separate value network. You see it in the Elo rating metric, which is kind of like their version of accuracy: higher is better, and it's the ultimate metric. The purple score, the dual ResNet sharing both tasks, performs the best. You get similar performance from the separate policy and value networks that both use the residual architecture and from the convolutional network that is doing both policy and value estimation. It's kind of interesting to think that maybe the multi-task setup, doing policy and value in the same shared feature extractor, is improving upon the ability to do this. Then in the blue you have the separate convolutional network, the way that AlphaGo is constructed. So you're comparing the light blue, where you have separate policy and separate value, to AlphaGo Zero, which is the purple, having the policy and the value network together in this ResNet architecture.
These two plots are something that we'll see again in the presentation. It's using this network to go back to that
KGS human expert dataset and predict things like the moves that they're going to make in a given state. You see that in this case they're about the same with respect to predicting what move the expert is going to make given a state. And then you see that, actually, this one is a mean squared error plot, so lower is better, and it is better at predicting the value of the state in the human game dataset.
The core idea of AlphaGo Zero is that it's going to use Monte Carlo tree search as a policy improvement algorithm, in order to construct the data to improve upon the policy in the self-play training loop. The Monte Carlo tree search is kind of this heuristic tree search that's used on top of making decisions with a policy network and a value network. What you do is you start off with an initial state and then you assign these weights to each of the edges. These edges represent potential moves that you could take from a given position, say, place a black tile here or place a white tile here from this starting state. So you have this Q plus U, and if you want the details of that you can check out the video I made on AlphaGo or you could read the AlphaGo paper; it and the AlphaGo Zero paper both describe how they do this. Basically it's the value network estimate plus some weighting of the probability assigned to the action by the policy network, and that weighting shrinks with how many times you've visited the node, so that less-visited nodes get a boost and exploration is encouraged. It also has something like epsilon-greedy, where you don't always take the max: with some probability you take the max decision, and otherwise you choose randomly amongst the edges instead. What you do is you traverse down the tree until you reach a leaf node, where you then use the policy network to construct a new node, and then you back up the data from the new node through the Monte Carlo search tree, and you iterate on this. So the Monte
Carlo tree search uses look-ahead to get a better sense of the action to take at the next step: by looking all the way to, say, eight levels deep, you get a better sense of the move to take right now. That idea is key to how you train AlphaGo Zero. By using the data found by these Monte Carlo simulations, you try to put this directly into the original policy, so that it can clone the results of running a Monte Carlo simulation from that state.
So the idea is that as we're doing self-play, we're running Monte Carlo tree search simulations from each state. We run 1,600 simulations, meaning 1,600 times that the Monte Carlo tree search goes and tries to find a new leaf node to expand, and then we use this to train our original policy network. The way we do this is we take the distribution over actions that the Monte Carlo tree search decides upon and use it as learning data for the loss on our policy/value network.
So this is the loss function that's used to update the neural network. The p and the v come out of our neural network from the original state: v is the value estimate, the guess of what reward we're going to receive from the state, and p is our distribution over actions, what moves we want to make from that state. The way we're going to update this is we multiply pi transpose times log p, where pi denotes the Monte Carlo tree search distribution and p is the distribution from the neural network, so basically we're penalizing p for being much different from the Monte Carlo search's distribution over actions. And then we have the z minus v term, squared, z being the actual return that was experienced in this episode and v being the prediction from the neural network. So we combine both of these terms in the same loss function. Then you add this c times the L2 norm of the weights, which is just a weight regularization term, but these two are the heart of the
algorithm.
In the paper, the authors first report the results of training a 20-block residual neural network for three days. In these three days of training they get 4.9 million games of self-play, with each move in the self-play loop using 1,600 simulations for the Monte Carlo tree search, amounting to about half a second of thinking time per move, thinking time describing this simulation in the Monte Carlo tree search. The parameters of the neural network, this 20-block residual network with the two separate heads, are updated from 700,000 mini-batches of these tuples: the state, the policy produced by the Monte Carlo tree search, and the result that ended up happening in that full game sequence.
So then AlphaGo Zero outperforms the AlphaGo Lee from the original paper after just 36 hours of training, which is about one and a half days, and that AlphaGo Lee algorithm had been trained over several months. Then after 72 hours of training, which is three days, AlphaGo Zero defeats AlphaGo Lee a hundred to zero. Also interesting is that it's run on a single machine with four TPUs, compared to, I think it's like eight GPUs and forty CPUs or something like that, that runs the AlphaGo algorithm.
The authors of the paper then scale up AlphaGo Zero from twenty blocks to forty blocks, and this plot from their blog post, which is linked in the description of this video, shows how it improves its Elo through self-play in this training loop over time. In this scaled-up AlphaGo Zero, the authors go to a larger network, from twenty to forty residual blocks, and have forty days of training time, amounting to twenty-nine million games of self-play. Then they have 3.1 million mini-batches of 2,048 positions each, these s sub t, pi sub t, and z sub t tuples, used to update the policy and value network of AlphaGo Zero. The scaled-up AlphaGo Zero won eighty-nine to eleven versus AlphaGo Master. You frequently hear this reporting on
AlphaGo Zero that it is a hundred to zero versus the AlphaGo from the previous paper, and that is correct, but it is not a hundred to zero purely on this idea of having no prior human knowledge. AlphaGo Master is the same idea as AlphaGo Zero, and the paper describes the different versions as well, AlphaGo Fan, AlphaGo Lee, AlphaGo Master, and AlphaGo Zero, but basically AlphaGo Master still uses supervised learning on the KGS data to initialize the network, plus more details that you can read about in the paper.
I wanted to relate this idea of self-play training with the Monte Carlo tree search to knowledge distillation. It's kind of interesting how you have the Monte Carlo tree search's distribution over the policy and then you try to distill this into the original policy network, which I think is really similar to the knowledge distillation pipeline, where we have these large-capacity models distilling information into the lower-capacity student model by having it emulate some kind of label distribution. So it's definitely an interesting trend and connection across different disciplines of AI research.
Thanks for watching this explanation of AlphaGo Zero. Some of the big takeaways from AlphaGo Zero compared to AlphaGo are the use of less prior information: you don't start off with supervised learning on the human expert data, and the input state representation is just these history stacks of the board representations compared to these handcrafted features. It's also really interesting to see the policy and the value networks combined into this one residual neural network, so it's got more complexity in the neural network and is performing both tasks in one architecture. AlphaGo Zero is also really interesting for the way that they use the Monte Carlo tree search in the self-play loop in order to improve upon and train this neural network. I hope all of this was clear in the video. Thanks for watching, and please subscribe to Henry AI Labs for more deep learning and AI videos.
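The Q plus U selection rule and the step that turns search results into training data, both described in the transcript, can be sketched in a few lines of NumPy. This is a minimal illustration rather than DeepMind's implementation: the function names, the toy numbers, and the value of c_puct are assumptions made for the example, while the formulas follow the ones published in the AlphaGo Zero paper.

```python
import numpy as np

def puct_scores(q, prior, visit_counts, c_puct=1.5):
    """Q + U score for every child edge of one search node.

    q            : mean value estimate of each edge (from value-head backups)
    prior        : policy-head probability of each edge
    visit_counts : number of simulations that have passed through each edge
    c_puct       : exploration constant (illustrative value, an assumption)
    """
    total = visit_counts.sum()
    # U is large for edges with a high prior and few visits, and shrinks
    # as an edge is explored more often.
    u = c_puct * prior * np.sqrt(total) / (1.0 + visit_counts)
    return q + u

def policy_target(visit_counts, temperature=1.0):
    """Turn root visit counts into the search distribution pi."""
    scaled = visit_counts.astype(np.float64) ** (1.0 / temperature)
    return scaled / scaled.sum()

# Toy node with three legal moves (made-up numbers).
q = np.array([0.10, 0.30, -0.20])
prior = np.array([0.5, 0.3, 0.2])
visits = np.array([10, 30, 0])

# The unvisited edge wins the selection step because its U bonus dominates.
best_edge = int(np.argmax(puct_scores(q, prior, visits)))

# After the search finishes, the root visit counts become the policy target.
pi = policy_target(np.array([10, 30, 60]))
```

Here the selection step picks the third move even though its value estimate is the lowest, which is the exploration effect the transcript describes; the resulting pi is what the policy head is trained to match.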
Original Description
This video explains AlphaGo Zero! AlphaGo Zero uses less prior information about Go than AlphaGo. Whereas AlphaGo is initialized by supervised learning on human experts' mappings from state to action, AlphaGo Zero is trained from scratch through self-play. AlphaGo Zero achieves this by combining the policy and value networks into a single residual neural network. AlphaGo Zero also enhances the self-play training loop with MCTS and uses the MCTS action distribution to train its own policy / value network!
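As a rough illustration of the single policy / value network and how an MCTS action distribution can train it, here is a toy NumPy sketch. It is not the architecture from the paper: it uses small dense layers instead of convolutions, omits batch normalization, and every name, size, and constant in it is a hypothetical choice made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
N_INPUT, N_HIDDEN, N_MOVES = 18, 16, 10  # toy sizes, not the paper's

# Hypothetical parameters: a shared trunk plus one small head per task.
params = {
    "w_in": rng.normal(0, 0.1, (N_INPUT, N_HIDDEN)),
    "w_res1": rng.normal(0, 0.1, (N_HIDDEN, N_HIDDEN)),
    "w_res2": rng.normal(0, 0.1, (N_HIDDEN, N_HIDDEN)),
    "w_policy": rng.normal(0, 0.1, (N_HIDDEN, N_MOVES)),
    "w_value": rng.normal(0, 0.1, (N_HIDDEN, 1)),
}

def relu(x):
    return np.maximum(x, 0.0)

def forward(params, state):
    """Return (p, v): move probabilities and a value in [-1, 1]."""
    h = relu(state @ params["w_in"])
    # Residual block: the skip connection adds the block input back in.
    h = relu(h + relu(h @ params["w_res1"]) @ params["w_res2"])
    logits = h @ params["w_policy"]
    logits = logits - logits.max()  # shift for numerical stability
    p = np.exp(logits) / np.exp(logits).sum()
    v = float(np.tanh(h @ params["w_value"])[0])
    return p, v

def loss(params, state, pi, z, c=1e-4):
    """Value error + policy cross-entropy + L2 penalty for one position."""
    p, v = forward(params, state)
    value_term = (z - v) ** 2
    policy_term = -np.dot(pi, np.log(p + 1e-12))
    l2_term = c * sum(np.sum(w ** 2) for w in params.values())
    return value_term + policy_term + l2_term

state = rng.normal(size=N_INPUT)      # stand-in for an encoded board
pi = np.full(N_MOVES, 1.0 / N_MOVES)  # stand-in for an MCTS distribution
z = 1.0                               # stand-in for the game outcome
p, v = forward(params, state)
example_loss = loss(params, state, pi, z)
```

Both heads read the same hidden features, so a gradient step on this loss would update the shared trunk from the policy and value errors together.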
Thanks for watching! Please subscribe!
Paper Link:
DeepMind Blog Post:
This video is a member of the series "From AlphaGo to MuZero" covering the progression of DeepMind's board game agents, inspired by the Kaggle Connect X competition!