MuZero - Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model | RL Paper explained

Aleksa Gordić - The AI Epiphany · Beginner ·📄 Research Papers Explained ·5y ago

Skills: Agent Foundations90%Multimodal LLMs80%Research Methods80%Tool Use & Function Calling70%

Key Takeaways

The video explains the MuZero algorithm, a reinforcement learning model that achieves superhuman performance in Atari, Go, Chess, and Shogi games without any knowledge of the underlying dynamics, using a tree-based search with a learned model. It also discusses the model's architecture, training process, and performance in various games.

Full Transcript

finally i'm going to cover mu0 the latest agent in the lineage of alphago agents in this paper called mastering atari go chess and shogi by planning with a learn model and again that's the deepmind crew that's been working on this uh since 2014 or something and uh basically uh be really helpful if you already watched my previous videos on alphago alphago zero and alpha zero because many of the details still apply here so do watch those i'll link it somewhere here um okay before i start digging into the paper let me just quickly walk you through the what happened with this lineage of agents and the first thing that happened was uh alphago back in 2015 had to develop this agent which would solve the game of go but the trick was it used a lot of human data so the professional games played by go players and it had a lot of ghost specific heuristics integrated into the agent itself and it also knew the rules of the game obviously and then they had alphago zero which basically uh also played go and was even better than alphago but the trick was they didn't use any uh human data they were just using pure self-play reinforcement learning and uh the algorithm the agent learned how to play by itself and became the best player ever so far the next step was doing the alpha zero which uh basically was a minor let's say minor probably a lot of engineering but a minor conceptual change so they had to prune a couple of more details from alpha go zero in order to make it uh general enough so that it can play not only go but also chess and shogi which is a so-called japanese chess and again we didn't have any human data but we did have the rules and the latest agent in this lineage is mu0 which is this paper and basically um they just stitched the rules and i'll explain in a minute what that means and it can also play atari games now so the atari 57 games in the atari benchmark so when they say they don't have the rules what does it mean what does that mean well basically that means that during this so if you watch my previous videos you know what monte carlo research is and basically what it means is that during the monte carlo tree search you don't have the available simulator so you don't know how to uh given some action so starting from a root state as zero given some action you don't know what the next state here will be you don't know how the exact layout of the boards will look like in the case of board games or in the case of atari you don't know what the next screen will look like and basically because you don't have the simulator you'll have to learn uh that dynamics model and we'll see and that's the whole main paper the main idea of this paper is to learn the dynamics model and then subsequently use it in order to plan in order to play all of these games so let's see how they managed to pull it off so they said here we present the mu00 algorithm which by combining a tree based search so again monte carlo tree search with a learn model that's the novelty uh achieves superhuman performance in a range of challenging and visually complex domains without any knowledge of the underlying dynamics so that's what i mentioned already and you'll understand that as the as as this video progresses uh so vignette then evaluating on go chess and shogi without any knowledge of the game rules mu zero matched the superhuman performance of the alpha zero agent so that's this one um algorithm that was supply with the game rules and it also achieved state-of-the-art on the this atari uh 57 benchmark and that's pretty awesome as well okay so let's let's dig into the algorithm uh straight away and i'll skip this part for now and let's just explain how everything looks like in mu0 okay so first let's understand the algorithm itself uh first things first the um the mu0 agent consists out of three parts so the first part is the representation function h the second function is the prediction function f which will give us the policy and the value function and finally we'll have something called g function or the dynamics function and given the state and given the action it will just output the next state and it will output the reward okay and this is what we're trying to learn here so we don't have the simulator the perfect simulator we have this dynamics model which we're trying to learn and so these three components together comprise the new zero model okay so uh this is the input state and the thing is because we have atari we have go jaz and shogi and i'll just give you example of how this thing looks like for go and how it looks for atari uh as two representative cases so for go this thing whoops will look like this so we'll have eight planes uh and each plane will have 19 by 19 resolution because that's how the go board looks like so this is going to be so 8 and the reason we have 8 is because in order to play go you need to have some past observations otherwise you can't play the game okay so for atari this is going to look like so similarly we'll have 32 x 3 because we have rgb frames we'll have to encode 32 frames and so this is going to look like this and that's like i think 96 by 96 resolution and we'll have additionally 32 planes uh for the 32 previous actions and that's also needed because in order to play atari you just need those actions encoded and we just we're just going to broadcast for example if the action was zero you're just gonna put a bias plane with all zeros uh stacked here if it was one you're just going to input uh one over 18 because atari has 18 actions so that's how they encode the actions as just uh constant bias planes uh here so once we have those uh volumes those represent the input states and now we have the h function which is a simple resnet so you'll just pass that volume into a resonant and out comes uh the hidden representation the hidden uh state and it's going to look similarly it's going to have 256 planes and depending on whether you have atari or whether we have go or whatever for atari is going to be six by six resolution and for go it's gonna remain the same so 19 by 19 because again the board has 19 by 19 resolution okay and now we just supply monte carlo research in the following manner we have the function which is the prediction function which is also a convolutional neural network and we just pass this hidden state into it and out comes the policy so the policy will be like uh 361 uh like uh dimensional vector in the case of go and the value would just be the the value function which gives you the uh the the expected reward uh you'll get going from this state to the end of the game just basic reinforcement learning stuff um so that's the uh that part and using those uh uh priors using those probabilities coming from the policy that's gonna be your prior in your monte carlo research if you remember from the previous videos and you're just going to take the maximum p like a puct value and you for example you took this branch and once you have um that action you're you're going to pass it into the this g function so let's see how that functions so again example uh i explain we have this hidden state and it looks like this so 256 there and this is dependent on the on the game itself so this is 256 and what we're going to do is uh we're going to encode this action again spatially so as i previously said for example for atari you're just going to add a bias plane here and this is what what's going into the g function and uh out goes the the next hidden state which will again have 256 planes and out goes the reward here so just by doing this and doing the simulations through the monte carlo using this uh g model the the dynamics model and the the prediction and the representation models you'll build up the monte carlo research and then as you know you just take the the highest visitation count in the route and that's your next action and so because you can see here you just pick the maximum visitation count action and you pass it to the environment the environment gives you some reward and it gives you the next observation so basically you'll just in the case of say go because those are eight eight boards you'll just pop the the oldest one and you'll just append the newest board state that came from the environment and that's your new input state and then you just recursively repeat this until the end of the game and then the environment tells you okay this is uh environment will just stop you and what you'll do is you'll be saving those trajectories into this centralized replay buffer where all of these actors are going to store their experience so basically mu0 is a distributed agent that means you have multiple actors all of these are maybe separate threats and they'll be executing this thing here they'll be collecting experience they'll be saving stuff like this thing like the the policy coming the refined policy from the monte carlo research and rewards and states uh they'll be saving those tuples here and later you can use those to train this thing now there is one thing that may confuse you and that's how this thing how does this thing play without any rules and the answer is like the partial answer and we'll dig into it a bit later is basically uh only in this root state you can query the environment and ask what the valid actions are so as you can imagine for for example in the case of go you'll have 361 branches here and not every single one is going to be illegal so once you communicate with the environment it tells you maybe these actions here are invalid that means you'll just pull the priors the probabilities to zero and you'll take that probability mass and you'll just proportionally distribute distributed across all of the other branches and uh so doing that uh you'll basically be building the monte carlo tree search where the root node will kind of constrain you uh constrain your space but all of the other nodes here like all of these they'll have all of the 361 actions available in the case of go and that means uh you you may play some initially you'll be playing maybe some impossible games illegal games but that doesn't matter eventually during training the the the the the model will learn how to uh basically do the correct steps it will learn the rules of the game basically okay uh before explaining the actual uh training procedure let me just further clarify and uh how this how it works without any rules and that's the part that bugged me a lot initially so let me try and explain you so first of all specifically alphago zero and alpha zero use knowledge of the rules of the game in three places so state transitions in the search tree actions available at each node of the search tree and episode termination within the surgery so there are three parts understanding how this thing works without any rules first things first this one so alpha zero had access to a perfect simulator of the true dynamics process so this is how the thing looks like you basically have um so this is the root of the monte carlo research so this is your initial state uh you start building the monte carlo tree search like this and what's the difference well i already explained how the input state looks like so in the case of go you have the eight planes 19 by 19 special spatial resolution and once you take some action in the monte carlo research what you'll end up here is um again the basically the input observations and not some uninterpretable um a hidden state like in the case of mu0 so you'll end up here with again with some eight planes uh and you have 19 by 19 uh basically uh boards and that's that's here and again once you play some other action here a prime again here you'll have the interpretable result because you have the rules because you have the simulator you can do that in the case of mu0 you don't have that so here you'll have instead of eight you'll have those 256 planes and you don't have any interpretation of that uh like hidden volume there's just some state that the mu00 learns how to build up uh so that you can predict the policy the value and the rewards so that's what i want you to understand that's the first thing the second part is and i already kind of explained this museum only masks legal actions at the root of the search tree where the environment can be queried but does not perform any masking within the search tree itself okay so what does that mean basically in the case of go again you have 361 actions going from the root node you'll mask some of those out because you can query the environment and that means maybe you mask some portion here and you'll you'll have this part available for simulations and you'll be building the tree but the thing is all of the other nodes here will have all of the actions available because you have you don't have the simulator and so that's the second part and the third part is the terminal nodes so inside the tree the search can proceed past a terminal node in this case the network is expected to always predict the same value this is achieved by treating terminal states as absorbing states during training so what they say here is the following so because once you have the simulator once you arrive at a certain state the terminal state the similar would just tell you hey this is the terminal state you you can take the reward you just back prop that back up the reward up the monte carlo research and that's it that's what we've done what we've done in alphago in alphago zero and alpha zero here you can actually go past this state because you don't know that it's a terminal state and that's kind of a problem and uh they mentioned here that's somehow and i don't understand this part really well like it's kind of confusing they didn't did they didn't explain it really well in the paper uh how how they enforce that the value function always predicts the same value yeah so if you maybe know the answer just comment down below but for now we can just treat it as a black box and uh continue with this video uh let me just now explain you the that training procedure itself okay so we've been saving these experience replays into the replay buffer and now what we do is we sample one of the real sequences so this is one of the real sequences we've stored and you can see the input states and you can see some actions and what we do is we take one state so the first state we pass it into our representation functions of the resonant and we get some hidden state here okay what we do is we pass it into the the prediction model we get the policy we get the value then we uh basically take the section that we have from the replay buffer and we concatenate it spatially again with the hidden representation and we pass it into the g function the the dynamics model and out comes the reward and out comes the next state we repeat the process again so we pass it through the prediction function we get the policy we get the value we take the action again we concatenate it with the hidden state we pass it into the g the dynamics model and again we get uh the hidden state et cetera et cetera so basically um they they've been using these uh and they said somewhere here um so in each case we train mu zero for k five hypothetical steps so that means they'll have even though you can see only three here they basically in practice use five of these steps and um now once you have this uh this is how you train the thing so you make sure that this policy here is the same as this one and you store you've stored this one in the replay buffer so you can use it as the target and just apply a simple cross entry below so you minimize you minimize the kl divergence you make those two the same and uh that's one component of the loss the second component you want to make sure the value function uh becomes as close as possible to the outcome of the game so in the case of go uh the game will after you roll this out at the end you'll either win and get plus one you'll have a tie so zero or you'll lose that's minus one and you'll make you'll make sure to and let's just kind of uh denote this is that and you'll just want to make sure that v minus z squared gets to zero and that's simple mean square error uh loss and finally you do the same thing for your words except that uh in board games such as go because the rewards are always zero until the very end you actually won't be using this component so that this thing will be used for atari where you have rewards along the way uh but just remember there's a thread component and that's pretty much it there is some regularization of the parameters and um that's pretty much it so as you can see uh by doing by putting the loss here and here you'll be modifying and this is some as you can see recurring process so whatever came here is going to influence the values here so this you can treat this as a rule like the recurrent neural network even though they are not using recurrent neural network here at basically they'll be tweaking all of these parameters the parameters of the the dynamics model the parameters of the uh prediction functions and the parameters of h there are the representation function to learn this model and eventually uh you'll have a pretty decent uh dynamics model which basically now contains the rules of the game and that's awesome um so the the the part that's maybe confusing is it as you can see here what keeps this this hidden state which doesn't have any interpretation um correlated to this input real state that we sampled from the replay buffer is exactly the loss so by making sure that the uh by making sure that the prediction function from this state gives you valid policy and value uh eventually this state will have something to do with this state even though they didn't uh enforce any constraints upon this hidden state so you can't reconstruct the actual observations going from this state that's not how they train it and yet it works because the only the only thing we want to encode into this is we want to be sure we want to make sure that we can actually get the policy get the value and get reward from these states and that's that's that's it and they mentioned that here so at every one of these steps the model predicts the policy the value function and the immediate reward the model is trained end-to-end with the sole objective of accurately estimating these three important quantities so as to match the improved estimates of policy coming from the monte carlo research and value generated by search as well as the observed reward there is no direct constraint or requirement for the hidden state to capture all information necessary to reconstruct the original observation so you can go from the hidden state to the actual input observations and that's not needed because we only care about the three quantities finally instead the hidden states are free to represent state in whatever way is relevant to predicting the current and future values and policies intuitively the agent can invent internally the rules of dynamics that lead to most accurate planning and that's what's so awesome about new zero basically uh by doing this process here you learn this model and it just works they say here again no semantics attached to the state you just predict these and it's just solved all by itself uh here is the um just the loss function again basically uh this is the cross entropy i mentioned so this is your pal the raw policy this is the monte carlo policy you want to make sure those are the same uh just a simple um like a mean square error here for the value function and for the reward you also do msc between the reward you got and between your output from your dynamics model g finally the regularization so theta just comprises out of theta f plus theta h and finally you have data g that's the mu00 model and you just do l2 regularization that's the final loss you have five hypothetical steps you just train it on the bunch of trajectories and uh we can see now the results uh let me show you this basically uh comparing uh mu0 on chess so this is y-axis this deal score number of training steps you can see that this is probably stock fish so it got better than stock fish at one point of time then we have for shogi similar and finally the the game of go here you can see it gets better than uh alpha zero and uh finally on atari uh they took something called rt d2 agent and this is the the the mean value this is the median and you can see that the mu0 achieves a higher median and mean scores than the state-of-the-art uh so that's awesome now uh there is one one catch here and that's that actually on on some of the tough games like uh like montezuma's revenge and pitfall and solaris etc uh you'll have zero score pretty much with mu0 so if you take a look at the distribution across all of the 57 games so this is these are the 57 games here it will have like a really high score for some games and then we'll just dip here and for the last like five or seven games the score is going to be really low uh but you can't catch that using just these two statistics like mean and median you need to show the fifth percentile etc so like an agent called agent 57 later solved this thing and achieved much better results here but you actually had low results here so it's a more general agent but it still had lower mean and median scores than mu zero okay so those are some results uh and i mentioned it here and now let me just show you uh some more results uh uh basically this is for when you have a bunch of data like 20 billion frames you can see mu0 achieves much better results than r2d2 uh apex is also some distributed agent and when they use 10x less data and something called mu0 reanalyze which learns how to better basically has better data efficiency because it can reuse some of the uh trajectories from the replay buffer and uh train multiple times on those trajectories uh by using the freshest weights uh and they show that they they get much better data efficiency again the scores are lower but it still achieves state of the art with less data and that's awesome um that's pretty much it there is a couple more things i want to mention and that's here so for the game of go uh basically after training uh the agent with 800 simulations for every single a monte carlo research so when you do those rollouts you use 800 simulations and that takes around 0.1 seconds and they can they can show that by using more but giving the the agent uh in the evaluation but giving it more search time so more simulations uh it it uh achieves much better elo which means it generalizes to to surgeries that it hasn't encountered during the training and that's awesome you can see that uh it until like two orders of magnitude uh it still has pretty decent uh increase in yellow score uh the same thing doesn't apply for atari because atari they speculate that the dynamics model the g model is harder to learn there so you can imagine why because you have to kind of understand how to transition the frames which are it's tougher problem than than analyzing the the like the board states uh and basically they train it for with 50 simulations and you can see it does improve until 100 but then it just kind of flattens and just starts slowly starts to actually uh go down so yeah so atari is a bit different uh here nothing interesting basically the more simulations the better results as we can expect uh that's that's pretty much it let me just see whether there is something more interesting to tell you here yeah there is a small difference in the search algorithm because now we have rewards along the way and not just on the term in the terminal states like in the case of war games and because of that they had to introduce this uh as you can see this this notion of of this cumulative reward and they say here the backup is generalized to the case where the environment can emit intermediate rewards have a discount gamma different from one and the value estimates are unbounded and basically now once you uh as you can see here this is familiar so your visitation count and your uh this is the the the action value function uh the the way you're now updating it is by adding this uh g estimate and that's just a sum of discount rewards plus the bootstrapped value here so yeah so in the case of uh board games these would all be zero pretty much and you'll just have the bootstrap from the leaf node and you'd propagate that one up the up up the tree so you do the the you backed up those values but here you you have all of these in case of atari that also has some consequences that we have to now normalize the uh the q values um because um as i said rewards are you have intermediate rewards we can which can be unbounded so you have to do this normalization in order to uh have this thing interact really well with the other part of the puct algorithm so basically you have as you remember we have the queue and you have this prior times times the visitations so the more you visit something that this will get uh higher so this will get lower and basically finally you'll just rely on the q value so you want to make sure that this is of similar uh scale as this thing so you just have the normalization by uh saving the max and min uh q values throughout the monte carlo research you'll be logging that statistic and you can use it here to normalize your queue so just a couple of technical details but hopefully you got the gist of the algorithm i think it's pretty neat uh basically they achieved really good scores so again it'd be really nice if in my opinion if if we could have uh the the the scores be uh as close uh like even better than agent 57 that would be really impressive um but yeah uh those games require long term planning and um mu0 just couldn't solve it uh with this approach maybe that's the forward like future research direction that's it hopefully you enjoyed this video if you did leave a like subscribe share and see in the next video you

Original Description

❤️ Become The AI Epiphany Patreon ❤️ ► https://www.patreon.com/theaiepiphany ▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬ MuZero - the latest agent in the lineage of AlphaGo agents. Zero human knowledge, zero rules, and cracks not only Go, Chess and Shogi but additionally it achieves SOTA on the Atari benchmark. You'll learn about: ✔️ How can MuZero learn to play without the rules ✔️ How does it learn the dynamics/model ▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬ ✅ Paper: https://arxiv.org/abs/1911.08265 ✅ Blog: https://deepmind.com/blog/article/muzero-mastering-go-chess-shogi-and-atari-without-rules ▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬ ⌚️ Timetable: 00:00 Overview of the AlphaGo lineage 03:00 MuZero actors explained 11:10 How can MuZero work without any rules? 14:50 MuZero learner explained 21:15 Results 25:15 Update to the search algorithm ▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬ 💰 BECOME A PATREON OF THE AI EPIPHANY ❤️ If these videos, GitHub projects, and blogs help you, consider helping me out by supporting me on Patreon! The AI Epiphany ► https://www.patreon.com/theaiepiphany One-time donation: https://www.paypal.com/paypalme/theaiepiphany Much love! ❤️ Huge thank you to these AI Epiphany patreons: Petar Veličković Zvonimir Sabljic ▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬ 💡 The AI Epiphany is a channel dedicated to simplifying the field of AI using creative visualizations and in general, a stronger focus on geometrical and visual intuition, rather than the algebraic and numerical "intuition". ▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬ 👋 CONNECT WITH ME ON SOCIAL LinkedIn ► https://www.linkedin.com/in/aleksagordic/ Twitter ► https://twitter.com/gordic_aleksa Instagram ► https://www.instagram.com/aiepiphany/ Facebook ► https://www.facebook.com/aiepiphany/ 👨‍👩‍👧‍👦 JOIN OUR DISCORD COMMUNITY: Discord ► https://discord.gg/peBrCpheKE 📢 SUBSCRIBE TO MY MONTHLY AI NEWSLETTER: Substack ► https://aiepiphany.substack.com/ 💻 FOLLOW ME ON GITHUB FOR COOL PROJECTS: GitHub ► https://github.com/gordicaleksa 📚 FOLLOW ME ON MEDIUM: Me

Watch on YouTube ↗ (saves to browser)

Sign in to unlock AI tutor explanation · ⚡30

Playlist

Uploads from Aleksa Gordić - The AI Epiphany · Aleksa Gordić - The AI Epiphany · 38 of 60

← Previous Next →

Intro | Neural Style Transfer #1

Intro | Neural Style Transfer #1

Aleksa Gordić - The AI Epiphany

Basic Theory | Neural Style Transfer #2

Basic Theory | Neural Style Transfer #2

Aleksa Gordić - The AI Epiphany

Optimization method | Neural Style Transfer #3

Optimization method | Neural Style Transfer #3

Aleksa Gordić - The AI Epiphany

Advanced Theory | Neural Style Transfer #4

Advanced Theory | Neural Style Transfer #4

Aleksa Gordić - The AI Epiphany

Anyone can make deepfakes now!

Anyone can make deepfakes now!

Aleksa Gordić - The AI Epiphany

What is Computer Vision? | The Art of Creating Seeing Machines

What is Computer Vision? | The Art of Creating Seeing Machines

Aleksa Gordić - The AI Epiphany

Feed-forward method | Neural Style Transfer #5

Feed-forward method | Neural Style Transfer #5

Aleksa Gordić - The AI Epiphany

Alan Turing | Computing Machinery and Intelligence

Alan Turing | Computing Machinery and Intelligence

Aleksa Gordić - The AI Epiphany

Feed-forward method (training) | Neural Style Transfer #6

Feed-forward method (training) | Neural Style Transfer #6

Aleksa Gordić - The AI Epiphany

What is Google Deep Dream? (Basic Theory) | Deep Dream Series #1

What is Google Deep Dream? (Basic Theory) | Deep Dream Series #1

Aleksa Gordić - The AI Epiphany

Semantic Segmentation in PyTorch | Neural Style Transfer #7

Semantic Segmentation in PyTorch | Neural Style Transfer #7

Aleksa Gordić - The AI Epiphany

How to get started with Machine Learning

How to get started with Machine Learning

Aleksa Gordić - The AI Epiphany

How to learn PyTorch? (3 easy steps) | 2021

How to learn PyTorch? (3 easy steps) | 2021

Aleksa Gordić - The AI Epiphany

PyTorch or TensorFlow?

PyTorch or TensorFlow?

Aleksa Gordić - The AI Epiphany

3 Machine Learning Projects For Beginners (Highly visual) | 2021

3 Machine Learning Projects For Beginners (Highly visual) | 2021

Aleksa Gordić - The AI Epiphany

Machine Learning Projects (Intermediate level) | 2021

Machine Learning Projects (Intermediate level) | 2021

Aleksa Gordić - The AI Epiphany

Cheapest (0$) Deep Learning Hardware Options | 2021

Cheapest (0$) Deep Learning Hardware Options | 2021

Aleksa Gordić - The AI Epiphany

How to learn deep learning? (Transformers Example)

How to learn deep learning? (Transformers Example)

Aleksa Gordić - The AI Epiphany

How do transformers work? (Attention is all you need)

How do transformers work? (Attention is all you need)

Aleksa Gordić - The AI Epiphany

Developing a deep learning project (case study on transformer)

Developing a deep learning project (case study on transformer)

Aleksa Gordić - The AI Epiphany

Vision Transformer (ViT) - An image is worth 16x16 words | Paper Explained

Vision Transformer (ViT) - An image is worth 16x16 words | Paper Explained

Aleksa Gordić - The AI Epiphany

GPT-3 - Language Models are Few-Shot Learners | Paper Explained

GPT-3 - Language Models are Few-Shot Learners | Paper Explained

Aleksa Gordić - The AI Epiphany

Google DeepMind's AlphaFold 2 explained! (Protein folding, AlphaFold 1, a glimpse into AlphaFold 2)

Google DeepMind's AlphaFold 2 explained! (Protein folding, AlphaFold 1, a glimpse into AlphaFold 2)

Aleksa Gordić - The AI Epiphany

Attention Is All You Need (Transformer) | Paper Explained

Attention Is All You Need (Transformer) | Paper Explained

Aleksa Gordić - The AI Epiphany

Graph Attention Networks (GAT) | GNN Paper Explained

Graph Attention Networks (GAT) | GNN Paper Explained

Aleksa Gordić - The AI Epiphany

Graph Convolutional Networks (GCN) | GNN Paper Explained

Graph Convolutional Networks (GCN) | GNN Paper Explained

Aleksa Gordić - The AI Epiphany

Graph SAGE - Inductive Representation Learning on Large Graphs | GNN Paper Explained

Graph SAGE - Inductive Representation Learning on Large Graphs | GNN Paper Explained

Aleksa Gordić - The AI Epiphany

PinSage - Graph Convolutional Neural Networks for Web-Scale Recommender Systems | Paper Explained

PinSage - Graph Convolutional Neural Networks for Web-Scale Recommender Systems | Paper Explained

Aleksa Gordić - The AI Epiphany

OpenAI CLIP - Connecting Text and Images | Paper Explained

OpenAI CLIP - Connecting Text and Images | Paper Explained

Aleksa Gordić - The AI Epiphany

Temporal Graph Networks (TGN) | GNN Paper Explained

Temporal Graph Networks (TGN) | GNN Paper Explained

Aleksa Gordić - The AI Epiphany

Graph Neural Network Project Update! (I'm coding GAT from scratch)

Graph Neural Network Project Update! (I'm coding GAT from scratch)

Aleksa Gordić - The AI Epiphany

Graph Attention Network Project Walkthrough

Graph Attention Network Project Walkthrough

Aleksa Gordić - The AI Epiphany

How to get started with Graph ML? (Blog walkthrough)

How to get started with Graph ML? (Blog walkthrough)

Aleksa Gordić - The AI Epiphany

DQN - Playing Atari with Deep Reinforcement Learning | RL Paper Explained

DQN - Playing Atari with Deep Reinforcement Learning | RL Paper Explained

Aleksa Gordić - The AI Epiphany

AlphaGo - Mastering the game of Go with deep neural networks and tree search | RL Paper Explained

AlphaGo - Mastering the game of Go with deep neural networks and tree search | RL Paper Explained

Aleksa Gordić - The AI Epiphany

DeepMind's AlphaGo Zero and AlphaZero | RL paper explained

DeepMind's AlphaGo Zero and AlphaZero | RL paper explained

Aleksa Gordić - The AI Epiphany

OpenAI - Solving Rubik's Cube with a Robot Hand | RL paper explained

OpenAI - Solving Rubik's Cube with a Robot Hand | RL paper explained

Aleksa Gordić - The AI Epiphany

MuZero - Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model | RL Paper explained

MuZero - Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model | RL Paper explained

Aleksa Gordić - The AI Epiphany

EfficientNetV2 - Smaller Models and Faster Training | Paper explained

EfficientNetV2 - Smaller Models and Faster Training | Paper explained

Aleksa Gordić - The AI Epiphany

Implementing DeepMind's DQN from scratch! | Project Update

Implementing DeepMind's DQN from scratch! | Project Update

Aleksa Gordić - The AI Epiphany

MLP-Mixer: An all-MLP Architecture for Vision | Paper explained

MLP-Mixer: An all-MLP Architecture for Vision | Paper explained

Aleksa Gordić - The AI Epiphany

DeepMind's Android RL Environment - AndroidEnv

DeepMind's Android RL Environment - AndroidEnv

Aleksa Gordić - The AI Epiphany

When Vision Transformers Outperform ResNets without Pretraining | Paper Explained

When Vision Transformers Outperform ResNets without Pretraining | Paper Explained

Aleksa Gordić - The AI Epiphany

Non-Parametric Transformers | Paper explained

Non-Parametric Transformers | Paper explained

Aleksa Gordić - The AI Epiphany

Chip Placement with Deep Reinforcement Learning | Paper Explained

Chip Placement with Deep Reinforcement Learning | Paper Explained

Aleksa Gordić - The AI Epiphany

Text Style Brush - Transfer of text aesthetics from a single example | Paper Explained

Text Style Brush - Transfer of text aesthetics from a single example | Paper Explained

Aleksa Gordić - The AI Epiphany

Graphormer - Do Transformers Really Perform Bad for Graph Representation? | Paper Explained

Graphormer - Do Transformers Really Perform Bad for Graph Representation? | Paper Explained

Aleksa Gordić - The AI Epiphany

GANs N' Roses: Stable, Controllable, Diverse Image to Image Translation | Paper Explained

GANs N' Roses: Stable, Controllable, Diverse Image to Image Translation | Paper Explained

Aleksa Gordić - The AI Epiphany

VQ-VAEs: Neural Discrete Representation Learning | Paper + PyTorch Code Explained

VQ-VAEs: Neural Discrete Representation Learning | Paper + PyTorch Code Explained

Aleksa Gordić - The AI Epiphany

VQ-GAN: Taming Transformers for High-Resolution Image Synthesis | Paper Explained

VQ-GAN: Taming Transformers for High-Resolution Image Synthesis | Paper Explained

Aleksa Gordić - The AI Epiphany

Multimodal Few-Shot Learning with Frozen Language Models | Paper Explained

Multimodal Few-Shot Learning with Frozen Language Models | Paper Explained

Aleksa Gordić - The AI Epiphany

Focal Transformer: Focal Self-attention for Local-Global Interactions in Vision Transformers

Focal Transformer: Focal Self-attention for Local-Global Interactions in Vision Transformers

Aleksa Gordić - The AI Epiphany

AudioCLIP: Extending CLIP to Image, Text and Audio | Paper Explained

AudioCLIP: Extending CLIP to Image, Text and Audio | Paper Explained

Aleksa Gordić - The AI Epiphany

RMA: Rapid Motor Adaptation for Legged Robots | Paper Explained

RMA: Rapid Motor Adaptation for Legged Robots | Paper Explained

Aleksa Gordić - The AI Epiphany

DALL-E: Zero-Shot Text-to-Image Generation | Paper Explained

DALL-E: Zero-Shot Text-to-Image Generation | Paper Explained

Aleksa Gordić - The AI Epiphany

DETR: End-to-End Object Detection with Transformers | Paper Explained

DETR: End-to-End Object Detection with Transformers | Paper Explained

Aleksa Gordić - The AI Epiphany

DINO: Emerging Properties in Self-Supervised Vision Transformers | Paper Explained!

DINO: Emerging Properties in Self-Supervised Vision Transformers | Paper Explained!

Aleksa Gordić - The AI Epiphany

DeepMind DetCon: Efficient Visual Pretraining with Contrastive Detection | Paper Explained

DeepMind DetCon: Efficient Visual Pretraining with Contrastive Detection | Paper Explained

Aleksa Gordić - The AI Epiphany

Do Vision Transformers See Like Convolutional Neural Networks? | Paper Explained

Do Vision Transformers See Like Convolutional Neural Networks? | Paper Explained

Aleksa Gordić - The AI Epiphany

Fastformer: Additive Attention Can Be All You Need | Paper Explained

Fastformer: Additive Attention Can Be All You Need | Paper Explained

Aleksa Gordić - The AI Epiphany

The MuZero algorithm is a reinforcement learning model that achieves superhuman performance in complex games without any knowledge of the underlying dynamics. It uses a tree-based search with a learned model, which is trained end-to-end with the sole objective of accurately estimating policy, value function, and immediate reward.

Key Takeaways

Train a MuZero model on a dataset of game trajectories
Implement a tree-based search with a learned model
Use a dynamics model to predict the next state and reward
Use a prediction model to predict the policy and value function
Update the policy and value function at each step

💡 The MuZero algorithm achieves superhuman performance in complex games without any knowledge of the underlying dynamics, using a tree-based search with a learned model.

🔒 Pro feature: Ask AI to explain this lesson →

More on: Agent Foundations

View skill →

Build and Deploy an Agent with Reasoning Engine in Vertex AI

Adding a Phone Gateway to a Virtual Agent

From Zero to Working AI Agent in 60 Seconds

From Zero to Working AI Agent in 60 Seconds

Create An AI Agent With Replit That Automates Your Sales

Create An AI Agent With Replit That Automates Your Sales

Capstone: Autonomous Runway Detection for IoT

Capstone: Autonomous Runway Detection for IoT

AI Agents with Model Context Protocol & Typescript

AI Agents with Model Context Protocol & Typescript

Related Reads

I Spent Weeks Looking for a Research Gap Before I Realized I Was Searching the Wrong Way

Learn how to effectively find research gaps by changing your approach, a crucial skill for AI researchers and academics

ICMI 2026 Reviews [D]

Learn how to interpret ICMI 2026 reviews and improve your paper's acceptance chances

Reddit r/MachineLearning

Workshop submission for main conference paper under review [D]

Learn how to navigate submitting a paper to a non-archival workshop before the final decision of a main conference like ECCV

Reddit r/MachineLearning

Kept context-switching between arxiv, OpenReview, GitHub, and HuggingFace for every paper, so I built this. Chrome extension + website with everything inline, plus citation graph + SPECTER2 neighbors. 3M papers, free, feedback welcome [P]

Streamline your research with a new Chrome extension and website that integrates 3M papers from arxiv, OpenReview, GitHub, and HuggingFace, including citation graphs and SPECTER2 neighbors, and provide feedback to improve it

Reddit r/MachineLearning

Chapters (6)

Overview of the AlphaGo lineage

3:00 MuZero actors explained

11:10 How can MuZero work without any rules?

14:50 MuZero learner explained

21:15 Results

25:15 Update to the search algorithm

Indians Under House Arrest in America? 😱 Immigration Crisis Explained | SumanTV Classroom

SumanTV Classroom