World Models

Yannic Kilcher · Beginner ·🤖 AI Agents & Automation ·8y ago

Key Takeaways

The video discusses building generative neural network models of popular reinforcement learning environments using variational autoencoders and recurrent neural networks, with applications in world modeling and policy learning.

Full Transcript

hi today we're looking at world models by David ha and Jung schmidhuber this is a paper that's concerned with reinforcement learning and especially with the problem of say you have an environment that you interact with and you kind of need to learn to act in it but it could be for example very expensive to always query the environment so let's say you have like a robot and it needs to do something in the world and you kind of to have a robot execute something and then observe it is quite expensive cost electricity and so on so you would like to sort of minimize how many times the how many times this happens so here searching for a good picture they're concerned with problems for example like this this is a race car simulator simulator there's an open eye gym environment for that the other one that they use is so-called like a doom experiment where as you look at this there's a couple of monsters and they're shooting fireballs at you and the task is just to kind of avoid the fireballs so the entire point of the paper is that I don't actually need to interact with the environment nor to learn it I can simply kind of learn a model of the environment and then learn using that model so basically I can learn how the environment works and then simply use my imagination of the environment my model in order to to learn from that so I don't have to interact with the real environment anymore so how do they do this they do it in multiple stages here first thing they do is they collect a bunch of samples from the environment so they go to the environment they simply do a random policy and then they collect a bunch of samples I think the process is outlined down here somewhere we saw it before here collect ten thousand roll outs from around the policy next they train a VA e here to kind of learn the environment so that's where that's where that comes in this is all done in stage it's not end to end the VI e is simply a model that takes a takes in this case a video frame here and it sends it through an encoder neural network to obtain what's called a latent representation which is a much smaller dimensional representation so if the image is like 64 by 64 pixels then the the latent code could be as little as like a hundred or even ten dimensional so you see that there's quite a bit of compression going on and this is a variational auto encoder it's not really important here that it's variational since they the difference is the variation auto encoder is kind of a stochastic process whereas the regular auto encoder isn't but they introduced elasticity later again so it's not particularly important but so it's a variational auto encoder which means they obtain a latent representation that defines distribution over outputs so they send this they sample from this latent distribution that they obtain and then they feed this to the decoder and the decoder can give back what it thinks the encoder encoded so the decoder tries to reconstruct as close as possible this original frame that was given to the encoder but of course it can't because we've compressed it so much to this to this lower dimensional representation here so it kind of does its best effort so what you hope to achieve with this is that kind of the decoder learns for exam well there's always here this is the the ceiling right here is always gray so basically you you shouldn't actually need to encode this in your in your xiv if it's always gray the decoder should learn this by itself so your hope is that the disease alight in representation will simply end up containing just the information that's kind of different or between the between the individual frames which here I guess would be kind of the the fireballs coming and your position relative to them and that's that's what's changing if you think about this environment so your hope is that the latent representation captures only that whereas all the static parts that are irrelevant or never change are kind of captured by the encoder and the decoder architecture by itself so yeah it's important to note that coder and decoder obviously always the same for all the frames whereas the Z representation of course is there is one per frame so each frame will give you a different Z and that's yeah so you can imagine how that works or how that's going to be useful so they they train this on like a random randomly collected sample of the environment until they're confident they now have a good model of the environment and then what they do next is they use this in order to train an RNN so again they kind of have their their compression model of the environment what they do now is they use these these states you see here here here here that they get from that and they train how these latent representations evolve over time so with an origin here it goes over time so the RNN will always kind of predict what's the next stage of the environment going to be but importantly maybe compared to environment models that we've discussed before in for example imagination augmented agent paper there we always try to directly predict the future pixels so to say of the future frame here the environment model is over the latent representation of course this means that the this is much smaller space so if you're compression model is good then this should be much easier to learn than say like a full end-to-end environment model so this model learns how your latent states evolve over time given your actions right so you can you can imagine that the Z being an abstract representation of your state and then your action and the distal isn't adorned and the Ireland will predict what's the next latent representation and there is a what's called a temperature parameter to control the stochasticity I've already told you this this um there is a stockist his stochasticity built in to this so the RNA will simply output like some some vector what it thinks is the next thing going to be and they don't use this directly as the next step but they parameterize a kind of a mixture of Gaussian distributions coupled with a decoder here in order to give a random distribution over there the next state and they control the amount of randomness with the temperature parameter they argue that this comes in handy later so alright so what do we have we have a system that can compress the environment into what we would call an essential part right every frame we extract there what's important in that frame then next we have a model that can predict given a state and in action what's the next state going to be the next latent state so technically we now have an environment model right given a state we can simply give a state and a policy we can simply use this model to roll forward so the last component is the actual policy and the actual policy here as you can see is in there K simply a linear model the linear model will take the Z which is the latent representation of the current state and the H which is the current state of the of the RNN that models the environment over time and it will it simply is a linear linear function of the two gives you the action probabilities or I guess the law gates of the actions so it's a really really simple controller over these things and they do this in order kind of to show that that the main part of the work is being done by this environment model and given the environment model you only need very few parameters basically to then learn a policy here is the kind of what I said in in a diagram so the observation goes into the the compression of the VA e the latent representation of that goes into the RN together with the hidden state from the last step and this will out output a new hidden state which goes here into the controller and we also directly take this Z into the controller and then from these two we perform an action which now we have a choice it could go to the environment right give you the next observation but also or at the same time since you kind of need to update your or an n it can go here and update your or an N because it will need to predict the next state the thing is we can also now leave away this path which means we can simply take our RNN and our kind of imagine the next latent representation put it through the decoder part of the VI e and use that as an observation I hope this makes sense it it's it's rather intuitive right you have a model of the environment you can simply use this instead of the real environment so that's a bit of pseudocode here and they do a bunch of experiments right so we're primarily interested so they they say they see here how okay our compression works and this is the real frame and this is the reconstructed frame kind of looks you know captures the essence of what's going on and I actually want to go down here the visiting experiment so what they what they do here in the car racing experiment is they kind of learn this entire thing right and then they learn a policy in the in the real world so in the environment using using this model up here this procedure where they always go to the environment and here is the exact experiment setup so first day were they collect again robots for a random policy they train to be a tight rein the RNN and then they learn the the latent sorry they learn the controller using the entire model but in kind of the real world so they always interact with the environment but because they also have their kind of latent representation of the observation and not directly the observation they get a higher score and also the policy that they use in the real environment transfers to the environment model so the policy that learn in the true environment it transfers to the imagined so if they use the imagined model as an environment it also performs well in the next experiment they're gonna try to do this the other way around they're gonna try to learn only using their model of the environment and then see whether or not the policy transfers to the true environment so that's what they do here they collect collect again a sample from the environment they train the via either trained RNN and then they simply use this virtual environment what they call it in order to learn a policy and at the end they try to transfer you to learn policy on the actual environment and given the results you see yep so they you see the kind of best it it does I would say is about here where the actual score is you can see in this and also in this setting is higher than the kind of previous best algorithm in the open era gene when you go from virtual to actual so what this means is kind of yeah you can you can train using this imagined model and then it will actually transfer but but there's a crucial thing and that is this kind of temperature thing here you can see a lot of times they actually don't manage to reach a good score if this parameter is wrong what does this parameter do this parameter controls as we discussed the stochasticity of the model so basically the environment model doesn't directly imagine future state but it imagines a distribution over future states and the higher this parameter the more stochastic this distribution is basically the more uniform I guess the more entropy you have in these future states we just result we've seen this temperature parameter here which is important because they go into length explaining why in this entire page here that we skipped you see just text they're cheating the world model which basically they say okay if you have a wrong model if you have a model that's wrong on the environment and you train a policy on it necessarily it's gonna make probably find like a policy then that's kind of exploits the wrongness of this model so you might be able to walk through walls or fly or ignore the the fireballs or basically yeah fine find some or find that if you stand next to a wall in your imagination you'll never get hit something like this which isn't true in the real world and so the the policy will exploit that and to counter this they simply basically turn up this temperature parameter giving them a more stochastic procedure meaning they imagine a lot of kind of different futures and they turn their policy on all of them or in expectation of a sample of them which means that if the environment model is wrong it is kind of I wanna say if it's wrong this correct ready it doesn't but if it's wrong you still sample different futures so if it has one wrong future you still have the other ones to kind of punish the policy if it tries to exploit this one mistake at least that's the the reasoning behind it so the that's that's how they do this you can interact with their trained environment models online somehow they also give a a kind of a look at what they would like to have his they would like to kind of instead of collecting the environment model from random rollout to take would try to train it and to use it again to collect more data to train more environment model than use the environment better environment model to train more the policy and so on in a stepwise fashion book they don't actually do it they simply describe it yeah and the rest of the paper is a bit of related work and discussion it's very it's very prosaically written kind of different from what you're used to if you read a lot of these papers but yeah I hope you can know you know what's going on and see you next time

Original Description

Authors: David Ha, Jürgen Schmidhuber Abstract: We explore building generative neural network models of popular reinforcement learning environments. Our world model can be trained quickly in an unsupervised manner to learn a compressed spatial and temporal representation of the environment. By using features extracted from the world model as inputs to an agent, we can train a very compact and simple policy that can solve the required task. We can even train our agent entirely inside of its own hallucinated dream generated by its world model, and transfer this policy back into the actual environment. https://arxiv.org/abs/1803.10122
Watch on YouTube ↗ (saves to browser)
Sign in to unlock AI tutor explanation · ⚡30

Playlist

Uploads from Yannic Kilcher · Yannic Kilcher · 7 of 60

1 Imagination-Augmented Agents for Deep Reinforcement Learning
Imagination-Augmented Agents for Deep Reinforcement Learning
Yannic Kilcher
2 Learning model-based planning from scratch
Learning model-based planning from scratch
Yannic Kilcher
3 Reinforcement Learning with Unsupervised Auxiliary Tasks
Reinforcement Learning with Unsupervised Auxiliary Tasks
Yannic Kilcher
4 Attention Is All You Need
Attention Is All You Need
Yannic Kilcher
5 git for research basics: fundamentals, commits, branches, merging
git for research basics: fundamentals, commits, branches, merging
Yannic Kilcher
6 Curiosity-driven Exploration by Self-supervised Prediction
Curiosity-driven Exploration by Self-supervised Prediction
Yannic Kilcher
World Models
World Models
Yannic Kilcher
8 Challenging Common Assumptions in the Unsupervised Learning of Disentangled Representations
Challenging Common Assumptions in the Unsupervised Learning of Disentangled Representations
Yannic Kilcher
9 Stochastic RNNs without Teacher-Forcing
Stochastic RNNs without Teacher-Forcing
Yannic Kilcher
10 What’s in a name? The need to nip NIPS
What’s in a name? The need to nip NIPS
Yannic Kilcher
11 BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Yannic Kilcher
12 Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
Yannic Kilcher
13 GPT-2: Language Models are Unsupervised Multitask Learners
GPT-2: Language Models are Unsupervised Multitask Learners
Yannic Kilcher
14 Neural Ordinary Differential Equations
Neural Ordinary Differential Equations
Yannic Kilcher
15 The Odds are Odd: A Statistical Test for Detecting Adversarial Examples
The Odds are Odd: A Statistical Test for Detecting Adversarial Examples
Yannic Kilcher
16 Discriminating Systems - Gender, Race, and Power in AI
Discriminating Systems - Gender, Race, and Power in AI
Yannic Kilcher
17 Blockwise Parallel Decoding for Deep Autoregressive Models
Blockwise Parallel Decoding for Deep Autoregressive Models
Yannic Kilcher
18 S.H.E. - Search. Human. Equalizer.
S.H.E. - Search. Human. Equalizer.
Yannic Kilcher
19 Reinforcement Learning, Fast and Slow
Reinforcement Learning, Fast and Slow
Yannic Kilcher
20 Adversarial Examples Are Not Bugs, They Are Features
Adversarial Examples Are Not Bugs, They Are Features
Yannic Kilcher
21 I'm at ICML19 :)
I'm at ICML19 :)
Yannic Kilcher
22 Population-Based Search and Open-Ended Algorithms
Population-Based Search and Open-Ended Algorithms
Yannic Kilcher
23 XLNet: Generalized Autoregressive Pretraining for Language Understanding
XLNet: Generalized Autoregressive Pretraining for Language Understanding
Yannic Kilcher
24 Conversation about Population-Based Methods (Re-upload)
Conversation about Population-Based Methods (Re-upload)
Yannic Kilcher
25 Reconciling modern machine learning and the bias-variance trade-off
Reconciling modern machine learning and the bias-variance trade-off
Yannic Kilcher
26 Learning World Graphs to Accelerate Hierarchical Reinforcement Learning
Learning World Graphs to Accelerate Hierarchical Reinforcement Learning
Yannic Kilcher
27 Manifold Mixup: Better Representations by Interpolating Hidden States
Manifold Mixup: Better Representations by Interpolating Hidden States
Yannic Kilcher
28 Processing Megapixel Images with Deep Attention-Sampling Models
Processing Megapixel Images with Deep Attention-Sampling Models
Yannic Kilcher
29 Gauge Equivariant Convolutional Networks and the Icosahedral CNN
Gauge Equivariant Convolutional Networks and the Icosahedral CNN
Yannic Kilcher
30 Auditing Radicalization Pathways on YouTube
Auditing Radicalization Pathways on YouTube
Yannic Kilcher
31 RoBERTa: A Robustly Optimized BERT Pretraining Approach
RoBERTa: A Robustly Optimized BERT Pretraining Approach
Yannic Kilcher
32 Dynamic Routing Between Capsules
Dynamic Routing Between Capsules
Yannic Kilcher
33 DEEP LEARNING MEME REVIEW - Episode 1
DEEP LEARNING MEME REVIEW - Episode 1
Yannic Kilcher
34 Accelerating Deep Learning by Focusing on the Biggest Losers
Accelerating Deep Learning by Focusing on the Biggest Losers
Yannic Kilcher
35 [News] The Siraj Raval Controversy
[News] The Siraj Raval Controversy
Yannic Kilcher
36 LeDeepChef 👨‍🍳 Deep Reinforcement Learning Agent for Families of Text-Based Games
LeDeepChef 👨‍🍳 Deep Reinforcement Learning Agent for Families of Text-Based Games
Yannic Kilcher
37 The Visual Task Adaptation Benchmark
The Visual Task Adaptation Benchmark
Yannic Kilcher
38 IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures
IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures
Yannic Kilcher
39 AlphaStar: Grandmaster level in StarCraft II using multi-agent reinforcement learning
AlphaStar: Grandmaster level in StarCraft II using multi-agent reinforcement learning
Yannic Kilcher
40 SinGAN: Learning a Generative Model from a Single Natural Image
SinGAN: Learning a Generative Model from a Single Natural Image
Yannic Kilcher
41 A neurally plausible model learns successor representations in partially observable environments
A neurally plausible model learns successor representations in partially observable environments
Yannic Kilcher
42 MuZero: Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model
MuZero: Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model
Yannic Kilcher
43 Reinforcement Learning Upside Down: Don't Predict Rewards -- Just Map Them to Actions
Reinforcement Learning Upside Down: Don't Predict Rewards -- Just Map Them to Actions
Yannic Kilcher
44 NeurIPS 19 Poster Session
NeurIPS 19 Poster Session
Yannic Kilcher
45 Go-Explore: a New Approach for Hard-Exploration Problems
Go-Explore: a New Approach for Hard-Exploration Problems
Yannic Kilcher
46 Reformer: The Efficient Transformer
Reformer: The Efficient Transformer
Yannic Kilcher
47 [Interview] Mark Ledwich - Algorithmic Extremism: Examining YouTube's Rabbit Hole of Radicalization
[Interview] Mark Ledwich - Algorithmic Extremism: Examining YouTube's Rabbit Hole of Radicalization
Yannic Kilcher
48 Turing-NLG, DeepSpeed and the ZeRO optimizer
Turing-NLG, DeepSpeed and the ZeRO optimizer
Yannic Kilcher
49 Growing Neural Cellular Automata
Growing Neural Cellular Automata
Yannic Kilcher
50 NeurIPS 2020 Changes to Paper Submission Process
NeurIPS 2020 Changes to Paper Submission Process
Yannic Kilcher
51 Deep Learning for Symbolic Mathematics
Deep Learning for Symbolic Mathematics
Yannic Kilcher
52 Online Education - How I Make My Videos
Online Education - How I Make My Videos
Yannic Kilcher
53 [Rant] coronavirus
[Rant] coronavirus
Yannic Kilcher
54 Axial Attention & MetNet: A Neural Weather Model for Precipitation Forecasting
Axial Attention & MetNet: A Neural Weather Model for Precipitation Forecasting
Yannic Kilcher
55 Agent57: Outperforming the Atari Human Benchmark
Agent57: Outperforming the Atari Human Benchmark
Yannic Kilcher
56 State-of-Art-Reviewing: A Radical Proposal to Improve Scientific Publication
State-of-Art-Reviewing: A Radical Proposal to Improve Scientific Publication
Yannic Kilcher
57 Dream to Control: Learning Behaviors by Latent Imagination
Dream to Control: Learning Behaviors by Latent Imagination
Yannic Kilcher
58 POET: Endlessly Generating Increasingly Complex and Diverse Learning Environments and Solutions
POET: Endlessly Generating Increasingly Complex and Diverse Learning Environments and Solutions
Yannic Kilcher
59 Evaluating NLP Models via Contrast Sets
Evaluating NLP Models via Contrast Sets
Yannic Kilcher
60 [Drama] Who invented Contrast Sets?
[Drama] Who invented Contrast Sets?
Yannic Kilcher

This video teaches how to build generative neural network models of reinforcement learning environments using variational autoencoders and RNNs, and how to use these models for policy learning and stochasticity control. The key concepts include environment modeling, latent representations, and stochasticity, and the video provides a step-by-step guide on how to train and use these models.

Key Takeaways
  1. Collect ten thousand rollouts from random policy
  2. Train VAE to learn latent representation of environment
  3. Sample from latent distribution and feed to decoder
  4. Train RNN to predict evolution of latent representations over time
  5. Use temperature parameter to control stochasticity in RNN's output
  6. Parameterize mixture of Gaussian distributions for stochasticity
  7. Use linear model as simple controller to determine action probabilities
💡 The temperature parameter controls the stochasticity of the model, and turning it up gives a more stochastic procedure, sampling different futures and punishing the policy if it tries to exploit a mistake.

Related AI Lessons

Stop Blaming the Model. Your AI Agents Need a Control Plane
Learn why a control plane is crucial for AI agents, going beyond just the core agent loop
Medium · Data Science
What 12 failure classes and 30 Billion tokens spent taught us about trusting AI coding agents
Learn from 12 failure classes of AI coding agents to improve trust and reliability in production environments
Dev.to · keesan.eth
Lumo Is a Privacy-Focused AI Chatbot, With Clear Limits
Learn about Lumo, a privacy-focused AI chatbot with no chat logs, and understand its implications on user data protection
Dev.to · Simon Paxton
I Let 5 AI Agents Shop For Me in 2026. It Went About as Well as You’d Expect.
Learn from an experiment where 5 AI agents were used to shop for everyday items, highlighting what works and what doesn't in AI-powered shopping
Medium · AI
Up next
Building Great Agent Skills: The Missing Manual
AI Engineer
Watch →