World Models

Yannic Kilcher · Beginner ·🤖 AI Agents & Automation ·8y ago

Skills: Agent Foundations90%Tool Use & Function Calling80%Multi-Agent Systems70%Autonomous Workflows70%ML Maths Basics60%

Key Takeaways

The video discusses building generative neural network models of popular reinforcement learning environments using variational autoencoders and recurrent neural networks, with applications in world modeling and policy learning.

Full Transcript

hi today we're looking at world models by David ha and Jung schmidhuber this is a paper that's concerned with reinforcement learning and especially with the problem of say you have an environment that you interact with and you kind of need to learn to act in it but it could be for example very expensive to always query the environment so let's say you have like a robot and it needs to do something in the world and you kind of to have a robot execute something and then observe it is quite expensive cost electricity and so on so you would like to sort of minimize how many times the how many times this happens so here searching for a good picture they're concerned with problems for example like this this is a race car simulator simulator there's an open eye gym environment for that the other one that they use is so-called like a doom experiment where as you look at this there's a couple of monsters and they're shooting fireballs at you and the task is just to kind of avoid the fireballs so the entire point of the paper is that I don't actually need to interact with the environment nor to learn it I can simply kind of learn a model of the environment and then learn using that model so basically I can learn how the environment works and then simply use my imagination of the environment my model in order to to learn from that so I don't have to interact with the real environment anymore so how do they do this they do it in multiple stages here first thing they do is they collect a bunch of samples from the environment so they go to the environment they simply do a random policy and then they collect a bunch of samples I think the process is outlined down here somewhere we saw it before here collect ten thousand roll outs from around the policy next they train a VA e here to kind of learn the environment so that's where that's where that comes in this is all done in stage it's not end to end the VI e is simply a model that takes a takes in this case a video frame here and it sends it through an encoder neural network to obtain what's called a latent representation which is a much smaller dimensional representation so if the image is like 64 by 64 pixels then the the latent code could be as little as like a hundred or even ten dimensional so you see that there's quite a bit of compression going on and this is a variational auto encoder it's not really important here that it's variational since they the difference is the variation auto encoder is kind of a stochastic process whereas the regular auto encoder isn't but they introduced elasticity later again so it's not particularly important but so it's a variational auto encoder which means they obtain a latent representation that defines distribution over outputs so they send this they sample from this latent distribution that they obtain and then they feed this to the decoder and the decoder can give back what it thinks the encoder encoded so the decoder tries to reconstruct as close as possible this original frame that was given to the encoder but of course it can't because we've compressed it so much to this to this lower dimensional representation here so it kind of does its best effort so what you hope to achieve with this is that kind of the decoder learns for exam well there's always here this is the the ceiling right here is always gray so basically you you shouldn't actually need to encode this in your in your xiv if it's always gray the decoder should learn this by itself so your hope is that the disease alight in representation will simply end up containing just the information that's kind of different or between the between the individual frames which here I guess would be kind of the the fireballs coming and your position relative to them and that's that's what's changing if you think about this environment so your hope is that the latent representation captures only that whereas all the static parts that are irrelevant or never change are kind of captured by the encoder and the decoder architecture by itself so yeah it's important to note that coder and decoder obviously always the same for all the frames whereas the Z representation of course is there is one per frame so each frame will give you a different Z and that's yeah so you can imagine how that works or how that's going to be useful so they they train this on like a random randomly collected sample of the environment until they're confident they now have a good model of the environment and then what they do next is they use this in order to train an RNN so again they kind of have their their compression model of the environment what they do now is they use these these states you see here here here here that they get from that and they train how these latent representations evolve over time so with an origin here it goes over time so the RNN will always kind of predict what's the next stage of the environment going to be but importantly maybe compared to environment models that we've discussed before in for example imagination augmented agent paper there we always try to directly predict the future pixels so to say of the future frame here the environment model is over the latent representation of course this means that the this is much smaller space so if you're compression model is good then this should be much easier to learn than say like a full end-to-end environment model so this model learns how your latent states evolve over time given your actions right so you can you can imagine that the Z being an abstract representation of your state and then your action and the distal isn't adorned and the Ireland will predict what's the next latent representation and there is a what's called a temperature parameter to control the stochasticity I've already told you this this um there is a stockist his stochasticity built in to this so the RNA will simply output like some some vector what it thinks is the next thing going to be and they don't use this directly as the next step but they parameterize a kind of a mixture of Gaussian distributions coupled with a decoder here in order to give a random distribution over there the next state and they control the amount of randomness with the temperature parameter they argue that this comes in handy later so alright so what do we have we have a system that can compress the environment into what we would call an essential part right every frame we extract there what's important in that frame then next we have a model that can predict given a state and in action what's the next state going to be the next latent state so technically we now have an environment model right given a state we can simply give a state and a policy we can simply use this model to roll forward so the last component is the actual policy and the actual policy here as you can see is in there K simply a linear model the linear model will take the Z which is the latent representation of the current state and the H which is the current state of the of the RNN that models the environment over time and it will it simply is a linear linear function of the two gives you the action probabilities or I guess the law gates of the actions so it's a really really simple controller over these things and they do this in order kind of to show that that the main part of the work is being done by this environment model and given the environment model you only need very few parameters basically to then learn a policy here is the kind of what I said in in a diagram so the observation goes into the the compression of the VA e the latent representation of that goes into the RN together with the hidden state from the last step and this will out output a new hidden state which goes here into the controller and we also directly take this Z into the controller and then from these two we perform an action which now we have a choice it could go to the environment right give you the next observation but also or at the same time since you kind of need to update your or an n it can go here and update your or an N because it will need to predict the next state the thing is we can also now leave away this path which means we can simply take our RNN and our kind of imagine the next latent representation put it through the decoder part of the VI e and use that as an observation I hope this makes sense it it's it's rather intuitive right you have a model of the environment you can simply use this instead of the real environment so that's a bit of pseudocode here and they do a bunch of experiments right so we're primarily interested so they they say they see here how okay our compression works and this is the real frame and this is the reconstructed frame kind of looks you know captures the essence of what's going on and I actually want to go down here the visiting experiment so what they what they do here in the car racing experiment is they kind of learn this entire thing right and then they learn a policy in the in the real world so in the environment using using this model up here this procedure where they always go to the environment and here is the exact experiment setup so first day were they collect again robots for a random policy they train to be a tight rein the RNN and then they learn the the latent sorry they learn the controller using the entire model but in kind of the real world so they always interact with the environment but because they also have their kind of latent representation of the observation and not directly the observation they get a higher score and also the policy that they use in the real environment transfers to the environment model so the policy that learn in the true environment it transfers to the imagined so if they use the imagined model as an environment it also performs well in the next experiment they're gonna try to do this the other way around they're gonna try to learn only using their model of the environment and then see whether or not the policy transfers to the true environment so that's what they do here they collect collect again a sample from the environment they train the via either trained RNN and then they simply use this virtual environment what they call it in order to learn a policy and at the end they try to transfer you to learn policy on the actual environment and given the results you see yep so they you see the kind of best it it does I would say is about here where the actual score is you can see in this and also in this setting is higher than the kind of previous best algorithm in the open era gene when you go from virtual to actual so what this means is kind of yeah you can you can train using this imagined model and then it will actually transfer but but there's a crucial thing and that is this kind of temperature thing here you can see a lot of times they actually don't manage to reach a good score if this parameter is wrong what does this parameter do this parameter controls as we discussed the stochasticity of the model so basically the environment model doesn't directly imagine future state but it imagines a distribution over future states and the higher this parameter the more stochastic this distribution is basically the more uniform I guess the more entropy you have in these future states we just result we've seen this temperature parameter here which is important because they go into length explaining why in this entire page here that we skipped you see just text they're cheating the world model which basically they say okay if you have a wrong model if you have a model that's wrong on the environment and you train a policy on it necessarily it's gonna make probably find like a policy then that's kind of exploits the wrongness of this model so you might be able to walk through walls or fly or ignore the the fireballs or basically yeah fine find some or find that if you stand next to a wall in your imagination you'll never get hit something like this which isn't true in the real world and so the the policy will exploit that and to counter this they simply basically turn up this temperature parameter giving them a more stochastic procedure meaning they imagine a lot of kind of different futures and they turn their policy on all of them or in expectation of a sample of them which means that if the environment model is wrong it is kind of I wanna say if it's wrong this correct ready it doesn't but if it's wrong you still sample different futures so if it has one wrong future you still have the other ones to kind of punish the policy if it tries to exploit this one mistake at least that's the the reasoning behind it so the that's that's how they do this you can interact with their trained environment models online somehow they also give a a kind of a look at what they would like to have his they would like to kind of instead of collecting the environment model from random rollout to take would try to train it and to use it again to collect more data to train more environment model than use the environment better environment model to train more the policy and so on in a stepwise fashion book they don't actually do it they simply describe it yeah and the rest of the paper is a bit of related work and discussion it's very it's very prosaically written kind of different from what you're used to if you read a lot of these papers but yeah I hope you can know you know what's going on and see you next time

Original Description

Authors: David Ha, Jürgen Schmidhuber Abstract: We explore building generative neural network models of popular reinforcement learning environments. Our world model can be trained quickly in an unsupervised manner to learn a compressed spatial and temporal representation of the environment. By using features extracted from the world model as inputs to an agent, we can train a very compact and simple policy that can solve the required task. We can even train our agent entirely inside of its own hallucinated dream generated by its world model, and transfer this policy back into the actual environment. https://arxiv.org/abs/1803.10122

Watch on YouTube ↗ (saves to browser)

Sign in to unlock AI tutor explanation · ⚡30

Playlist

Uploads from Yannic Kilcher · Yannic Kilcher · 7 of 60

← Previous Next →

Imagination-Augmented Agents for Deep Reinforcement Learning

Imagination-Augmented Agents for Deep Reinforcement Learning

Learning model-based planning from scratch

Learning model-based planning from scratch

Reinforcement Learning with Unsupervised Auxiliary Tasks

Reinforcement Learning with Unsupervised Auxiliary Tasks

Attention Is All You Need

Attention Is All You Need

git for research basics: fundamentals, commits, branches, merging

git for research basics: fundamentals, commits, branches, merging

Curiosity-driven Exploration by Self-supervised Prediction

Curiosity-driven Exploration by Self-supervised Prediction

Challenging Common Assumptions in the Unsupervised Learning of Disentangled Representations

Challenging Common Assumptions in the Unsupervised Learning of Disentangled Representations

Stochastic RNNs without Teacher-Forcing

Stochastic RNNs without Teacher-Forcing

What’s in a name? The need to nip NIPS

What’s in a name? The need to nip NIPS

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

GPT-2: Language Models are Unsupervised Multitask Learners

GPT-2: Language Models are Unsupervised Multitask Learners

Neural Ordinary Differential Equations

Neural Ordinary Differential Equations

The Odds are Odd: A Statistical Test for Detecting Adversarial Examples

The Odds are Odd: A Statistical Test for Detecting Adversarial Examples

Discriminating Systems - Gender, Race, and Power in AI

Discriminating Systems - Gender, Race, and Power in AI

Blockwise Parallel Decoding for Deep Autoregressive Models

Blockwise Parallel Decoding for Deep Autoregressive Models

S.H.E. - Search. Human. Equalizer.

S.H.E. - Search. Human. Equalizer.

Reinforcement Learning, Fast and Slow

Reinforcement Learning, Fast and Slow

Adversarial Examples Are Not Bugs, They Are Features

Adversarial Examples Are Not Bugs, They Are Features

I'm at ICML19 :)

I'm at ICML19 :)

Population-Based Search and Open-Ended Algorithms

Population-Based Search and Open-Ended Algorithms

XLNet: Generalized Autoregressive Pretraining for Language Understanding

XLNet: Generalized Autoregressive Pretraining for Language Understanding

Conversation about Population-Based Methods (Re-upload)

Conversation about Population-Based Methods (Re-upload)

Reconciling modern machine learning and the bias-variance trade-off

Reconciling modern machine learning and the bias-variance trade-off

Learning World Graphs to Accelerate Hierarchical Reinforcement Learning

Learning World Graphs to Accelerate Hierarchical Reinforcement Learning

Manifold Mixup: Better Representations by Interpolating Hidden States

Manifold Mixup: Better Representations by Interpolating Hidden States

Processing Megapixel Images with Deep Attention-Sampling Models

Processing Megapixel Images with Deep Attention-Sampling Models

Gauge Equivariant Convolutional Networks and the Icosahedral CNN

Gauge Equivariant Convolutional Networks and the Icosahedral CNN

Auditing Radicalization Pathways on YouTube

Auditing Radicalization Pathways on YouTube

RoBERTa: A Robustly Optimized BERT Pretraining Approach

RoBERTa: A Robustly Optimized BERT Pretraining Approach

Dynamic Routing Between Capsules

Dynamic Routing Between Capsules

DEEP LEARNING MEME REVIEW - Episode 1

DEEP LEARNING MEME REVIEW - Episode 1

Accelerating Deep Learning by Focusing on the Biggest Losers

Accelerating Deep Learning by Focusing on the Biggest Losers

[News] The Siraj Raval Controversy

[News] The Siraj Raval Controversy

LeDeepChef 👨‍🍳 Deep Reinforcement Learning Agent for Families of Text-Based Games

LeDeepChef 👨‍🍳 Deep Reinforcement Learning Agent for Families of Text-Based Games

The Visual Task Adaptation Benchmark

The Visual Task Adaptation Benchmark

IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures

IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures

AlphaStar: Grandmaster level in StarCraft II using multi-agent reinforcement learning

AlphaStar: Grandmaster level in StarCraft II using multi-agent reinforcement learning

SinGAN: Learning a Generative Model from a Single Natural Image

SinGAN: Learning a Generative Model from a Single Natural Image

A neurally plausible model learns successor representations in partially observable environments

A neurally plausible model learns successor representations in partially observable environments

MuZero: Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model

MuZero: Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model

Reinforcement Learning Upside Down: Don't Predict Rewards -- Just Map Them to Actions

Reinforcement Learning Upside Down: Don't Predict Rewards -- Just Map Them to Actions

NeurIPS 19 Poster Session

NeurIPS 19 Poster Session

Go-Explore: a New Approach for Hard-Exploration Problems

Go-Explore: a New Approach for Hard-Exploration Problems

Reformer: The Efficient Transformer

Reformer: The Efficient Transformer

[Interview] Mark Ledwich - Algorithmic Extremism: Examining YouTube's Rabbit Hole of Radicalization

[Interview] Mark Ledwich - Algorithmic Extremism: Examining YouTube's Rabbit Hole of Radicalization

Turing-NLG, DeepSpeed and the ZeRO optimizer

Turing-NLG, DeepSpeed and the ZeRO optimizer

Growing Neural Cellular Automata

Growing Neural Cellular Automata

NeurIPS 2020 Changes to Paper Submission Process

NeurIPS 2020 Changes to Paper Submission Process

Deep Learning for Symbolic Mathematics

Deep Learning for Symbolic Mathematics

Online Education - How I Make My Videos

Online Education - How I Make My Videos

[Rant] coronavirus

[Rant] coronavirus

Axial Attention & MetNet: A Neural Weather Model for Precipitation Forecasting

Axial Attention & MetNet: A Neural Weather Model for Precipitation Forecasting

Agent57: Outperforming the Atari Human Benchmark

Agent57: Outperforming the Atari Human Benchmark

State-of-Art-Reviewing: A Radical Proposal to Improve Scientific Publication

State-of-Art-Reviewing: A Radical Proposal to Improve Scientific Publication

Dream to Control: Learning Behaviors by Latent Imagination

Dream to Control: Learning Behaviors by Latent Imagination

POET: Endlessly Generating Increasingly Complex and Diverse Learning Environments and Solutions

POET: Endlessly Generating Increasingly Complex and Diverse Learning Environments and Solutions

Evaluating NLP Models via Contrast Sets

Evaluating NLP Models via Contrast Sets

[Drama] Who invented Contrast Sets?

[Drama] Who invented Contrast Sets?

This video teaches how to build generative neural network models of reinforcement learning environments using variational autoencoders and RNNs, and how to use these models for policy learning and stochasticity control. The key concepts include environment modeling, latent representations, and stochasticity, and the video provides a step-by-step guide on how to train and use these models.

Key Takeaways

Collect ten thousand rollouts from random policy
Train VAE to learn latent representation of environment
Sample from latent distribution and feed to decoder
Train RNN to predict evolution of latent representations over time
Use temperature parameter to control stochasticity in RNN's output
Parameterize mixture of Gaussian distributions for stochasticity
Use linear model as simple controller to determine action probabilities

💡 The temperature parameter controls the stochasticity of the model, and turning it up gives a more stochastic procedure, sampling different futures and punishing the policy if it tries to exploit a mistake.

🔒 Pro feature: Ask AI to explain this lesson →

More on: Agent Foundations

View skill →

Build and Deploy an Agent with Reasoning Engine in Vertex AI

Adding a Phone Gateway to a Virtual Agent

From Zero to Working AI Agent in 60 Seconds

From Zero to Working AI Agent in 60 Seconds

Create An AI Agent With Replit That Automates Your Sales

Create An AI Agent With Replit That Automates Your Sales

Capstone: Autonomous Runway Detection for IoT

Capstone: Autonomous Runway Detection for IoT

AI Agents with Model Context Protocol & Typescript

AI Agents with Model Context Protocol & Typescript

Related AI Lessons

The Hidden Cost of Repetitive Online Work (And How AI Is Changing It)

Discover how AI can help reduce the hidden costs of repetitive online work and improve business efficiency

How to Use AI Automation to Scale Your Freelance Business in 2026

Learn how to scale your freelance business using AI automation to increase efficiency and profitability in 2026

AI Agents Struggle To Read B2B Pricing, Report Finds via @sejournal, @MattGSouthern

AI agents struggle to read B2B pricing, often relying on third-party sources, highlighting limitations in current AI technology

Search Engine Journal

Five tool-calling patterns that separate hobby AI agents from production ones

Learn 5 tool-calling patterns to take your AI agents from hobby to production-ready, handling failures and unexpected results

Dev.to · Penloom Studio

Building Great Agent Skills: The Missing Manual