Dream to Control: Learning Behaviors by Latent Imagination

Yannic Kilcher · Beginner ·🤖 AI Agents & Automation ·6y ago

Key Takeaways

The video discusses Dreamer, an RL agent by DeepMind that learns continuous control tasks through forward imagination in latent space. Dreamer learns a latent world model (representation, transition, and reward) from collected experience, for example with a CNN encoder and an LSTM, and then trains a policy and a value function on trajectories imagined in that latent space.

Full Transcript

Hi there. Today we're looking at "Dream to Control: Learning Behaviors by Latent Imagination" by Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. This is a reinforcement learning paper that iterates on a series of previous papers where the goal is to learn a policy. In this case they want to learn policies for continuous control tasks: physics-based robots, these hopper or walker types of tasks, where you have to control the robot's joints in order to move forward. As usual in reinforcement learning, you have multiple observations, and from each observation you need to somehow come up with an action. That will give you the next observation as well as a reward. If your goal is to move this spider, maybe the reward is proportional to how far you move, so your goal is to collect the maximum reward, which means you have to move the spider as far as possible simply by doing the correct actions.

The goal of this paper is to do this by learning to plan ahead in latent space. As you can see here, they take the observation and feed it through an encoder. You can think of this as a convolutional neural network, or anything that can take an image as input and give you a hidden representation. So this here is the hidden representation. From this hidden representation you can determine what the next action is going to be, and then you get a new observation, and again you can feed that, along with the last hidden state, into a new hidden state. Previous models already do this a lot: you encode your observation, and you have, let's say, a recurrent neural network that incorporates all of the observations into a hidden state, along with the actions you take, and then you always decide on a next action to do.

So what does this model do differently? This model wants to do all of this in hidden space. It wants to say: OK, I am here, I have this observation, and my encoder tells me that this gives me this hidden state. Now it wants to take in the action it is doing and, without seeing the next observation, predict the next hidden state. If I am here and I do this action (the action might be to put the joystick to the right), it will learn the hidden state corresponding to the spider being a bit more to the right than it is right now. It will do so for a number of time steps into the future, and it will learn from its own imagination. It will imagine how the hidden states look in the future and then learn from that, instead of having to really do the actions in the real world.

We've already looked at a number of papers like this, including MuZero or I2A. This one is slightly different. What is different is that in MuZero we used the latent model in order to plan ahead, to do our decision tree planning and so on. This model doesn't do that. This model still wants to come up with a single policy. On the right is the final result: you encode your state, which gets you a hidden representation, from that you determine what your action is going to be, then you have your next state, and so on. So the final goal is simply a policy, a single-shot policy without any Monte Carlo tree expansion. But it wants to learn this policy not by interacting with the real world, like here on the left, but by interacting only in the dream world right here.

The crucial part, if you want to learn from your dreams, is to make sure that your dreams are an accurate representation of the real world. We already saw this in a paper called World Models, by Jürgen Schmidhuber I believe. In that paper they first collected experience like this one, and then they learned from one observation to predict the next ones, and/or to predict the next hidden states. They did so by basically moving in the world at random: they have this little spider thingy and just do random movements, collect these trajectories, and then learn from the random trajectories. The difference in this paper is that it does these steps iteratively. It will not learn only from a random policy. It will start out random, learn a good policy for its environment model, then go back and act with that policy in order to learn a better environment model, and then use the better environment model to learn a better policy.

If this wasn't clear enough, we'll jump to the algorithm. The algorithm isn't actually too complicated. As I said, I think it's a relatively minor iteration on previous research, but it appears to work, and it works in these continuous control tasks. You have three models here that you need to learn, and that's what you see over here: representation, transition, and reward. You'll see they all have the same parameters, which gives you an indication that these things are a single model.

Now what is this model of representation, transition, and reward? This is the thing on the left here. In this part of the algorithm you assume that you have a policy, so you already know which actions you take, or you can even assume that you have some experience: your agent is running with a given policy and you simply collect that. What is given is the observation sequence, the actions you took, and the rewards you got, since each action gives you a reward. These things are provided to you. What do you want to learn? You want to learn a representation, a transition, and, let's say, a reward, so you also want to predict the next reward. As we already said, you can do this by encoding the state using, for example, a CNN, and then using an LSTM to incorporate this over time. So you learn the transition from one hidden state to the next hidden state, you learn how the observation goes into the hidden state, and thirdly you learn that if I'm in this hidden state and take this particular action, I will get this reward in the future. You can learn this from a set of experience that you have in, let's say, your replay buffer.

This is one model, and you learn it here in the first step, in the section called dynamics learning. While not converged, you do dynamics learning: you draw data sequences from your experience, then you compute the model states (these are the hidden states), and then you update the parameter θ using representation learning. They don't really specify what representation learning is, but they do give examples of what you can do. I think their point is: whatever you need to do in order to learn these representations. One example is drawn here: you can learn a model that reconstructs the same state. If you give the observation as an input, it goes through the hidden state, and you can learn a decoder that reconstructs that observation. This is usually done in things like variational autoencoders in order to produce generative models. There, this part here would be the generator, and that would be the thing of interest if you were doing a variational autoencoder. Here, of course, our quantity of interest is the encoder model, because we want a good representation of the state. But it comes down to the same thing: if you can learn a model that accurately reconstructs the observation, then your representation here in the middle is probably an informative one. Because you learn the same model across multiple observations, it has to accurately encode what makes one observation different from another. So this is how you learn the θ parameters.

The other models here are the action and the value parameters, and these are learned in the step called behavior learning. In behavior learning, what they say is: imagine trajectories from each of the states that you have. From each of the observations here you obtain the hidden states. Then, from each of the hidden states, you use the model that you learned here, through the LSTM, to imagine future trajectories of hidden states. So what is given now is the observation here and the hidden state, and you're going to imagine future hidden states. You're also going to imagine future rewards, and you're going to use your policy to determine which actions you take. The ultimate goal is to learn a good policy, a policy that will give you better rewards in the future.

So this is regular reinforcement learning, except for one difference. In regular reinforcement learning I have my observation, I encode it, and I determine what action I want to take. Then I feed that action back into the environment, which gives me the next observation, and I use that, maybe in conjunction with the last hidden state, to determine the next action. Here, since we learned a dynamics model of the hidden states, we can simply determine the action, compute what the probable next hidden state is going to be, use that to determine an action again, and so on. There's no need to go through the environment, which means we can potentially learn much faster, without having to expensively interact with the environment. These models here might also be quite large, so our backprop now only needs to happen through this path, or through this path here in case we have discrete actions.

So that is where the dynamics learning comes in. Down here, as you can see, we predict the rewards and the values and compute value estimates, and then we update these parameters. What we have here is a value function. The value function depends on this ψ here, and we update it using a gradient of its output minus the true value. This here is an estimate of the value. As you know, a value function is supposed to tell you the complete future reward given a state. It's important for us to have a function that can estimate that, because then we can take actions: if we can make this function go high, and it is an accurate function, that means we get a lot of reward in the future. So it's important to learn this function, and here you can see we adjust it in the direction of matching this quantity better. We'll get to this quantity in a second.

You can also see we update this parameter, which is the action model. The action model depends on it, and this is our policy: this thing here determines which action we take. We update it in the direction of the gradient with respect to this value function, so we train the policy to maximize the value, which is all the future rewards that we get. We can do this because we can now backpropagate through all of these time steps: we have the transition model, so we can backpropagate through all of it, which is pretty cool.

In my opinion, the workhorse of this paper might be this quantity here. How exactly do you compute the value of a state? Especially in these continuous control tasks you sometimes have a lot of steps, so the trajectories might be pretty long, longer than what you can reasonably backpropagate through from time step to time step. Even an LSTM might only be able to backprop through, let's say, a couple of dozen or maybe a few hundred steps in time, and maybe you have longer trajectories here. I think this value estimate is a main component of extending that range.

They say this is computed according to equation 6. Again, it is my opinion that this is the workhorse of the method. It's a three-step process, and it's pretty heavy. This is the quantity they estimate with the value function. It is a kind of average: H is the time horizon that you are looking at, and the estimate is set between these two things, across a sum over the time horizon. Each of those things is again a sum, over this τ here up to h minus 1, and h here is the minimum of τ plus k and t plus the horizon. So this quantity looks k steps into the future. For each step up to the horizon we look k steps into the future, and for each of those we sum again across these quantities here. And what are these quantities? A mixture of the reward you get in that particular step, plus your own estimate of the value function at the horizon step, discounted accordingly.

If you imagine a number of time steps that you took, and each time you get a reward, this is a very complicated way of going into the future and summing up the rewards, going more steps and summing up the rewards again in a different fashion, and then mixing these individual quantities that you got from accumulating all of these in a weird fashion. That allows you to look way beyond the imagined steps. In particular, you see here that your estimate will include your own value function, which again probably looks into the future. So what you accumulate from the last step in your time horizon already includes information from all the future steps, because you take your own value estimate into account. I think it's very convoluted, but again, I think this complicated value estimate allows you to have a better value estimate far into the future.

They do show some samples here of what they can do. I haven't found any videos of it, unfortunately, but it appears to work pretty well. They have a discussion of different representation learning methods, different experiments, ablations, and so on. So I invite you to look at this paper, and I hope this was somewhat clear, but I
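For reference, the value estimate that the speaker calls the workhorse of the method can be written out as follows. This is a reconstruction from the spoken description and from the paper's equations 5 and 6 as best they can be recovered here, so the symbols should be checked against the paper. The notation is an assumption where the transcript is silent: r_n are the imagined rewards, v_ψ is the value model, γ is the discount factor, λ is the mixing weight, t is the step at which imagination starts, and H is the imagination horizon.

```latex
\begin{aligned}
\mathrm{V}_N^{k}(s_\tau) &= \mathbb{E}\!\left[\,\sum_{n=\tau}^{h-1} \gamma^{\,n-\tau}\, r_n \;+\; \gamma^{\,h-\tau}\, v_\psi(s_h)\right],
\qquad h = \min(\tau + k,\; t + H), \\[4pt]
\mathrm{V}_\lambda(s_\tau) &= (1-\lambda) \sum_{n=1}^{H-1} \lambda^{\,n-1}\, \mathrm{V}_N^{n}(s_\tau) \;+\; \lambda^{\,H-1}\, \mathrm{V}_N^{H}(s_\tau).
\end{aligned}
```

Read this way, each n-step estimate sums k discounted rewards and then bootstraps from the value model at step h, and the final estimate mixes those n-step estimates. Setting λ = 0 leaves only the one-step estimate, while λ = 1 leaves only the estimate that runs all the way to the horizon.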

Original Description

Dreamer is a new RL agent by DeepMind that learns a continuous control task through forward-imagination in latent space.

https://arxiv.org/abs/1912.01603
Videos: https://dreamrl.github.io/

Abstract: Learned world models summarize an agent's experience to facilitate learning complex behaviors. While learning world models from high-dimensional sensory inputs is becoming feasible through deep learning, there are many potential ways for deriving behaviors from them. We present Dreamer, a reinforcement learning agent that solves long-horizon tasks from images purely by latent imagination. We efficiently learn behaviors by propagating analytic gradients of learned state values back through trajectories imagined in the compact state space of a learned world model. On 20 challenging visual control tasks, Dreamer exceeds existing approaches in data-efficiency, computation time, and final performance.

Authors: Danijar Hafner, Timothy Lillicrap, Jimmy Ba, Mohammad Norouzi

Links:
YouTube: https://www.youtube.com/c/yannickilcher
Twitter: https://twitter.com/ykilcher
BitChute: https://www.bitchute.com/channel/yannic-kilcher
Minds: https://www.minds.com/ykilcher
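The abstract's phrase "propagating analytic gradients ... back through trajectories imagined" can be illustrated with a deliberately tiny toy. The sketch below is not Dreamer: it assumes a scalar latent state, a linear transition `s' = a*s + b*action`, a deterministic linear policy `action = k*s`, and a reward equal to the next latent state. All names (`imagine_return`, `analytic_gradient`, `k`, `a`, `b`) are hypothetical. It only shows that, once the transition is a known differentiable function, the gradient of the imagined return with respect to the policy parameter can be computed by walking backwards through the imagined steps, with no environment interaction.

```python
def imagine_return(k, s0, a, b, horizon, gamma):
    """Roll a scalar toy latent model forward and sum discounted rewards."""
    states, total, s = [s0], 0.0, s0
    for t in range(horizon):
        action = k * s               # deterministic linear policy (toy assumption)
        s = a * s + b * action       # stand-in for a learned transition model
        total += gamma ** t * s      # stand-in for a learned reward model
        states.append(s)
    return total, states


def analytic_gradient(k, s0, a, b, horizon, gamma):
    """Gradient of the imagined return w.r.t. k, by backpropagation through time."""
    _, states = imagine_return(k, s0, a, b, horizon, gamma)
    m = a + b * k                    # closed-loop transition: s_{t+1} = m * s_t
    grad_k, grad_s = 0.0, 0.0        # grad_s tracks d(return) / d(s_{t+1})
    for t in reversed(range(horizon)):
        grad_s += gamma ** t         # the reward at step t reads s_{t+1} directly
        grad_k += grad_s * b * states[t]   # s_{t+1} depends on k through b * k * s_t
        grad_s *= m                  # pass the gradient back to s_t
    return grad_k
```

A gradient ascent step on `k` using `analytic_gradient` would increase the imagined return in this toy. In the real agent the policy is a neural network over a learned latent space, so the same idea is carried out by automatic differentiation rather than by hand.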

This video explains how Dreamer, an RL agent by DeepMind, learns behaviors by latent imagination: it learns a world model through representation learning and then trains its policy inside that model's latent space. The video walks through the continuous control setting, the dynamics learning and behavior learning steps of the algorithm, and the value estimate used to train the policy.

Key Takeaways

Collect experience data with the current policy
Draw data sequences from the experience
Compute the model states (hidden states)
Update the model parameters using representation learning
Use the learned model to imagine future trajectories from each of the states
Determine actions in imagination using the policy
Predict rewards and values and compute the value estimate using equation 6
Update the value function so that its output moves toward the value estimate
Update the action model to maximize the value
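The value-estimate step in the list above can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names `n_step_estimate` and `lambda_estimate`, the list-based inputs, and the parameter names `gamma` and `lam` are hypothetical. It follows the structure described in the transcript: an n-step estimate sums discounted imagined rewards and then bootstraps from the value model at step h, the minimum of tau plus k and the horizon, and the final estimate mixes the n-step estimates. Imagination is assumed to start at index 0, with `values` holding one more entry than the number of imagined steps.

```python
def n_step_estimate(rewards, values, tau, k, gamma):
    """k-step estimate from state tau: discounted rewards, then a value bootstrap."""
    last = len(values) - 1                 # index of the final imagined state
    h = min(tau + k, last)                 # never look past the imagination horizon
    total = sum(gamma ** (n - tau) * rewards[n] for n in range(tau, h))
    return total + gamma ** (h - tau) * values[h]


def lambda_estimate(rewards, values, tau, gamma, lam):
    """Mix the n-step estimates with weights that decay by a factor lam per step."""
    horizon = len(values) - 1
    mix = sum(
        lam ** (n - 1) * n_step_estimate(rewards, values, tau, n, gamma)
        for n in range(1, horizon)
    )
    tail = lam ** (horizon - 1) * n_step_estimate(rewards, values, tau, horizon, gamma)
    return (1 - lam) * mix + tail
```

For example, with `rewards = [1, 1, 1]`, `values = [0, 0, 0, 10]` and `gamma = 0.5`, calling `lambda_estimate(rewards, values, 0, 0.5, 0.0)` gives `1.0` (only the one-step estimate survives), while `lam = 1.0` gives `3.0` (all three rewards plus the discounted value at the horizon). In a training loop this number would serve as the regression target for the value function and as the quantity the action model tries to increase.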

💡 The Dreamer agent uses latent imagination to learn behaviors and can be applied to continuous control tasks, providing a new approach to reinforcement learning.
