Reinforcement Learning with Unsupervised Auxiliary Tasks

Yannic Kilcher · Advanced · Research Papers Explained · 8y ago

Key Takeaways

The video discusses the paper 'Reinforcement Learning with Unsupervised Auxiliary Tasks', which proposes auxiliary tasks to improve learning in sparse-reward environments. The two auxiliary control tasks reward the agent for changing the pixels of its input and for changing its own network features. The full agent combines a base A3C agent with these auxiliary tasks, extra value-function training from a replay buffer, and reward prediction with skewed sampling. The video notes that stacking several techniques makes it hard to tell where the improvement comes from.
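As a rough illustration of the pixel-change task mentioned above, the sketch below computes a pseudo-reward from the average absolute change between two consecutive frames, pooled over small cells. The function name `pixel_change_reward`, the cell size, and the averaging are assumptions made for this example; the paper's exact preprocessing may differ.

```python
import numpy as np

def pixel_change_reward(prev_frame, next_frame, cell=4):
    """Hypothetical helper: mean absolute pixel change per cell x cell patch."""
    prev = np.asarray(prev_frame, dtype=np.float32)
    nxt = np.asarray(next_frame, dtype=np.float32)
    diff = np.abs(nxt - prev)
    if diff.ndim == 3:
        # Average over colour channels if the frames have them.
        diff = diff.mean(axis=-1)
    h, w = diff.shape
    # Drop edge pixels that do not fill a whole cell.
    h, w = h - h % cell, w - w % cell
    diff = diff[:h, :w]
    # One pseudo-reward per cell: the mean change inside that cell.
    return diff.reshape(h // cell, cell, w // cell, cell).mean(axis=(1, 3))

# Toy frames: only the top-left 4x4 patch changes between the two.
frame_a = np.zeros((8, 8), dtype=np.float32)
frame_b = frame_a.copy()
frame_b[:4, :4] = 1.0

rewards = pixel_change_reward(frame_a, frame_b)
no_change = pixel_change_reward(frame_a, frame_a)
```

In this toy setup only the cell that changed gets a non-zero reward, and two identical frames (the agent pushing against a wall, say) give no reward at all.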

Full Transcript

Hi there. Today we're looking at Reinforcement Learning with Unsupervised Auxiliary Tasks by Google. In this paper the authors consider a reinforcement learning task, and I can show you what it looks like. It looks like this, kind of a maze. This is an example that they give where you have to navigate the maze. It's 3D and you have to navigate from pixel inputs; you have to collect apples and reach the goal, and this gives you rewards. On the left you can see what the agent actually sees; on the right you can see it from a top-down view.

The problem is of course that the reward is very sparse, meaning that you have to navigate a lot of maze before you even get a single point. Reinforcement learning has big trouble with this because it relies on constant reward to notice what actions are good and what actions are bad. So the authors propose that, in addition to the regular loss that you would have, so your reward, which is this thing, you would also have an additional set of auxiliary tasks. Here c goes over the auxiliary control tasks that you specify. Each of those has a reward, and you're also trying to maximize these, each with some kind of a weight here. The thing is that the parameters that you maximize over control all of the different tasks, so they are partly shared between the tasks. What you're hoping is that by learning to do one thing you also learn to do another thing.

We've seen work kind of like this before, where you do it in more of an autoencoder setting. For example, the agent sees the input on the left here and tries to predict what the next input will be, what the next frame will be. The thought behind this is that if you can accurately predict what the next frame will be, maybe you've learned something useful about the environment. In this work it's different, because now we couple a reward to these tasks, and I can show you here what the authors propose as additional rewards. Sorry, they're further on top; let me go there.

Specifically, they consider here these two auxiliary control tasks. The first is pixel changes, which means that the agent actually tries to actively change pixels: it gets a reward for changing the pixels in the input, so it tries to maximize this. It needs to learn, what do I need to do to maximize my pixel changes? Probably that will be moving around, so it will learn to kind of move around and not move against the wall, because if it moves against the wall the pixels won't change. It will kind of learn to move like how a regular human agent would also move, so to speak: not into a wall, not into a dead end or something, such that the pixels always change. Of course it's not perfect. You can also change your pixels quite a bit by simply spinning around in a circle. But this is one of the auxiliary tasks that they augment the agent with.

The other one is network features. It's kind of meta-learning: here you actually reward the agent for changing its own internal activations. The hope is that it learns something about itself: how can I activate my internal neural network units? And it gets rewarded for that, so it might want to activate a lot of them and want to learn how they're activated. With this kind of self-introspection you also hope that it leads to a network that does more sophisticated tasks, or that, by nature of trying to get the most pixel changes and the most network features activated, you also learn something useful for the actual task. So these are the two tasks they propose.

In addition, and they have a drawing of this over here, they also do a lot of other things. On the top left you can see what's the base agent. This is an A3C agent, meaning that it's an actor-critic: you learn a policy and you learn a value network. We might go over this in a future video; for now just consider this a standard reinforcement learning agent. You feed its experience into a replay buffer, and out of the replay buffer you do many things. For one, you try to learn these auxiliary tasks. Note that these are shared parameters between all these networks; that's why the auxiliary tasks actually help. You also try to better learn your value function. They call this off-policy learning, because you kind of pause the A3C training for a while and then you train the value function some more, just because that helps.

You also try reward prediction in here, and the way they do it, as they explain, is kind of in a skewed sampling way. Out of all the situations you can be in, the agent will have a reward very, very few times. So what they do is they simply sample out of the replay buffer, out of all the experiences they had so far, and they sample more frequently the experiences where they've actually gotten a reward. The hope is of course that, if you look at the experience here where you actually get an apple (you can zoom in here), the agent might learn a lot faster: oh, there's some kind of apple there, and if I move towards it I get a reward. So that's the hope, that you instantly recognize high-reward situations and are not so interested in non-reward situations. Of course it does introduce bias in your sampling, and you might decide for yourself if that's good or bad. Here it seems to work.

There are a lot of experiments on these Labyrinth tasks, and of course, as with research, they reach state of the art; they're much better than anything else. No, I mean, they don't boast as much, so it's actually a fair comparison. They also evaluate on Atari games.

The criticisms that I have are twofold. First of all, the choice of auxiliary tasks is completely up to the implementer, which means that I have to decide, as an implementer of this algorithm, what my auxiliary tasks will be. Here, pixel changes and network features seem like fairly general tasks that you could apply to a lot of these kinds of problems, but it always comes down to how much knowledge about the task you would like to go into the actor. Here, I mean, you can see it makes sense to have at least the pixel changes as an auxiliary task, but it's questionable how much domain knowledge this already encodes. So the choice of these is certainly something that you have to decide as a human, and I think these are good choices: they're not too domain-specific, but they do correspond to visual, moving-around game tasks.

The other criticism, not really a criticism, just a remark, is that they do a lot of things. I mean, the paper is about the auxiliary tasks, but they also then do this skewed sampling and the off-policy value learning and so on. Of course you can argue, yeah, this is all done in other reinforcement learning work too, so that's why it's a fair comparison. I guess it's a philosophical question. If you want to reach state of the art, of course you first have to get a better method, here this would be the auxiliary tasks, this is the new idea, and then implement all the tricks that the other people have discovered. That is good because you reach the highest performance you can get, but the problem is you make it harder to compare; you make it harder to see where the improvement is coming from. Have you simply chosen better hyperparameters for the reward prediction and things? Are there any interactions, maybe, between the auxiliary tasks and the skewed sampling part? All these kinds of things wash out, and it's not really clear where the improvement is coming from.

On the other hand, if you simply take a basic algorithm, like just A3C here on the top left, and you augment it with nothing but these auxiliary tasks on the bottom left, and then you see an improvement, you can be relatively sure it's due to your new idea. But of course you won't reach any state-of-the-art numbers, because everyone that does A3C also does these tricks. It's a philosophical question. Here I am standing more on the side of not doing the tricks, or maybe doing both. Decide for yourself, and have a nice day.
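The skewed sampling for reward prediction described in the transcript can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the transition format (dicts with a `reward` key), the function name `sample_skewed`, and the default of drawing rewarding transitions half of the time are all assumptions made for the example.

```python
import random

def sample_skewed(replay, batch_size, p_rewarding=0.5, rng=random):
    """Hypothetical helper: oversample transitions that carried a reward."""
    if not replay:
        raise ValueError("replay buffer is empty")
    rewarding = [t for t in replay if t["reward"] != 0]
    non_rewarding = [t for t in replay if t["reward"] == 0]
    batch = []
    for _ in range(batch_size):
        want_rewarding = rng.random() < p_rewarding
        if want_rewarding and rewarding:
            pool = rewarding
        elif non_rewarding:
            pool = non_rewarding
        else:
            # Only rewarding transitions exist, so fall back to them.
            pool = rewarding
        batch.append(rng.choice(pool))
    return batch

# Toy buffer: 2 rewarding transitions out of 100 (2% of the experience).
replay = [{"obs": i, "reward": 1.0 if i in (10, 60) else 0.0} for i in range(100)]

batch = sample_skewed(replay, batch_size=1000, rng=random.Random(0))
fraction_rewarding = sum(t["reward"] != 0 for t in batch) / len(batch)
```

With uniform sampling roughly 2% of this batch would contain a reward; with the skew it is close to half. That is the bias the transcript mentions: the sampled batch no longer reflects how often rewards actually occur.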

Original Description

https://arxiv.org/abs/1611.05397

Abstract: Deep reinforcement learning agents have achieved state-of-the-art results by directly maximising cumulative reward. However, environments contain a much wider variety of possible training signals. In this paper, we introduce an agent that also maximises many other pseudo-reward functions simultaneously by reinforcement learning. All of these tasks share a common representation that, like unsupervised learning, continues to develop in the absence of extrinsic rewards. We also introduce a novel mechanism for focusing this representation upon extrinsic rewards, so that learning can rapidly adapt to the most relevant aspects of the actual task. Our agent significantly outperforms the previous state-of-the-art on Atari, averaging 880% expert human performance, and a challenging suite of first-person, three-dimensional Labyrinth tasks leading to a mean speedup in learning of 10× and averaging 87% expert human performance on Labyrinth.

Authors: Max Jaderberg, Volodymyr Mnih, Wojciech Marian Czarnecki, Tom Schaul, Joel Z Leibo, David Silver, Koray Kavukcuoglu
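The idea of maximising many pseudo-reward functions alongside the extrinsic reward, with parameters shared across all tasks, can be written compactly as below. The notation is a sketch and may not match the paper's exactly: $\theta$ are the shared parameters, $\pi$ is the main policy, $\mathcal{C}$ is the set of auxiliary control tasks, $\pi_c$ and $R^{(c)}$ are the policy and return for task $c$, and $\lambda_c$ is the weight on the auxiliary terms.

```latex
\[
\arg\max_{\theta} \;
  \mathbb{E}_{\pi}\!\left[ R_{1:\infty} \right]
  \;+\; \lambda_c \sum_{c \in \mathcal{C}}
  \mathbb{E}_{\pi_c}\!\left[ R^{(c)}_{1:\infty} \right]
\]
```

Because $\theta$ appears in every term, improving an auxiliary policy $\pi_c$ also updates the representation that the main policy $\pi$ relies on.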

The video discusses the use of unsupervised auxiliary tasks to improve reinforcement learning performance in sparse reward environments. The paper proposes using auxiliary tasks such as pixel changes and network features, and evaluates the effectiveness of these tasks in improving learning. The video also discusses the challenges of comparing improvements due to the implementation of multiple techniques.

Key Takeaways

Read the paper 'Reinforcement Learning with Unsupervised Auxiliary Tasks'
Implement auxiliary tasks in a reinforcement learning algorithm
Evaluate the effectiveness of auxiliary tasks in improving learning
Compare the performance of different reinforcement learning algorithms
Analyze the impact of auxiliary tasks on reinforcement learning performance

💡 The use of unsupervised auxiliary tasks can improve reinforcement learning performance in sparse reward environments, but the implementation of multiple techniques can make it harder to compare improvements.
