Reinforcement Learning with Unsupervised Auxiliary Tasks

Yannic Kilcher · Advanced · 📄 Research Papers Explained · 8y ago

Key Takeaways

The video discusses the paper 'Reinforcement Learning with Unsupervised Auxiliary Tasks', which proposes auxiliary tasks to improve learning in sparse-reward environments, specifically rewarding the agent for changing pixels in its input and for changing its own network features. The agent combines a base A3C agent with these auxiliary tasks, extra value-function learning from a replay buffer, and reward prediction, and the video discusses how bundling several techniques makes it hard to tell where the improvement comes from.

Full Transcript

Hi there. Today we're looking at "Reinforcement Learning with Unsupervised Auxiliary Tasks" by Google. In this paper the authors consider a reinforcement learning task, and I can show you what it looks like. It looks like this kind of maze; this is an example they give where you have to navigate the maze. It's 3D, and you have to navigate from pixel inputs. You have to collect apples and reach the goal, and this gives you rewards. On the left you can see what the agent actually sees; on the right you can see it from a top-down view.

The problem is, of course, that the reward is very sparse, meaning that you have to navigate a lot of maze before you even get a single point. Reinforcement learning has big trouble with this, because it relies on constant reward to notice which actions are good and which are bad. So the authors propose that, in addition to the regular loss that you would have, so your reward, which is this thing, you would also have an additional set of auxiliary tasks. Here c goes over the auxiliary control tasks that you specify. Each of those has a reward, and you're also trying to maximize these, each with some kind of weight. The thing is that the parameters you maximize over control all of the different tasks, so they are partly shared between the tasks. What you're hoping is that by learning to do one thing you also learn to do another thing.

So what's the difference to earlier work? We've seen work like this before, where you do it in more of an autoencoder setting. For example, the agent sees the input on the left here and tries to predict what the next input will be, what the next frame will be. The idea behind this is that if you can accurately predict what the next frame will be, maybe you've learned something useful about the environment. In this work it's different, because now we couple a reward to these tasks, and I can show you here
what the authors propose as additional rewards. Sorry, they're further up, let me go there. Specifically, they consider these two auxiliary control tasks. The first is pixel changes, which means that the agent actively tries to change pixels: it gets a reward for changing the pixels in the input, so it tries to maximize this. It needs to learn "what do I need to do to maximize my pixel changes?", and probably that will be moving around. So it will learn to move around and not move against the wall, because if it moves against the wall the pixels won't change. It will learn to move like a regular human agent would also move, so to speak: not into a wall, not into a dead end or something, such that the pixels always change. Of course it's not perfect. You can also change your pixels quite a bit by simply spinning around in a circle. But this is one of the auxiliary tasks that they augment the agent with.

The other one is network features. It's kind of meta-learning: here you actually reward the agent for changing its own internal activations. The hope is that it learns something about itself, "how can I activate my internal neural network units?", and it gets rewarded for that. So it might want to activate a lot of them and learn how they're activated. With this kind of self-introspection you also hope that it leads to a network that does more sophisticated tasks, or that by trying to get the most pixel changes and the most network features activated you also learn something useful for the actual task. So these are the two tasks they propose.

In addition, and they have a drawing of this over here, they also do a lot of other things. On the top left you can see what's the base agent. This is an A3C agent, meaning that it's an actor-critic: you learn a policy and you learn a value network. We might go over this in a future video; for now just consider this a standard
reinforcement learning agent. You feed its experience into a replay buffer, and out of the replay buffer you do many things. For one, you try to learn these auxiliary tasks. Note that the parameters are shared between all these networks; that's why the auxiliary tasks actually help. You also try to better learn your value function. They call this off-policy learning, because you pause the A3C training for a while and then train the value function some more, just because that helps.

You also try reward prediction here, and the way they do it, as they explain, is with a kind of skewed sampling. Out of all the situations you can be in, the agent will get a reward very, very few times. So what they do is sample out of the replay buffer, out of all the experiences they've had so far, and they sample more frequently the experiences where they've actually gotten a reward. The hope is of course that, if you zoom in here and look at the experience where you actually get an apple, the agent might learn a lot faster: "oh, there's some kind of apple there, and if I move towards it I get a reward." So that's the hope, that you instantly recognize high-reward situations and are not so interested in non-reward situations. Of course this does introduce bias in your sampling, and you might decide for yourself if that's good or bad. Here it seems to work.

There are a lot of experiments on this task, the Labyrinth tasks, and of course, as with research, they reach state of the art; they're much better than anything else. No, I mean, they don't boast that much, so it's actually a fair comparison. They also evaluate on Atari games.

The criticisms that I have are twofold. First of all, the choice of auxiliary tasks is completely up to the implementer, which means that I have to decide, as an implementer of this algorithm, what my auxiliary tasks will be. Here, pixel changes and network features seem like fairly
general tasks that you could apply to a lot of these kinds of problems. But it always comes down to how much knowledge about the task you would like to put into the actor. Here, I mean, you can see it makes sense to have at least the pixel changes as an auxiliary task, but it's questionable how much domain knowledge this already encodes. So the choice of these is certainly something that you have to decide as a human. I think these are good choices: they're not too domain-specific, but they do correspond to visual, moving-around game tasks.

The other criticism, not really a criticism, just a remark, is that they do a lot of things. I mean, the paper is about the auxiliary tasks, but they also do this skewed sampling and the off-policy value learning and so on. Of course you can argue, yeah, this is all done in the other reinforcement learning methods too, that's why it's a fair comparison. I guess it's a philosophical question. If you want to reach state of the art, of course you first have to get a better method, here this would be the auxiliary tasks, this is the new idea, and then implement all the tricks that other people have discovered. That's good, because you reach the highest performance you can get. But the problem is that you make it harder to compare; you make it harder to see where the improvement is coming from. Have you simply chosen better hyperparameters for the reward prediction and things like that? Are there any interactions, maybe, between the auxiliary tasks and the skewed sampling part? All these kinds of things wash out, and it's not really clear where the improvement is coming from.

On the other hand, if you simply take a basic algorithm, like just A3C here on the top left, and you augment it with nothing but these auxiliary tasks on the bottom left, and then you see an improvement, you can be relatively sure it's
due to your new idea. But of course you won't reach any state-of-the-art numbers, because everyone that does A3C also does these tricks. It's a philosophical question. Here I'm standing more on the side of not doing the tricks, or maybe doing both. Yeah, decide for yourself, and have a nice day.
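
The objective described in the transcript, the regular return plus a weighted sum over the auxiliary control tasks, can be written roughly as follows. The notation is an assumption chosen to match the spoken description: θ stands for the shared parameters, C for the set of auxiliary control tasks, R^(c) for the pseudo-reward return of task c, π_c for the policy trained on that task, and λ_c for its weight.

```latex
\max_{\theta} \; \mathbb{E}_{\pi}\left[ R_{1:\infty} \right]
  + \sum_{c \in \mathcal{C}} \lambda_c \, \mathbb{E}_{\pi_c}\left[ R^{(c)}_{1:\infty} \right]
```

Because every term depends on the same θ, improving any auxiliary term also updates the representation used by the main policy, which is the sharing effect the video points to.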

Original Description

https://arxiv.org/abs/1611.05397

Abstract: Deep reinforcement learning agents have achieved state-of-the-art results by directly maximising cumulative reward. However, environments contain a much wider variety of possible training signals. In this paper, we introduce an agent that also maximises many other pseudo-reward functions simultaneously by reinforcement learning. All of these tasks share a common representation that, like unsupervised learning, continues to develop in the absence of extrinsic rewards. We also introduce a novel mechanism for focusing this representation upon extrinsic rewards, so that learning can rapidly adapt to the most relevant aspects of the actual task. Our agent significantly outperforms the previous state-of-the-art on Atari, averaging 880% expert human performance, and a challenging suite of first-person, three-dimensional Labyrinth tasks leading to a mean speedup in learning of 10× and averaging 87% expert human performance on Labyrinth.

Authors: Max Jaderberg, Volodymyr Mnih, Wojciech Marian Czarnecki, Tom Schaul, Joel Z Leibo, David Silver, Koray Kavukcuoglu

The video discusses the use of unsupervised auxiliary tasks to improve reinforcement learning performance in sparse reward environments. The paper proposes using auxiliary tasks such as pixel changes and network features, and evaluates the effectiveness of these tasks in improving learning. The video also discusses the challenges of comparing improvements due to the implementation of multiple techniques.
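
As a rough illustration of the pixel-change task mentioned above, the sketch below computes one pseudo-reward per image patch as the average absolute intensity change between two consecutive frames. The function name `pixel_change_reward`, the plain list-of-lists grayscale frames, and the default patch size of 4 are assumptions made for this example, not details taken from the video.

```python
def pixel_change_reward(prev_frame, next_frame, cell=4):
    """Average absolute intensity change inside each cell x cell patch.

    Frames are 2D lists of grayscale intensities with the same shape
    (an assumed layout). Returns a 2D list with one pseudo-reward per
    patch; rows and columns that do not fill a whole patch are ignored.
    """
    rows = len(prev_frame) // cell
    cols = len(prev_frame[0]) // cell
    rewards = [[0.0] * cols for _ in range(rows)]
    for r in range(rows * cell):
        for c in range(cols * cell):
            change = abs(next_frame[r][c] - prev_frame[r][c])
            # Accumulate the mean change for the patch containing (r, c).
            rewards[r // cell][c // cell] += change / (cell * cell)
    return rewards


# Hypothetical 8x8 frames: only the top-left patch changes.
before = [[0.0] * 8 for _ in range(8)]
after = [row[:] for row in before]
for r in range(4):
    for c in range(4):
        after[r][c] = 1.0

print(pixel_change_reward(before, after))
print(pixel_change_reward(before, before))  # e.g. walking into a wall
```

Identical frames give zero reward everywhere, which matches the intuition from the transcript that walking into a wall changes nothing, while spinning in place would still score highly.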

Key Takeaways
  1. Read the paper 'Reinforcement Learning with Unsupervised Auxiliary Tasks'
  2. Implement auxiliary tasks in a reinforcement learning algorithm
  3. Evaluate the effectiveness of auxiliary tasks in improving learning
  4. Compare the performance of different reinforcement learning algorithms
  5. Analyze the impact of auxiliary tasks on reinforcement learning performance
💡 The use of unsupervised auxiliary tasks can improve reinforcement learning performance in sparse reward environments, but the implementation of multiple techniques can make it harder to compare improvements.
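
One of the extra techniques discussed in the transcript is the skewed sampling used for reward prediction, where rewarding experiences are drawn from the replay buffer more often than they naturally occur. A minimal sketch, assuming a buffer of hypothetical `(observation, reward)` pairs and an illustrative default of half the samples being rewarding:

```python
import random


def sample_skewed(buffer, batch_size, p_rewarding=0.5, rng=random):
    """Draw transitions so that about p_rewarding of them have non-zero reward.

    buffer is a non-empty list of (observation, reward) pairs; this layout
    and the 0.5 default are assumptions made for illustration.
    """
    rewarding = [t for t in buffer if t[1] != 0]
    non_rewarding = [t for t in buffer if t[1] == 0]
    batch = []
    for _ in range(batch_size):
        want_rewarding = rng.random() < p_rewarding
        # Fall back to the other pool if the preferred one is empty.
        if (want_rewarding and rewarding) or not non_rewarding:
            pool = rewarding
        else:
            pool = non_rewarding
        batch.append(rng.choice(pool))
    return batch


# Hypothetical buffer: 99 empty corridor steps and a single apple.
buffer = [("corridor", 0.0)] * 99 + [("apple", 1.0)]
batch = sample_skewed(buffer, 1000, rng=random.Random(0))
fraction = sum(1 for _, reward in batch if reward != 0) / len(batch)
print(f"rewarding fraction in batch: {fraction:.2f}")
```

Uniform sampling from this buffer would return the apple about 1% of the time; the skew raises that to roughly half, at the cost of the sampling bias the video mentions.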
