Reinforcement Learning with Unsupervised Auxiliary Tasks
Key Takeaways
The video discusses the paper 'Reinforcement Learning with Unsupervised Auxiliary Tasks', which proposes auxiliary tasks to improve learning in sparse-reward environments, using pseudo-rewards such as pixel changes and network feature activations. It walks through the full agent, a base A3C agent extended with auxiliary control tasks, off-policy value learning, and reward prediction with skewed replay sampling, and notes that combining several techniques makes it hard to tell where the improvement comes from.
Full Transcript
Hi there. Today we're looking at "Reinforcement Learning with Unsupervised Auxiliary Tasks" by Google. In this paper the authors consider a reinforcement learning task, and I can show you what it looks like. It looks like this kind of maze. This is an example they give where you have to navigate the maze. It's 3D, and you have to navigate from pixel inputs. You have to collect apples and reach the goal, and this gives you rewards. On the left you can see what the agent actually sees; on the right you can see it from a top-down view.

The problem is of course that the reward is very sparse, meaning that you have to navigate a lot of maze before you even get a single point. Reinforcement learning has big trouble with this, because it relies on constant reward to notice which actions are good and which are bad. So the authors propose that, in addition to the regular loss you would have, so your reward, which is this thing, you would also have an additional set of auxiliary tasks. Here c goes over the auxiliary control tasks that you specify. Each of those has a reward, and you're also trying to maximize these, each with some kind of weight here. The thing is that the parameters you maximize over control all of the different tasks, so they are partly shared between the tasks. What you're hoping is that by learning to do one thing you also learn to do another thing.

So the difference between this and, let's say, work we've seen before, where you do it in more of an autoencoder setting: for example, the agent sees the input on the left here and tries to predict what the next input will be, what the next frame will be. The idea behind this is that if you can accurately predict what the next frame will be, maybe you've learned something useful about the environment. In this work it's different, because now we couple a reward to these tasks, and I can show you here
what the authors propose as additional rewards. Sorry, they're further up, let me go there. Specifically, they consider these two auxiliary control tasks. The first is pixel changes, which means that the agent actively tries to change pixels: it gets a reward for changing the pixels in the input, so it tries to maximize this. It needs to learn "what do I need to do to maximize my pixel changes?", and probably that will be moving around. So it will learn to move around and not move against the wall, because if it moves against the wall the pixels won't change. It will learn to move the way a regular human agent would also move, so to speak: not into a wall, not into a dead end or something, such that the pixels always change. Of course it's not perfect. You can also change your pixels quite a bit by simply spinning around in a circle. But this is one of the auxiliary tasks that they augment the agent with.

The other one is network features. It's kind of meta-learning: here you actually reward the agent for changing its own internal activations. The hope is that it learns something about itself: "how can I activate my internal neural network units?", and it gets rewarded for that. So it might want to activate a lot of them and learn how they're activated. With this kind of self-introspection you also hope that it leads to a network that does more sophisticated tasks, or that by nature of trying to get the most pixel changes and the most network features activated you also learn something useful for the actual task.

So these are the two tasks they propose. In addition, they also do a lot of other things, and they have a drawing of this over here. On the top left you can see what's the base agent. This is an A3C agent, meaning that it's an actor-critic: you learn a policy and you learn a value network. We might go over this in a future video, so for now just consider this a standard
reinforcement learning agent. You feed its experience into a replay buffer, and out of the replay buffer you do many things. For one, you try to learn these auxiliary tasks. Note that the parameters are shared between all these networks; that's why the auxiliary tasks actually help. You also try to better learn your value function. They call this off-policy learning, because you pause the regular training for a while and then train the value function some more, just because that helps.

You also try a reward prediction here, and the way they do it, as they explain, is in a kind of skewed sampling way. Out of all the situations you can be in, the agent will have a reward very, very few times. So what they do is simply sample out of the replay buffer, out of all the experiences they've had so far, and they sample more frequently the experiences where they actually got a reward. The hope is of course that, if you look at the experience here where you actually get an apple, the agent might learn a lot faster: "there's some kind of apple there, and I move towards it to get a reward". So that's the hope, that you instantly recognize high-reward situations and are not so interested in non-reward situations. Of course it does introduce bias in your sampling, and you might decide for yourself if that's good or bad. Here it seems to work.

There are a lot of experiments on these Labyrinth tasks, and of course, as with research, they reach state of the art; they're much better than anything else. No, I mean, they don't boast that much, so it's actually a fair comparison. They also evaluate on Atari games.

The criticisms that I have are twofold. First of all, the choice of auxiliary tasks is completely up to the implementer, which means that I have to decide, as an implementer of this algorithm, what my auxiliary tasks will be. Here, pixel changes and network features seem like fairly
general tasks that you could apply to a lot of these kinds of problems. But it always comes down to how much knowledge about the task you would like to put into the actor. Here you can see it makes sense to have at least the pixel changes as an auxiliary task, but it's questionable how much domain knowledge this already encodes. So the choice of these is certainly something that you have to decide as a human. I think these are good choices: they're not too domain specific, but they do correspond to visual, moving-around game tasks.

The other criticism, which is not really a criticism, just a remark, is that they do a lot of things. The paper is about the auxiliary tasks, but they also do this skewed sampling and the off-policy value learning and so on. Of course you can argue that this is all done in other reinforcement learning work too, and that's why it's a fair comparison. I guess it's a philosophical question. If you want to reach state of the art, you first of all have to get a better method, which here would be the auxiliary tasks, this is the new idea, and then implement all the tricks that other people have discovered. That is good, because you reach the highest performance you can get. But the problem is that you make it harder to compare; you make it harder to see where the improvement is coming from. Have you simply chosen better hyperparameters for the reward prediction and things? Are there any interactions, maybe, between the auxiliary tasks and the skewed sampling part? All these kinds of things wash out, and it's not really clear where the improvement is coming from.

On the other hand, if you simply take a basic algorithm, like just A3C here on the top left, and you augment it with nothing but these auxiliary tasks on the bottom left, and then you see an improvement, you can be relatively sure it's
due to your new idea. But of course you won't reach any state-of-the-art numbers, because everyone that does A3C also does these tricks. It's a philosophical question. Here I am standing more on the side of not doing the tricks, or maybe doing both. Decide for yourself, and have a nice day.
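To make three of the ideas from the talk a bit more concrete (a pixel-change pseudo-reward, a weighted sum of main and auxiliary returns, and skewed sampling of rewarding experiences from a replay buffer), here is a minimal Python sketch. It is not the paper's implementation. The function names, the dictionary layout of a transition, the plain mean-absolute-difference reward, and the default `p_rewarding=0.5` are all assumptions made for illustration only.

```python
import random


def pixel_change_reward(prev_frame, next_frame):
    """Mean absolute per-pixel difference between two equally sized
    grayscale frames (lists of rows). Hypothetical pseudo-reward: larger
    when the agent's action changed more of what it sees."""
    total = 0.0
    count = 0
    for row_prev, row_next in zip(prev_frame, next_frame):
        for p, q in zip(row_prev, row_next):
            total += abs(q - p)
            count += 1
    return total / count if count else 0.0


def combined_objective(main_return, aux_returns, aux_weights):
    """Main-task return plus a weighted sum of auxiliary-task returns.
    aux_returns and aux_weights are dicts keyed by (assumed) task name."""
    return main_return + sum(
        aux_weights[name] * aux_returns[name] for name in aux_returns
    )


def skewed_sample(replay_buffer, rng, p_rewarding=0.5):
    """Draw one transition from a non-empty buffer, picking a rewarding one
    with probability p_rewarding (an illustrative default), so that rare
    rewarding experiences are seen far more often than uniform sampling
    would allow. Falls back to whichever group is non-empty."""
    rewarding = [t for t in replay_buffer if t["reward"] != 0]
    non_rewarding = [t for t in replay_buffer if t["reward"] == 0]
    if rewarding and (not non_rewarding or rng.random() < p_rewarding):
        return rng.choice(rewarding)
    return rng.choice(non_rewarding)


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy buffer: 99 transitions without reward, one with reward.
    buffer = [{"reward": 0}] * 99 + [{"reward": 1}]
    picks = [skewed_sample(buffer, rng) for _ in range(1000)]
    print("fraction rewarding:", sum(t["reward"] for t in picks) / len(picks))
    print("pixel change:", pixel_change_reward([[0, 0]], [[1, 3]]))
```

The skew is what introduces the sampling bias mentioned in the talk: the buffer above holds one rewarding transition in a hundred, but the sampler returns it about half the time.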
Original Description
https://arxiv.org/abs/1611.05397
Abstract:
Deep reinforcement learning agents have achieved state-of-the-art results by directly maximising cumulative reward. However, environments contain a much wider variety of possible training signals. In this paper, we introduce an agent that also maximises many other pseudo-reward functions simultaneously by reinforcement learning. All of these tasks share a common representation that, like unsupervised learning, continues to develop in the absence of extrinsic rewards. We also introduce a novel mechanism for focusing this representation upon extrinsic rewards, so that learning can rapidly adapt to the most relevant aspects of the actual task. Our agent significantly outperforms the previous state-of-the-art on Atari, averaging 880% expert human performance, and a challenging suite of first-person, three-dimensional Labyrinth tasks leading to a mean speedup in learning of 10× and averaging 87% expert human performance on Labyrinth.
Authors:
Max Jaderberg, Volodymyr Mnih, Wojciech Marian Czarnecki, Tom Schaul, Joel Z Leibo, David Silver, Koray Kavukcuoglu
Video by Yannic Kilcher