Multi-Task Curriculum Learning in Minecraft

Connor Shorten · Beginner ·🎨 Image & Video AI ·4y ago

Key Takeaways

The video explains multi-task curriculum learning in Minecraft and covers reinforcement learning, goal conditioning, and exploration bonuses. It highlights the benefit of adding a curriculum to training and introduces a bidirectional variant that re-prioritizes tasks whose success probability drops.

Full Transcript

This video will explain multi-task curriculum learning in a complex, visual, hard-exploration domain: Minecraft. As a quick overview of the presentation of this paper, we're going to start off with some quick takeaways, looking at the different curricula that were proposed to discover these items in the Minecraft environment. Then we'll talk about reinforcement learning, curriculum learning, and multi-task learning more generally: this idea of task dependency, where you have to learn to, say, add before you do calculus, and these kinds of dependencies in multi-task learning and reinforcement learning. Then we'll discuss how they measure learnability as the change in success probability, and some problems with just using raw success probability. Then we'll look at their exploration bonus and the issues with goal conditioning. Goal conditioning is where the policy conditions its action prediction not only on the state but also on the goal, which is given as an input, so it's the prediction of the action given the state and the goal as inputs. We'll cover some problems that might arise with doing this in the curriculum learning setting, or generally with any kind of multi-task, evolving-curriculum style of training these agents. Then we'll talk more about the Minecraft environment and this Simon Says task for how you craft these items and learn to discover new items in the Minecraft game. Then, coming back to the quick takeaways, we'll recap the curricula that are tested and how they use the intra-episode and then cross-training exploration bonus in addition to the Simon Says goal-conditioned curriculum learning. Combining exploration with this goal of curriculum learning is a very interesting overall framework, and they list some potential extensions for future work. Then we'll recap the scope of this extremely exciting research on POET, the Paired Open-Ended Trailblazer
algorithm, how it's extended to the more general AI-generating algorithms, and then, overall, thinking about where this might lead. These ideas of open-ended algorithms are really exciting, especially now with the resources of OpenAI and thinking about what could happen with this. This figure provides a great high-level overview of the different curriculum learning and exploration strategies that are explored in this paper. We start off with this red line, where the agent only discovers 17 out of the 107 items in Minecraft, and the Simon Says set of items to discover is chosen by uniform sampling. What uniform sampling means is that at the start of each episode the agent is given, with equal probability, one out of the 107 items as the goal, and it will receive a reward only for reaching the item given as the goal. This is compared with the exploration bonus. With the exploration bonus, even if the goal is, say, craft a diamond or build a sword (I don't really know too much about Minecraft), even if it's this more ambitious task, the agent will still receive some reward initially from picking at the dirt and collecting dirt, or picking at a rock and doing these easier tasks. It'll still receive some reward and have some learning signal to update its parameters and learn what it's doing a little bit. Then the yellow dotted line is this dynamic exploration bonus. If the Minecraft agent is just picking at the dirt and getting dirt, it shouldn't just keep getting the exploration bonus for that, so the dynamic exploration bonus adjusts this based on the difficulty of the overall items within the exploration bonus. Throughout this, the main reward is always from doing the Simon Says task, which could be go get the diamond or whatever is on the cutting edge of the curriculum that's being designed, and then
it has this exploration bonus as it discovers other kinds of items along the way, building up its inventory to craft these things, which is how the game of Minecraft works. Then this is showing the benefit of adding the curriculum. In the yellow line and the green line, it's still a uniform sampling of the goal for the goal-conditioned policy of the Simon Says task. Now we have an actual learning curriculum, where the frontier of the items, the most ambitious items, are being presented as the goal. Then you see this catastrophic forgetting problem. This is a notorious problem in deep learning where if you train a model on task A, then task B, then task C, by the time it's learned task C it will no longer be able to achieve task A. It's one of those interesting things about deep learning that doesn't make too much sense, but the catastrophic forgetting problem is very evident here: you see it in these huge dips in performance. The dotted blue line that performs the best combines the bidirectional curriculum, so when performance on a task drops you also assign that task as the Simon Says task, with the exploration bonus, and we'll get into more details of this in this video. Here's another detail about the study, listed at the bottom of the caption of this figure: each run takes 21 days on 32 GPUs. I think it's always interesting to see the computation time requirements. 32 GPUs for 21 days is what it takes to run each of these, and it looks like 50,000 optimization steps for each of these episodes of Minecraft. Now let's break down the high-level motivation of the study in this overall framework of curriculum learning. Curriculum learning is a really intuitive idea about learning, where you obviously need to learn to add, subtract, and multiply before you do more complex things like calculus or learning about gradient descent, these
ideas of these natural things, that you learn to walk before you run, high school physics before college-level physics, this idea of having tasks that build upon each other. But interestingly, in deep learning, say in supervised learning, you just have these expert demonstrations of what the top level looks like, and then you usually just sample a large batch. Even if some examples aren't really learnable, you have this large batch, so at least there's some learning signal in it, and that's how you get by without curriculum learning. Then there's also this interesting problem, again, of catastrophic forgetting, where it's not even that the tasks build on each other: the model might have forgotten the old task. So it's interesting to think about these stepping stones of learning tasks and how our deep neural network models don't really adhere to this kind of learning schedule. You'd think any model that can, say, summarize a scientific article with the abstractive sequence-to-sequence supervised learning framework should also have a basic sense of natural language inference, but really, if you evaluated the model on that, it would have no sense of natural language inference. So tasks often vary in difficulty and depend on each other, such that learning easier tasks first may help with learning more difficult tasks later, and curriculum learning can speed up learning by focusing on the next best task to learn, narrowing the distribution of tasks being trained on to those that are currently learnable. Instead of having this long training schedule, maybe we can improve the efficiency so it wouldn't take 21 days just to run this experiment. More generally, training, say, a GPT-3 model or these vision transformers also takes a long time, and maybe curriculum learning is the answer for how we can avoid these long training schedules. So again there's this idea
relating the problem of exploration to these expert demonstrations in supervised learning. Probably most interesting is a sentence at the end of this paragraph: if the model relies purely on imitating humans, its maximum performance will be limited by the best demonstrations in our data set, and even the combination of all the best demonstrations that humanity has to offer probably will not move the model far beyond human performance. So this idea of having superintelligence, or some kind of superhuman performance on tasks like, in this case, Minecraft, or maybe even more open-ended tasks in the visual and natural language domains that we'll look at next: if the approach sidesteps the exploration problem, all we'll be able to achieve is the ceiling of these supervised demonstrations. Compare that with something like AlphaGo and how it plays against itself. It leaves behind this idea of just supervised learning, imitation learning, and behavioral cloning, and instead has this open-ended idea of exploration. That's the idea behind curriculum learning: trying to find these stepping-stone objectives for exploring the environment and finding some way to do something new, like novelty search and intrinsic motivation. When we're thinking about tasks, we can think about things like reward functions and environment distributions. Say we have the POET experiment with the bipedal walking agents: we might describe the environment distribution as the meta-parameters that render the environments the agents walk on, so how bumpy the terrain is, how many hills there are, how many ditches there are, how frequent they are, how deep they are, whether the agent has to jump over things, and if there are hurdles like that, how tall they are, and so on. These would be the meta-parameters of the environment distribution. Or say you have the paper from OpenAI on solving a Rubik's Cube with a robot hand,
where the environment parameters are things like the physics coefficients of the cube or the visual parameters like the lighting and the color of the cube. These are the kinds of parameters that would make up an environment distribution. In this particular case of Minecraft, we're looking at tasks with identical environment distributions: you're always in Minecraft, and you probably have the same distribution of whether you're in the jungle or the woods or the beach or something like that. But then there are different rewards: Simon Says gives you a different reward depending on which item you're trying to achieve. They make the analogy with other kinds of tasks, say writing programs like Codex, summarizing books like GPT-3, answering questions like GPT-3, or generating images like maybe Image GPT or StyleGAN or one of these things. They all have these universal environments of vision or language. I think it's interesting; I've never seen that framing of vision or language as a universal environment where you're performing these different tasks. So it's an interesting way of thinking about reward functions and environment distributions, and about what really makes up the framework of reinforcement learning: environment, states, actions, reward signals, and all these ideas. To come back to the focus on curriculum learning, the naive solution to trying to learn this Minecraft world of crafting items would be to try to learn all tasks simultaneously by uniformly sampling the task, so with equal probability you assign any of the items as being what the agent will receive a reward for achieving. Compared to that, with curriculum learning we're going to try to organize the order in which you receive a reward for crafting different items, as well as having the
explicit instruction, through the goal conditioning, to go and get this reward. A key technical detail of these curriculum learning algorithms is how to define learnability: what is the technical proxy that we use to say that one task is more learnable than another, and so should be at the frontier of the curriculum that we're presenting to the agent? The authors are going to use the change in success probability, meaning the difference in success probability between different time steps, which we'll discuss more later. But here are the problems with just using raw success probability compared to the change in success probability. First, the initial state of the environment could be randomized, and this could highly bias the signal. Say the task is to collect some particular kind of object, and you respawn right next to the object to begin the episode, compared to where you respawn or are initialized far away and then have to go and find the area as well as whatever else you have to do. So raw success probability generally depends on the task, the initial state, and the learning stage of the agent. Second, you could have a stochastic environment, so some tasks may have an intermediate success probability, like 50% success, but aren't learnable because it's completely random whether you're successful or not. Then, with respect to measuring the change in success probability, you have these two ideas. One is the delta t: if it's too small, a small time step or a small number of episodes that have been attempted, then you just have the slope of the noise, and if it's too large, then you're going to miss the recent changes in the success probability. The other idea is to have an exponential moving average of the trend rather than just individual snapshots, since you could have a noisy
estimate, and having this exponential moving average helps give a better sense of the learning curve. These are two very important details of practically implementing these curriculum learning ideas. With this strategy of inferring learnability, the authors are now going to have a sampling function where 90% of the weight is on the 20% of the tasks with the largest re-weighted learning progress. I recommend checking out the paper for more details on the re-weighted learning progress. The delta t and the exponential moving average are the details of how you're inferring learnability from the change in success probability as you're training the model. Now that we have a sense of how the learning curriculum is defined with the change in success probability, we'll get into some issues with goal conditioning and how we have to introduce this exploration bonus to further help with learning this curriculum of tasks. Goal-conditioned learning is where the reinforcement learning agent predicts its action given as input the state as well as the goal. The goal in this case is a one-hot encoded vector that represents one of the 107 Minecraft items, so some vector like zero, zero, one, zero, zero, zero, zero is given as input along with the state to help the model guide itself to this particular goal. The problem is that when the agent first receives a new one-hot encoded vector, it has no embedding for it. It has no idea how to use it and doesn't know how it relates to the previously learned behaviors, so it doesn't know how to interpret the new goal. Therefore you need some other way of smoothing this out and introducing the new objective to the curriculum and the learning of the agent. As previously mentioned, the solution is to add an exploration bonus, so if the agent is being tasked
with making a diamond, or making a diamond helmet, and it achieves something like getting dirt or a bowl or a ladder or logs or planks, it'll still receive some reward for doing that. The way this reward is dynamic is that if you just keep getting dirt or just keep crafting bowls, the reward you receive will decay exponentially, or along some curve like that. That's how they structure this exploration bonus, which helps with transitioning to new goals. Here's a table that shows all 107 Minecraft items that the agent is asked to get throughout training. Before going back to the results: Minecraft is a very interesting environment for reinforcement learning. We've previously seen environments like OpenAI Gym or MuJoCo or, say, the Atari games, but the Minecraft environment does seem to be pretty different. It has visual processing, and while something like Atari also has a visual pixel frame, this has a spatial awareness where you have the 360 degrees around the agent as you look around the Minecraft environment. Then there's the idea of inferring causality and conducting experiments, where you have to figure out how to craft the items and which items go together in the crafting, I think. Again, I don't know too much about Minecraft, but it does seem to be a fundamentally different platform for doing this research with reinforcement learning agents. Then there's this tech tree with many dependencies, which is an interesting idea that I think connects back to the Codex idea, and there are really interesting ideas about building on this tech tree. One other interesting detail is that they have a maximum episode length of 30 minutes, 9,000 time steps; this is how long the agent can explore the environment to achieve the Simon Says reward. If you're completely unfamiliar with the Minecraft
environment, I highly recommend watching this video that the authors of the paper have published. It shows what the Minecraft task is, how the agent collects these different things and crafts these items, and how it explores this Simon Says task, and it really looks into the episodes of the agent, so it's a really interesting visualization of these experiments. Here's a quick reminder of the different learning strategies that are tested in this paper. The red line is uniform sampling, where you uniformly sample a Simon Says task and only achieving this item will result in any reward for the agent. The green dotted line is where you still uniformly sample the main goal for the goal-conditioned policy, but you also have an exploration bonus, so if it finds dirt or crafts a bowl, these easier tasks, I imagine, it'll still receive some reward in the beginning. The difference between fixed and dynamic is that with the fixed bonus the agent may get stuck in a local optimum where it just keeps crafting bowls because it keeps receiving the reward, even though there's an exponential decay; you can have different hyperparameters that control that decay, and so on, if you drill into the details of this. The dynamic exploration bonus is where it looks not only within the episode, with the decay as it keeps crafting bowls, but also across the run, so after it's done 10,000 optimization steps you would adjust, say, the bias on the decay or the initial reward for a bowl, dirt, a gold sword, or whatever else there is. Then the learning curriculum is structuring the Simon Says task for the goal-conditioned policy. The light blue is the problem of catastrophic forgetting, where as it keeps learning these things it has forgotten how to do the earlier items. And then this is the bidirectional progress, where if it forgets substantially, you pivot back and have the forgotten task go back on the frontier, using the magnitude of the change in success probability, not
only positive changes. Some more details on the results: these plots color the success probability for each of these different tasks throughout training, with uniform sampling never achieving any success on these ones, and the bidirectional curve and then the unidirectional curve reaching deeper into the tech tree of the Minecraft items. This is the plot showing the sampling: uniform sampling obviously selects all of the tasks uniformly, compared to how the learning curriculum samples the tasks. Then this plot shows overall how this is being evaluated. The agent doesn't need to achieve, say, this golden sword over and over again; it just needs to do it with some probability of success. This is coloring the probability of success for the top agent, which does achieve 82 of these items. The red ones it doesn't achieve; the green ones it achieves with greater than five percent success. So it's a pretty low bar, but it's still interesting that it achieves any of these items, and overall it seems pretty incredible that it's able to navigate this Minecraft environment and achieve them. So this is showing that low success probabilities are what's needed to qualify as adding to the items discovered in these experiments. As some background to contextualize why these experiments are so exciting, Jeff Clune published this really interesting paper titled AI-GAs: AI-generating algorithms. This is the idea of having an AI that can generate AI, an algorithm that produces intelligence, sort of like the Earth simulation idea. It's a little more detail-oriented than, say, the open-ended algorithms idea, which more generally describes these ideas of novelty search and intrinsic motivation, and asks what kind of algorithm would be interesting if you left it running for a million years. So not ImageNet optimization; it would be something like POET or these Minecraft environments, or
the simulation of Earth, which is the motivating example behind these open-ended algorithms. The AI-generating algorithms are described as having three pillars: meta-learning the architectures, meta-learning the learning algorithms themselves, and generating effective learning environments. I think of this Minecraft work as belonging to the third pillar, generating effective learning environments. Other experiments there include POET, where you have this co-evolutionary framework between the agents that are learning to control a bipedal walker and the environments that they walk on, so you're generating the learning environment where the agent does its learning to control the bipedal walking robot. Then you have other things like generative teaching networks and also the synthetic petri dish, where you generate the training data. You're generating the supervised learning data, and it looks nothing like real MNIST digits, but it's used to train a model that's evaluated on the MNIST data set. Also related, though not necessarily a learning environment, and more on the second pillar or the first pillar of meta-learning architectures, is work trying to find a solution to catastrophic forgetting: some kind of neural architecture search that could design an architecture that avoids this problem, which is an interesting component of these curricula and sequential task learning and so on. So overall these are really interesting ideas about how an algorithm can create AI and what's necessary for building intelligence. Thank you so much for watching this explanation of multi-task curriculum learning in a complex, visual, hard-exploration domain: Minecraft. This is a really exciting idea for combining a learning curriculum based on the change in success probability with this intra-episode and inter-episode exploration bonus, where you have this additional reward
signal for just discovering items, which decays over time and also depends on the current learning stage of the agent, as well as the Simon Says task and the main goal of doing something challenging like crafting a gold sword. The game of Minecraft in particular is a really exciting environment for testing these reinforcement learning algorithms, and overall I expect really exciting things to come out of this research. Thanks for watching, and please subscribe to Henry AI Labs for more deep learning and AI videos.
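
The learnability measure and sampling rule described in the transcript can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the names `LearningProgressTracker`, `sampling_weights`, and `sample_task`, the two smoothing factors, and the even split of probability mass inside each group are all choices made here, and the paper's re-weighting of learning progress is not reproduced.

```python
import random


class LearningProgressTracker:
    """Tracks per-task success rates with a fast and a slow moving average."""

    def __init__(self, num_tasks, fast_alpha=0.1, slow_alpha=0.01):
        # Smoothing factors are illustrative assumptions, not values from the paper.
        self.fast = [0.0] * num_tasks
        self.slow = [0.0] * num_tasks
        self.fast_alpha = fast_alpha
        self.slow_alpha = slow_alpha

    def update(self, task, success):
        """Fold one episode outcome (True or False) into both averages."""
        outcome = 1.0 if success else 0.0
        self.fast[task] += self.fast_alpha * (outcome - self.fast[task])
        self.slow[task] += self.slow_alpha * (outcome - self.slow[task])

    def learning_progress(self, task):
        """Recent success rate minus long-run success rate."""
        return self.fast[task] - self.slow[task]


def sampling_weights(progress, top_fraction=0.2, top_mass=0.9):
    """Put `top_mass` of the probability on the `top_fraction` of tasks
    with the largest progress, split evenly within each group (assumption)."""
    n = len(progress)
    k = max(1, round(top_fraction * n))
    if k >= n:
        return [1.0 / n] * n
    ranked = sorted(range(n), key=lambda i: progress[i], reverse=True)
    top = set(ranked[:k])
    return [top_mass / k if i in top else (1.0 - top_mass) / (n - k)
            for i in range(n)]


def sample_task(weights, rng=random):
    """Draw one task index according to the curriculum weights."""
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]
```

The gap between the two averages plays the role of the delta t mentioned above: a larger gap in smoothing factors looks further back in training, while a smaller one reacts faster but is noisier.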

Original Description

Notion Link: https://ebony-scissor-725.notion.site/Henry-AI-Labs-Weekly-Update-July-15th-2021-a68f599395e3428c878dc74c5f0e1124
Thanks for watching! Please Subscribe!
Chapters:
0:00 Introduction
0:06 Overview
4:42 Curriculum Learning
9:45 Defining Learnability
11:40 Goal-Conditioning and Exploration Bonus
13:14 Minecraft
14:34 Recap of Learning Strategies
17:10 AI-GAs

This video explains multi-task curriculum learning in Minecraft, covering reinforcement learning, goal conditioning, and exploration bonuses. It highlights the benefits of adding a curriculum to the learning process and introduces a bidirectional approach that returns forgotten tasks to the curriculum. Viewers come away with a picture of how learning progress, goal conditioning, and exploration rewards fit together when training a multi-task agent.

Key Takeaways

Train a goal-conditioned reinforcement learning agent
Feed the goal to the policy as a one-hot vector alongside the state
Add an exploration bonus that rewards discovering easier items
Compare fixed and dynamic exploration bonuses
Measure learnability as the change in success probability
Use re-weighted learning progress to prioritize tasks
Bring forgotten tasks back into the curriculum with bidirectional learning progress
Use the Minecraft environment as a testbed for multi-task curriculum learning
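
Two of the takeaways above, goal conditioning and the decaying exploration bonus, can be illustrated with a short sketch. It is hedged throughout: `one_hot_goal`, `policy_input`, `ExplorationBonus`, the base reward of 1.0, and the decay factor of 0.5 are hypothetical choices for illustration, the state is a stand-in list of numbers rather than real Minecraft observations, and only the within-episode decay is shown, not the dynamic variant that also adjusts across training.

```python
def one_hot_goal(item_index, num_items=107):
    """Encode the Simon Says target item as a one-hot vector."""
    vec = [0.0] * num_items
    vec[item_index] = 1.0
    return vec


def policy_input(state_features, item_index, num_items=107):
    """Concatenate state features and goal so the policy sees both."""
    return list(state_features) + one_hot_goal(item_index, num_items)


class ExplorationBonus:
    """Rewards each newly collected item, decaying on repeats within an episode."""

    def __init__(self, base_reward=1.0, decay=0.5):
        # Both values are illustrative assumptions.
        self.base_reward = base_reward
        self.decay = decay
        self.counts = {}

    def reset(self):
        """Call at the start of every episode."""
        self.counts = {}

    def reward(self, item):
        """Return the bonus for collecting `item` once more this episode."""
        seen = self.counts.get(item, 0)
        self.counts[item] = seen + 1
        return self.base_reward * self.decay ** seen
```

With these defaults, collecting dirt three times in one episode pays 1.0, then 0.5, then 0.25, so repeating an easy item quickly stops being worthwhile while a first-time item still pays the full bonus.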

💡 The bidirectional approach, which scores tasks by the magnitude of the change in success probability rather than only by gains, can help mitigate catastrophic forgetting and improve learning outcomes in multi-task curriculum learning.
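
The difference between the unidirectional and bidirectional scores can be shown in a few lines. This is a sketch only: the function names, the item names, and the success rates are made up for illustration, and each task is summarized by just two numbers, a recent and an earlier success probability.

```python
def unidirectional_progress(recent, earlier):
    """Only improvements count; a drop in success scores zero."""
    return max(0.0, recent - earlier)


def bidirectional_progress(recent, earlier):
    """Magnitude of the change, so forgotten tasks score highly too."""
    return abs(recent - earlier)


def rank_tasks(success_rates, score_fn):
    """Order tasks from highest to lowest score.

    `success_rates` maps a task name to (recent, earlier) success probabilities.
    """
    return sorted(success_rates,
                  key=lambda task: score_fn(*success_rates[task]),
                  reverse=True)


# Hypothetical numbers: "bowl" was learned earlier and has since been forgotten.
rates = {
    "planks": (0.9, 0.9),
    "bowl": (0.2, 0.8),
    "stone_pickaxe": (0.3, 0.1),
}
print(rank_tasks(rates, unidirectional_progress))
print(rank_tasks(rates, bidirectional_progress))
```

Under the unidirectional score the forgotten task ties with a fully mastered one at zero, so the curriculum never returns to it; under the bidirectional score it moves to the front of the ranking.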
