Multi-Task Self-Supervised Learning

Connor Shorten · Beginner · 📄 Research Papers Explained · 6y ago

Skills: Research Methods 90% · Reading ML Papers 80% · Unsupervised Learning 70%

Key Takeaways

The video discusses a research paper on multi-task self-supervised learning, which trains one network on several self-supervised tasks at once to learn a shared representation, improving performance on downstream tasks such as ImageNet classification and object detection. The paper proposes input harmonization and per-task feature masking with a Lasso penalty that encourages each task to use a sparse combination of features, and it trains with a distributed scheme that is synchronous among workers on the same task and asynchronous across tasks.
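
The input conflict that harmonization addresses (colorization needs grayscale input, other tasks normally see RGB) can be illustrated with a minimal sketch. It assumes harmonization means dropping color and replicating the single luminance channel three times so every task receives the same input format; the function name `harmonize` and the luma weights are illustrative assumptions, not details given in the video.

```python
import numpy as np


def harmonize(image_rgb: np.ndarray) -> np.ndarray:
    """Drop color and replicate the luminance channel three times.

    image_rgb: float array of shape (H, W, 3) with values in [0, 1].
    Returns an array of the same shape whose three channels are identical.
    """
    # Rec. 601 luma weights: an assumption for this sketch.
    weights = np.array([0.299, 0.587, 0.114])
    gray = image_rgb @ weights  # shape (H, W)
    return np.repeat(gray[..., None], 3, axis=-1)


rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))  # stand-in for a real image
harmonized = harmonize(image)
print(harmonized.shape)
```

With every task fed the same three-channel grayscale tensor, the shared feature extractor never sees color that the colorization task is supposed to predict.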

Full Transcript

[Music] This video will explain a really interesting study from Google's DeepMind lab on multi-task self-supervised visual learning. So the headline idea is that we have self-supervised tasks like predicting the rotation of an image, like a 90-degree rotation or 180 degrees, and then we have self-supervised tasks like predicting the relative location of patches extracted from the same image. We don't really quite understand exactly how these representations work, but what if we use the same convolutional neural network feature extractor to perform multiple self-supervised tasks? In this paper, the headline idea is that they find that combining these self-supervised tasks always improves performance, in every combination of the tasks studied. The best jointly trained network, trained on multiple self-supervised tasks, learns a representation that does just as well on Pascal object detection as pre-training on the labeled ImageNet dataset. This is much more interesting because these self-supervised tasks can scale up to billions and trillions of images, like what we have on YouTube and Google Images and Instagram and maybe Pinterest too, without anyone needing to label any of these images for tasks like object detection, which is putting a bounding box around objects in an image. You take these representations and then you fine-tune them on the computer vision task that you are trying to solve.

So, self-supervised learning: modern deep neural networks are data-starved, they can fit random labels for large image collections, and there's a limitless supply of unlabeled images available.

So these are the tasks they study in this paper. Relative position: this is where you take two crops from the same image, and it's like an eight-way classification problem where you predict, like, top left, left, bottom left, bottom, bottom right; you predict the relative position of the patches with reference to each other. Colorization is where you have a grayscale image and then you predict the corresponding RGB image. The exemplar task is where you take an image and you perform a ton of data augmentations on it, and then you have a Siamese network. A Siamese network basically says you pass two images through the same feature extractor, and then you have another kind of multi-layer perceptron or something like that at the end that connects these two features and performs some task; in the exemplar task it's like, is this still the same image even though it's been augmented like crazy? And then motion segmentation is this idea of predicting, using video frames, like the next video frame. This is one of my favorite tasks because it seems intuitive that the way animals and humans might learn visual representations is by predicting the next frame in video, because we're always processing these video frames.

So, multi-task learning: the idea here is that you have the image input and then you have the feature extractor, and then you pass the same features to these different task heads. So this could be the rotation head; these are parameters specific to the self-supervised tasks, so these parameters would be only for rotation and these would be only for colorization. Oh, I'm sorry, they also do test rotation as a self-supervised task, and I forgot to put that on the slide.

So a previous study on multi-task learning combined seven supervised tasks where there are labeled datasets, but still there's no free lunch in this, and they found the best results using just two out of the seven tasks rather than using all of the tasks available. So there are some problems with combining tasks that the authors mention. One is that the input channels can conflict: for example, when colorization is the task you necessarily have to pass these, like, 224 by 224 by 1 input images; you can't have the RGB input. Another one is that learning tasks might conflict, like semantic categorization would be different from instance matching. For example, you might need fine-grained details to tell the difference between a golden retriever and some other specific dog breed, whereas you wouldn't need that kind of detail for a semantic segmentation task where you're trying to label some pixels as dog and other pixels as grass or sky or ocean or something.

So the first solution they propose is input harmonization, and this is how they sync up the inputs to their network. And then this is a really interesting idea; the results from it are a little disappointing, but this idea is probably going to be around for a while because it's a very powerful idea. What it is, basically, is that you have the shared feature representation, but the tasks are not all going to share the same features; rather, they're going to apply these masks to the intermediate features. So the rotation task isn't just going to take the same features as the colorization task; rather, it's going to learn a set of weights as to which features it's going to use from the shared representation. So here's a much simpler picture of it: imagine this matrix is our features, like five, four, six, five, something like this. Task one might have this kind of mask, so it wouldn't look at these features at all, and then task two might have this kind of mask, putting extra emphasis on this one or something. So the idea is that each task would have its own feature mask, and it would learn its feature mask during training. This doesn't work too well in this study, but it does show an almost 1% improvement when they do this for evaluation, so for ImageNet classification they would use a separate mask over the features than for detection or segmentation. And then they also impose this lasso penalty, where they're trying to encourage the combination to be sparse, and I'm not quite sure yet in my personal understanding of these sparsity constraints in neural networks. So again, they're also going to use this with the evaluation tasks: each evaluation task is going to have a feature mask rather than all just using the same features for detection, classification, and then depth prediction, which is what they use in this paper.

So for the multi-task architectures, there's the common trunk, where they all share the same features, and then there's the lasso, where they have different features. For some intuition on this, it might be like if we're sharing a feature representation between player A and player B: player A's gradient might be like, add five to this value, and player B might say, oh, minus five to this, and then they'll just do this on and on and on, and it won't really result in any kind of interesting learning. And so this is kind of like the mixture-of-experts idea, how you can have a very big representation and then different tasks that have different ways of accessing that intermediate representation.

Also, they implement this with a distributed training scheme, and it's pretty interesting: they find that purely asynchronous training is unstable, so they use a hybrid approach where they're basically synchronous when the workers are doing the same task. So, say workers one through ten are doing rotation prediction: they'll wait until all ten of them are finished with their gradients and then they'll update the network, but they wouldn't wait for, like, a colorization worker to be finished.

So they're going to evaluate the self-supervised representations on ImageNet classification, Pascal VOC detection, and NYU depth prediction; these are the datasets and tasks. The evaluation procedure is to take the last block, and they're going to test just using the same features for all tasks, and then also applying a task-specific mask.

So these are the results from the individual tasks, no multi-task learning. This is ImageNet, the fully supervised benchmark, and this is top-5 accuracy, not top-1, in case you're surprised at the high accuracy of all this. So here you see colorization performs the best, but they're all, you know, relatively similar on all the tasks, except that in depth prediction the ImageNet features aren't really too useful. So this is just the table, just for reference, for how they all perform relative to each other on these different tasks: ImageNet top-1, ImageNet top-5, object detection, and then NYU depth prediction.

So then this is the most interesting result of the paper: these are the results from combining the tasks. Most interestingly, they get the best result from just combining all of the tasks: rotation prediction, colorization, exemplar, and motion segmentation, and adding motion segmentation does provide a slight boost over not using it. And then you see a ten percent improvement over just doing rotation prediction, and really interestingly you see that they've basically closed the gap on object detection with ImageNet pre-training. This is especially interesting because you could do this with, like, billions of images; you don't need any labels for this. But they're still pretty far behind on ImageNet classification, though it's sort of unfair because, I mean, this is the exact task that they're testing it on.

So these are the harmonization and lasso results, the two novel modifications they propose. The result isn't that interesting, but it's still a really cool idea, and they do get a little bit of a good result with the evaluation-only lasso.

So, the concluding thoughts: it's definitely interesting to see how combining tasks outperforms single tasks. You can almost think of adding maybe a generative adversarial network to this, and you can just think of being creative and figuring out what other self-supervised learning tasks can be derived; maybe visual question answering could be integrated with this. There are definitely going to be more self-supervised tasks that come out soon, I would predict. The improvement from the single task, 59% to 69%, is really interesting with the multi-task addition. It's still far from using the labels, but you can imagine using a larger dataset, a larger model, or maybe even using neural architecture search and then optimizing the model for self-supervised learning. So, self-supervised multi-task learning, and then there's this lasso feature-masking idea; this definitely seems like an idea of the future. It's a really interesting idea to wrap your head around. So thanks for watching this video on multi-task self-supervised learning from DeepMind. Please subscribe to Henry AI Labs for more deep learning videos. Thanks for watching.
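
The per-task feature mask and lasso penalty described in the transcript can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the paper's implementation: the shared trunk is assumed to expose four feature blocks, each task holds one weight per block, and all names (`trunk_features`, `mask`, `task_input`, `lasso_penalty`) and numbers are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: the shared trunk exposes 4 feature blocks of width 8.
num_blocks, width = 4, 8
trunk_features = rng.normal(size=(num_blocks, width))

# One weight per block, per task. In training these would be learned;
# the values here are made up to show two tasks reading different blocks.
mask = {
    "rotation": np.array([1.0, 0.5, 0.0, 0.0]),
    "colorization": np.array([0.0, 0.0, 0.7, 1.2]),
}


def task_input(task: str) -> np.ndarray:
    """Weighted sum of trunk blocks that one task head receives."""
    return mask[task] @ trunk_features  # shape (width,)


def lasso_penalty(task: str, strength: float = 0.1) -> float:
    """L1 penalty on the mask weights; added to that task's loss, it
    pushes unneeded weights toward zero so the combination stays sparse."""
    return strength * float(np.abs(mask[task]).sum())


for name in mask:
    print(name, task_input(name).shape, lasso_penalty(name))
```

A zero weight means the task ignores that block entirely, which is the "wouldn't look at these features at all" case from the slide.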

Original Description

Self-Supervised Learning tasks have been able to produce very useful visual representations for downstream tasks like ImageNet classification. This video explains a study that attempts to combine self-supervised tasks to learn representations through learning multiple tasks at once! This result is very interesting! Thanks for watching! Please Subscribe! Paper Link: https://arxiv.org/pdf/1708.07860.pdf

This video teaches the concept of multi-task self-supervised learning and its application to downstream tasks like object detection and depth prediction. The paper proposes several techniques intended to improve multi-task learning, including input harmonization and a Lasso penalty on per-task feature masks. Viewers get an overview of how such multi-task self-supervised experiments are set up and how to read the results.

Key Takeaways

Choose a set of self-supervised tasks to combine
Implement input harmonization to sync up inputs to the network
Apply feature masking so each task learns its own weights over the shared features
Impose Lasso penalty to encourage sparse combination
Use a distributed training scheme with a hybrid approach
Evaluate the performance of multi-task learning on downstream tasks
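
The hybrid training step in the list above can be sketched in a single process. This is only an illustration of the update rule, assuming plain SGD and hypothetical gradient values: gradients are averaged across all workers assigned to one task (the synchronous part), and each task's update is applied as soon as its own group is done, without waiting for other tasks (the asynchronous part). The function name `apply_task_update` is an assumption.

```python
import numpy as np


def apply_task_update(params, worker_grads, lr=0.1):
    """Average the gradients from all workers on ONE task, then take one
    SGD step. Call this once that task's workers have all reported,
    without waiting for workers assigned to other tasks."""
    mean_grad = np.mean(worker_grads, axis=0)
    return params - lr * mean_grad


params = np.zeros(3)

# Hypothetical arrival order: the rotation group finishes first,
# then the colorization group.
rotation_grads = [np.array([1.0, 0.0, 0.0]), np.array([3.0, 0.0, 0.0])]
colorization_grads = [np.array([0.0, 2.0, 0.0])]

params = apply_task_update(params, rotation_grads)
params = apply_task_update(params, colorization_grads)
print(params)
```

A real distributed setup would add a parameter server and per-task barriers; the sketch only shows which gradients get averaged together.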

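💡 Combining self-supervised tasks can learn a representation that does just as well on Pascal object detection as pre-training on the labeled ImageNet dataset, without labeling any of the pre-training images; on ImageNet classification it still trails supervised pre-training.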
