Multi-Task Self-Supervised Learning

Connor Shorten · Beginner ·📄 Research Papers Explained ·6y ago

Key Takeaways

The video discusses a DeepMind research paper on multi-task self-supervised learning, which trains one shared network on several self-supervised tasks at once. The combined representation improves on single-task representations for downstream tasks such as ImageNet classification and Pascal VOC object detection. The paper also proposes input harmonization and per-task feature masking with a lasso penalty that encourages a sparse combination of features, and it trains with a distributed scheme that is synchronous among workers on the same task and asynchronous across tasks.

Full Transcript

This video will explain a really interesting study from Google's DeepMind lab on multi-task self-supervised visual learning. We have self-supervised tasks like predicting the rotation of an image, such as a 90-degree or 180-degree rotation, and we have self-supervised tasks like predicting the relative location of patches extracted from the same image. We don't really understand exactly how these representations work, but what if we use the same convolutional neural network feature extractor to perform multiple self-supervised tasks? The headline idea of the paper is that combining these self-supervised tasks always improves performance, in every combination of the tasks studied. The best jointly trained network, trained on multiple self-supervised tasks, learns a representation that does just as well on Pascal object detection as pre-training on the labeled ImageNet dataset.

This is much more interesting because these self-supervised tasks can scale up to billions and trillions of images, like what we have on YouTube, Google Images, Instagram, and maybe Pinterest too, without anyone needing to label any of these images for tasks like object detection, which is putting a bounding box around objects in an image. You take these representations and then you fine-tune them on the computer vision task you are trying to solve.

So, self-supervised learning: modern deep neural networks are data starved, they can fit random labels for large image collections, and there's a limitless supply of unlabeled images available.

These are the tasks they study in this paper. Relative position is where you take two crops from the same image and predict where one sits relative to the other. It's an eight-way classification problem where you predict top left, left, bottom left, bottom, bottom right, and so on: the relative position of the patches with reference to each other. Colorization is where you have a grayscale
image and then you predict the corresponding RGB image. The exemplar task is where you take an image and perform a ton of data augmentations on it, and then you have a Siamese network. A Siamese network basically means you pass two images through the same feature extractor, and then you have something like a multi-layer perceptron at the end that connects these two sets of features and performs some task. In the exemplar task it's: is this still the same image, even though it's been augmented like crazy? Then motion segmentation is this idea of using video frames to predict the next video frame. This is one of my favorite tasks, because it seems intuitive that the way animals and humans might learn visual representations is by predicting the next frame in video, since we're always processing these video frames.

So, multi-task learning. The idea here is that you have the image input, then you have these feature extractors, and you pass the same features to these different task heads. These are parameters specific to each self-supervised task, so these parameters would be only for rotation and these would be only for colorization. Oh, I'm sorry, they also test rotation as a self-supervised task and I forgot to put that on the slide.

A previous study on multi-task learning combined seven supervised tasks, where there are labeled datasets, but there's still no free lunch: they found the best results using just two of the seven tasks rather than all of the tasks available. There are some problems with combining tasks that the authors mention. One is that the input channels can conflict. For example, when colorization is the task you necessarily have to pass 224 by 224 by 1 input images; you can't have the RGB input. Another is that the learning tasks might conflict: semantic categorization would be different from
instance matching. For example, you might need fine-grained details to tell the difference between a golden retriever and some other specific dog breed, whereas you wouldn't need that kind of detail for a semantic segmentation task where you're trying to label some pixels as dog and other pixels as grass or sky or ocean.

The first solution they propose is input harmonization, which is how they sync up the inputs to their network. Then there is a really interesting idea. The results from it are a little disappointing, but this idea is probably going to be around for a while because it's a very powerful one. Basically, you have the shared feature representation, but the tasks are not all going to use the same features; rather, they apply masks to the intermediate features. So the rotation task isn't just going to take the same features as the colorization task. It's going to learn a set of weights for which features it uses from the shared representation.

Here's a much simpler picture. Imagine this matrix is our features, like five, four, six, five, something like this. Task one might have this kind of mask, so it wouldn't look at these features at all, and task two might have this kind of mask, putting extra emphasis on this one. So each task would have its own feature mask, and it would learn its feature mask during training. This doesn't work too well in this study, but it does show an almost 1% improvement when they do this for evaluation. So ImageNet classification would get a separate mask from the detection or segmentation features. They also impose this lasso penalty, which encourages the combination to be sparse. I'm not quite sure yet about my personal understanding of sparsity constraints in neural networks. So again, they're also going
to use this with the evaluation tasks: each evaluation task is going to have a feature mask rather than all using the same features for detection, classification, and depth prediction, which are the tasks they use in this paper.

So for the multi-task architectures there's the common trunk, where they all share the same features, and then there's the lasso, where they have different features. Some intuition for this: if we're sharing a feature representation between player A and player B, player A's gradient might say add five to this value and player B's might say minus five, and they'll just do this on and on, and it won't really result in any kind of interesting learning. So this is kind of like the mixture-of-experts idea, where you can have a very big representation and then different tasks that have different ways of accessing that intermediate representation.

They also implement this with a distributed training scheme, and it's pretty interesting: they find that purely asynchronous training is unstable, but it works with a hybrid approach where the workers are synchronous when they are doing the same task. So if workers one through, let's say, ten are doing rotation prediction, they'll wait until all ten of them are finished with their gradients and then update the network, but they wouldn't wait for a colorization worker to be finished.

They evaluate the self-supervised representations on ImageNet classification, Pascal VOC detection, and NYU depth prediction. These are the datasets and tasks. The evaluation procedure is to take the last block, and they test using the same features for all tasks and then applying the task-specific masks.

So these are the results from the individual tasks, with no multi-task learning. This is ImageNet, the fully supervised benchmark, and this
is top-five accuracy, not top-one, in case you're surprised at the high accuracy of all this. Here you see colorization performs the best, but they're all relatively similar across the tasks, except in depth prediction, where the ImageNet features aren't really too useful. So this table is just a reference for how they all perform relative to each other on these different tasks: ImageNet top-one, ImageNet top-five, object detection, and NYU depth prediction.

Then these are the most interesting results of the paper, the results from combining the tasks. Most interestingly, they get the best result from just combining all of the tasks: rotation prediction, colorization, exemplar, and motion segmentation. Adding motion segmentation does provide a slight boost over not using it. You also see a ten percent improvement over just doing rotation prediction. And really interestingly, you see that they've basically closed the gap with ImageNet pre-training on object detection. This is especially interesting because you could do this with billions of images; you don't need any labels for this. They're still pretty far behind on ImageNet classification, but that's sort of unfair, because this is the exact task that they're testing it on.

These are the harmonization and lasso results, the two novel modifications they propose. The result isn't that interesting, but it's still a really cool idea, and they do get a little bit of a good result with the evaluation-only lasso.

So the concluding thoughts: it's definitely interesting to see how combining tasks outperforms single tasks. You can think of adding maybe a generative adversarial network to this, and you can think of being creative and figuring out what other self-supervised learning tasks can be derived. Maybe visual question answering could be
integrated with this. There are definitely going to be more self-supervised tasks that come out soon, I would predict. The improvement from the single task, 59% to 69%, is really interesting with the multi-task addition. It's still far from using the labels, but you can imagine using a larger dataset, a larger model, or maybe even using neural architecture search and optimizing the model for self-supervised learning. So that's self-supervised multi-task learning, and then there's this lasso feature masking idea. This definitely seems like an idea of the future; it's a really interesting idea to wrap your head around. Thanks for watching this video on multi-task self-supervised learning from DeepMind. Please subscribe to Henry AI Labs for more deep learning videos. Thanks for watching.
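
The per-task feature mask and lasso penalty described in the transcript can be illustrated with a small sketch. This is only a simplified reading of the spoken explanation, not the paper's exact formulation: the function names, the mask values, and the penalty strength of 0.1 are all assumptions made for illustration, and in a real system the masks would be learned during training rather than written by hand.

```python
def apply_mask(features, mask):
    """Weight each shared feature by the task's own mask value."""
    if len(features) != len(mask):
        raise ValueError("features and mask must have the same length")
    return [f * m for f, m in zip(features, mask)]


def lasso_penalty(mask, strength):
    """L1 penalty: strength times the sum of absolute mask weights.

    Adding this to a task's loss pushes mask weights toward zero,
    which encourages each task to use a sparse subset of features.
    """
    return strength * sum(abs(m) for m in mask)


# Shared features from the common trunk (the "five, four, six, five" example).
shared_features = [5.0, 4.0, 6.0, 5.0]

# Hypothetical hand-written masks, one per task.
task_masks = {
    "rotation": [1.0, 0.0, 0.0, 1.0],      # ignores the middle features
    "colorization": [0.0, 2.0, 1.0, 0.0],  # extra emphasis on one feature
}

for task, mask in task_masks.items():
    masked = apply_mask(shared_features, mask)
    print(task, masked, lasso_penalty(mask, strength=0.1))
```

Each task sees a different view of the same trunk output, so one task's updates to its mask do not have to fight another task's updates over identical features.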

Original Description

Self-Supervised Learning tasks have been able to produce very useful visual representations for downstream tasks like ImageNet classification. This video explains a study that attempts to combine self-supervised tasks to learn representations through learning multiple tasks at once! This result is very interesting! Thanks for watching! Please Subscribe! Paper Link: https://arxiv.org/pdf/1708.07860.pdf


This video teaches the concept of multi-task self-supervised learning and its application to downstream tasks like object detection and depth prediction. The paper proposes several techniques to improve the performance of multi-task learning, including input harmonization and Lasso penalty. By watching this video, viewers can learn how to design and implement multi-task self-supervised learning experiments and analyze the results.
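
To make the idea of a self-supervised task concrete, here is a minimal sketch of how rotation-prediction training examples could be generated from an unlabeled image. The four-way labeling (0, 90, 180, 270 degrees) and the function names are assumptions for illustration; the transcript only mentions 90-degree and 180-degree rotations as examples. The image is a plain 2D list so the sketch needs no libraries.

```python
def rotate90(image):
    """Rotate a 2D grid (list of rows) 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def rotation_examples(image):
    """Return (rotated_image, label) pairs for 0, 90, 180, 270 degrees.

    The label is the number of 90-degree turns applied, so the
    supervision comes from the image itself and needs no human labels.
    """
    examples = []
    current = [list(row) for row in image]
    for label in range(4):
        examples.append((current, label))
        current = rotate90(current)
    return examples


tiny_image = [[1, 2],
              [3, 4]]
for rotated, label in rotation_examples(tiny_image):
    print(label, rotated)
```

A network trained to predict the label from the rotated image has to pick up on object orientation, which is the kind of free training signal the video refers to.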

Key Takeaways
  1. Choose a set of self-supervised tasks to combine
  2. Implement input harmonization to sync up inputs to the network
  3. Apply feature masking to learn a set of weights
  4. Impose Lasso penalty to encourage sparse combination
  5. Use a distributed training scheme with a hybrid approach
  6. Evaluate the performance of multi-task learning on downstream tasks
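💡 Combining self-supervised tasks can learn a representation that does about as well on Pascal VOC object detection as pre-training on the labeled ImageNet dataset, without needing to label any of the images. It still trails supervised pre-training on ImageNet classification itself.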
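
Step 5 in the list above, the hybrid training scheme, can be sketched as follows. This is a toy single-process illustration under stated assumptions, not the paper's distributed implementation: the class name `HybridUpdater`, the plain averaged gradient step, and the worker counts are all hypothetical. The point it shows is that gradients are buffered until every worker for one task has reported, while other tasks are never waited on.

```python
from collections import defaultdict


class HybridUpdater:
    """Synchronous within a task, asynchronous across tasks (toy sketch)."""

    def __init__(self, params, workers_per_task, lr):
        self.params = list(params)
        self.workers_per_task = workers_per_task
        self.lr = lr
        self.pending = defaultdict(list)  # task name -> buffered gradients

    def submit(self, task, grad):
        """Buffer one worker's gradient; return True if an update was applied."""
        self.pending[task].append(grad)
        if len(self.pending[task]) < self.workers_per_task[task]:
            return False  # still waiting on other workers of the same task
        grads = self.pending.pop(task)
        for i in range(len(self.params)):
            mean_grad = sum(g[i] for g in grads) / len(grads)
            self.params[i] -= self.lr * mean_grad
        return True


updater = HybridUpdater([0.0, 0.0], {"rotation": 2, "colorization": 2}, lr=0.5)
print(updater.submit("rotation", [1.0, 2.0]))      # waits for second rotation worker
print(updater.submit("colorization", [4.0, 4.0]))  # buffered, does not block rotation
print(updater.submit("rotation", [3.0, 2.0]))      # rotation group complete, update applied
print(updater.params)
```

In this run the rotation update is applied as soon as both rotation workers report, even though a colorization gradient is still waiting in its own buffer.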
