Multi-Task Self-Supervised Learning
Key Takeaways
The video discusses a research paper on multi-task self-supervised learning, which trains a single shared representation on several self-supervised tasks at once and improves performance on downstream tasks such as ImageNet classification and object detection. The paper proposes input harmonization and feature masking with a Lasso penalty that encourages each task to use a sparse combination of features. It trains with a distributed scheme that is synchronous among workers on the same task and asynchronous across tasks.
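As a rough illustration of the shared-representation idea summarized above, the following NumPy sketch passes one set of shared features to several task-specific heads. It is not the paper's architecture: the single linear layer, the sizes, and the names `trunk`, `run_heads`, and `HEAD_DIMS` are all assumptions made for this example. Only the eight-way output of the relative position head comes from the video.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
BATCH, IN_DIM, FEAT_DIM = 4, 32, 16
# Relative position is an eight-way classification in the video;
# the colorization output size here is an assumption.
HEAD_DIMS = {"relative_position": 8, "colorization": 3}


def trunk(x, w_shared):
    """Shared feature extractor: one linear layer followed by ReLU."""
    return np.maximum(x @ w_shared, 0.0)


def run_heads(features, head_weights):
    """Apply every task-specific head to the same shared features."""
    return {name: features @ w for name, w in head_weights.items()}


# Shared parameters are used by all tasks; each head has its own parameters.
w_shared = rng.normal(size=(IN_DIM, FEAT_DIM))
head_weights = {
    name: rng.normal(size=(FEAT_DIM, dim)) for name, dim in HEAD_DIMS.items()
}

x = rng.normal(size=(BATCH, IN_DIM))  # stand-in for a batch of image inputs
features = trunk(x, w_shared)
outputs = run_heads(features, head_weights)

for name, out in outputs.items():
    print(name, out.shape)
```

In a real setup each head would have its own loss, and gradients from every task would flow back into the shared trunk weights.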
Full Transcript
[Music] This video will explain a really interesting study from Google's DeepMind lab on multi-task self-supervised visual learning. So the headline idea is that we have self-supervised tasks like predicting the rotation of an image, like a 90 degree rotation or 180 degrees, and then we have self-supervised tasks like predicting the relative location of patches extracted from the same image. We don't really quite understand exactly how these representations work, but what if we use the same convolutional neural network feature extractor to perform multiple self-supervised tasks? In this paper, the headline idea is that they find that combining these self-supervised tasks always improves performance, in every combination of the tasks studied. The best jointly trained network, trained on multiple self-supervised tasks, learns a representation that does just as well on Pascal object detection as pre-training on the labeled ImageNet dataset. And this is much more interesting because these self-supervised tasks can scale up to billions and trillions of images, like what we have on YouTube and Google Images and Instagram and maybe Pinterest too, without anyone needing to label any of these images for tasks like object detection, which is putting a bounding box around certain images. You take these representations and then you fine-tune them on the computer vision task that you are trying to solve.
So, self-supervised learning: modern deep neural networks are data-starved, they can fit random labels for large image collections, and there's a limitless, endless supply of unlabeled images available.
So these are the tasks they study in this paper. Relative position: this is where you take two crops from the same image and then you predict, it's like an eight-way classification problem, where you predict top left, left, bottom left, bottom, bottom right, and so on, the relative position of the patches with reference to each other. Colorization is where you have a grayscale image and then you predict the corresponding RGB pair. The exemplar task is where you take an image and perform a ton of data augmentations on it, and then you have a Siamese network. A Siamese network basically says you pass two images through the same feature extractor, and then you have another kind of multi-layer perceptron or something like that at the end that connects these two features and performs some task. In the exemplar task it's like, is this still the same image even though it's been augmented like crazy? And then motion segmentation is this idea of predicting, using video frames, the next video frame. This is one of my favorite tasks because it seems intuitive that the way animals and humans might learn visual representations is by predicting the next frame in video, because we're always processing these video frames.
So, multi-task learning: the idea here is that you have the image input, then you have these feature extractors, and then you pass the same features to these different task heads. So this could be the rotation; these are like specific parameters for the self-supervised tasks, so these parameters would be only for rotation and these would be only for colorization. Oh, I'm sorry, they also do test rotation as a self-supervised task and I forgot to put that on the slide. A previous study on multi-task learning combined seven supervised tasks where there are labeled datasets, but still there's no free lunch in this, and they found the best results using just two out of the seven tasks rather than using all of the tasks available.
So there are some problems with combining tasks that the authors mention. One is that the input channels can conflict: for example, when colorization is the task you necessarily have to pass these 224 by 224 by 1 input images, you can't have the RGB input. Another one is that learning tasks might conflict, like semantic categorization would be different from instance matching. For example, you might need fine-grained details to tell the difference between a golden retriever and some other specific dog breed, whereas you wouldn't need that kind of detail for a semantic segmentation task where you're trying to label some pixels as dog versus labeling other pixels as grass or sky or ocean or something.
So the first solution they propose is input harmonization, and this is how they sync up the inputs to their network. And then there is a really interesting idea; the results from it are a little disappointing, but this idea is probably going to be around for a while because it's a very powerful idea. What it is, basically, is that you have the shared feature representation, but the tasks are not going to share the same features; rather they're going to apply these masks to the intermediate features. So this rotation task isn't just going to take the same features as the colorization task; rather it's going to learn a set of weights as to which features it's going to use from the shared representation. So here's a much simpler idea of this. Let's say, imagine this matrix is our features, like five, four, six, five, something like this. Task one might have this kind of mask, so it wouldn't look at these features at all, and then task two might have this kind of mask, putting an extra emphasis on this or something. So this is kind of the idea: each task would have its own feature mask, and it would learn its feature mask during training. This doesn't work too well in this study, but it does show an almost 1% improvement when they do this for evaluation. So for the ImageNet classification they would use a separate mask for classification features than for detection or segmentation features. And then they also impose this lasso penalty, where they're trying to encourage the combination to be sparse, and I'm not quite sure yet on my personal understanding of these sparsity constraints in neural networks. So again, they're also going to use this with the evaluation tasks: each evaluation task is going to have a feature mask rather than all just using the same features for detection, classification, and then depth prediction, which is what they use in this paper.
So for the multi-task architectures, there's the common trunk, where they all share the same features, and then there's the lasso, where they have different features. Some intuition for this might be: if we're sharing a feature representation between player A and player B, player A's gradient might be like, add five to this value, and player B might say, oh, minus five to this, and then they'll just do this on and on and on, and it won't really result in any kind of interesting learning. So this is kind of like the mixture-of-experts idea, how you can have a very big representation and then different tasks that have different ways of accessing that intermediate representation.
Also, they implement this with a distributed training scheme, and it's pretty interesting: they find that just asynchronous training is unstable, so they go with a hybrid approach where they basically are synchronous when the workers are doing the same task. So if, let's say, workers one through ten are doing rotation prediction, they'll wait until all ten of them are finished with their gradients and then they'll update the network, but they wouldn't wait for, like, a colorization worker to be finished.
So they're going to evaluate the self-supervised representations on ImageNet classification, Pascal VOC detection, and NYU depth prediction; these are the datasets and tasks. The evaluation procedure is to take the last block, and they're going to test just using the same features for all tasks, and then, you know, applying the [unclear].
So these are the results from the individual tasks, no multi-task learning. This is ImageNet, the fully supervised benchmark, and this is top-5 accuracy, not top-1, in case you're surprised at the high accuracy of all this. Here you see colorization performs the best, but they're all, you know, relatively similar, and then on all tasks... except for in the depth prediction, where the ImageNet features aren't really too useful. So this is just the table to reference for how they all perform relative to each other on these different tasks: ImageNet top-1, ImageNet top-5, object detection, and then NYU depth prediction.
So then these are the most interesting results of the paper: the results from combining the tasks. Most interestingly, they get the best result from just combining all of the tasks: rotation prediction, colorization, exemplar, and motion segmentation, and adding the motion segmentation does provide a slight boost over not using it. And then you see a ten percent improvement over just doing rotation prediction, and, really interestingly, you see that they've basically closed the gap on object detection with using ImageNet pre-training. This is especially interesting because you could do this with, like, billions of images; you don't need any labels for this. They're still pretty far off on ImageNet classification, but that's sort of unfair because, I mean, this is the exact task that they're testing it on.
So these are the harmonization and lasso results, the two novel modifications they propose. The result isn't that interesting, but it's still a really cool idea, and they do get a little bit of a good result on the evaluation-only lasso.
So, the concluding thoughts: it is definitely interesting to see how combining tasks outperforms single tasks. You can almost think of adding maybe a generative adversarial network to this, and you can think of being creative and figuring out what other self-supervised learning tasks can be derived; maybe visual question answering could be integrated with this. There's definitely going to be more self-supervised tasks that come out soon, I would predict. Then the improvement from the single task, 59% to 69%, is really interesting with the multi-task addition. It's still far from using the labels, but you can imagine using a larger dataset, a larger model, or maybe even using neural architecture search and then optimizing the model for self-supervised learning, so, self-supervised multi-task learning. And then there's this lasso feature masking idea. This definitely seems like an idea of the future; it's a really interesting idea to wrap your head around.
So thanks for watching this video on multi-task self-supervised learning from DeepMind. Please subscribe to Henry AI Labs for more deep learning videos. Thanks for watching.
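The lasso feature masking idea mentioned in the transcript can be sketched in a few lines of NumPy. This follows the simplified element-wise picture given in the video (shared features "five, four, six, five" and one mask per task); the paper's exact formulation may differ. The mask values, the penalty weight `lam`, and the function names are hypothetical and only serve the illustration.

```python
import numpy as np


def masked_features(shared, mask):
    """Reweight the shared features element-wise with a task's own mask."""
    return shared * mask


def lasso_penalty(mask, lam):
    """L1 penalty on the mask weights, scaled by the strength lam."""
    return lam * np.abs(mask).sum()


# Toy shared features, echoing the "five, four, six, five" example.
shared = np.array([5.0, 4.0, 6.0, 5.0])

# Hypothetical per-task masks; in training these would be learned.
mask_task1 = np.array([1.0, 0.0, 0.0, 1.0])  # ignores the two middle features
mask_task2 = np.array([0.5, 2.0, 0.0, 0.0])  # extra emphasis on the second feature

feats_task1 = masked_features(shared, mask_task1)
feats_task2 = masked_features(shared, mask_task2)
penalty_task2 = lasso_penalty(mask_task2, lam=0.1)

print(feats_task1)
print(feats_task2)
print(penalty_task2)
```

Adding the penalty to a task's loss makes every nonzero mask weight cost something, which is one way to encourage each task to rely on a sparse subset of the shared features.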
Original Description
Self-Supervised Learning tasks have been able to produce very useful visual representations for downstream tasks like ImageNet classification. This video explains a study that attempts to combine self-supervised tasks to learn representations through learning multiple tasks at once! This result is very interesting!
Thanks for watching! Please Subscribe!
Paper Link: https://arxiv.org/pdf/1708.07860.pdf
Video by Connor Shorten