Challenging Common Assumptions in the Unsupervised Learning of Disentangled Representations
Key Takeaways
The video walks through the paper's argument that unsupervised learning of disentangled representations, studied here with variational autoencoders, is fundamentally impossible without inductive biases on both the models and the data. It introduces autoencoders, the VAE objective, and a simple notion of disentanglement, then illustrates the impossibility theorem with a rotated two-dimensional Gaussian and briefly mentions the paper's large-scale experiments.
Full Transcript
All right everyone, today we're going to look at the paper "Challenging Common Assumptions in the Unsupervised Learning of Disentangled Representations" by Francesco Locatello and a bunch of other people at Google AI, ETH Zurich, and MPI. Full disclaimer: I know these people and I've talked to them about this work, just so you know where I'm coming from. It's a paper that's fairly short to explain, so let's go over it.
The main thing here is what's called disentanglement. Disentanglement is a property, not of your data but of your model, that you would like to have in unsupervised learning, and here especially in generative models. What they focus on is autoencoding. That means I have some data point, which could be an image (let's draw an image here), and I compress it, usually into a vector with a couple of dimensions. This vector is a representation of the data, and from this representation I can produce an image again. If I train an autoencoder, I enforce that my model (both of these parts are my model: this one is called the encoder and this one the decoder) produces a final image that looks like the original image. So an autoencoder is basically a compression algorithm that tries to find a representation from which we can reconstruct the original image.
Here we go a little further, in that we use what's called a variational autoencoder. All of the experiments here use variants of the variational autoencoder (VAE). A variational autoencoder is the same thing as an autoencoder, except that it's a probabilistic framework. Here at the bottom you can see an equation that is basically the objective for the VAE. What it says is: okay, I have an image, let's say this is my image, and I use an encoder, like in an autoencoder, and that gives me a representation. But now I don't use this
representation directly to decode. Instead, this representation is simply the parameters of a bunch of distributions. Let's say I want four latent factors. The latent factors are basically the latent variables that describe the image. If the images are, say, images of cats, the four latent factors could be the fur color of the cat, the size of the cat, its position in the image, and, let's say, the general lighting, how bright the image is. These could be four latent factors that best explain the image, and from which the image could best be reconstructed.
We consider the four latent factors as probability distributions, so our encoder needs to produce eight numbers in this case. Why eight? Because for each of these four distributions we want a mean and a standard deviation. In each pair of numbers, one is going to be the mean and the other the standard deviation of a distribution. From these we construct a distribution: here's the mean, here's the standard deviation, so the distribution looks something like this. Then we sample from this distribution. One sample could be here, one here, one here, and of course we're going to get more samples here in the middle.
So whereas the autoencoder directly uses the encoding to reproduce the image, in the variational autoencoder what the encoder produces is simply a parameterization of a distribution, and that distribution is then sampled. We take one sample from each of these distributions. Because we have eight numbers, we get four distributions, so we sample four different numbers. That gives us a new vector with four entries: one, two, three, four. Well, I didn't
have eight at the beginning, but here this gives us four numbers. These are sampled, so they're going to be different every time, even if we feed in the same image. From this, the decoder tries to reproduce the image.
Again, we force the end image and the beginning image to be close to each other. But since this is a probabilistic framework, we need a different loss function. For the autoencoder, you can simply penalize how far apart the images are in, say, L2 norm. Here, the loss has two distinct parts, and everything is probabilistic, so let's walk through it.
In particular, Q, as you can see here, is the distribution of Z conditioned on X. Z will always be the latent representation of the data, and X will be the data itself, the data point. So Q takes the data point and produces Z. Specifically, what's meant is this thing here: this is Z, whereas this is X, and this is, well, X tilde or something, whatever is produced by the decoder.
What we're going to do is penalize the KL divergence, which measures how far one probability distribution is from another. We measure it between the distribution of Z given X and the prior over Z. So P of Z here is the prior distribution over Z, and in VAEs the prior is often taken to be a Gaussian. Our default assumption about the Z variables is that they're Gaussian, and we force the encoder to come up with encodings that, over the data set in general, are Gaussian and conform to our prior. (Here it says a specific prior P of Z; I didn't mean to cross that out.) So this second term forces the encoder to produce things that are Gaussian, and specifically, if our prior is, let's say, a zero-mean,
unit-variance Gaussian, it's going to enforce that.
The first term here is different. The first term makes the image that was input to the variational autoencoder and the image that is output close together. Again, this is a probabilistic loss, so we take expectations (the KL divergence is also an expectation, by the way). We take expectations over P of X, which is the distribution of the data, and also over Q, which is again our encoding mechanism. We maximize the log probability of the data given the Z variables, which is equivalent to minimizing the negative log-likelihood, which you might be familiar with. This is an expectation over Q of Z given X.
What that means is: here we output X tilde, and we want it to be close to X. So we want the probability that our model outputs X, the original input, given this particular Z that it produced, to be high, in expectation over Q of Z given X. It's a bit cryptic, but it means: I input X into Q, I get out Z, and given that Z, the likelihood that the decoder produces X, the same original image, should be high.
So that's the variational autoencoder: I encourage the latent representations to be close to my prior, which is often Gaussian, and I encourage the output to be similar to the input, which I do by raising the likelihood that the output is the input.
All right, cool. So what does that have to do with disentanglement? Disentanglement is a property that I would like to have in my model, which is that these things here (or these things here, however you want to view it), these latent things that my encoder outputs, give me information about the data in a way that's disentangled. What that means is: before, I've
made an example that's already disentangled, where I said: let's say we have images of cats, and the fur color is one variable, the color of the cat's eyes is another, and the position in the image is another. These are all fairly independent, so I can change each latent factor pretty much independently. Here, this could be the fur color; I can change it pretty much independently, and the cat will just have different fur, and so on.
What would be a non-disentangled representation? Let's say one factor encodes the fur of the cat and another encodes the species of the cat. These are highly entangled: the fur color is highly dependent on what species the cat is. You can imagine it as these things being correlated, but it's slightly different. There's no real agreement on what disentanglement means; we just imagine that the data is somehow entangled and we want to pull out these disentangled factors.
The easiest notion of disentanglement that they come up with here is the following (I might want some space). It's an assumption: let's say there's data X, a random variable, and we assume that this data is generated by a bunch of latent variables Z1, Z2, Z3, which are independent. The technical statement is that P of Z, the joint distribution over all of them, factorizes into the product of the P of Z_i. So they are independent, and they independently determine the data X.
Now, what does it mean for my model to have produced a disentangled representation? I have some model which gives me a representation of X, and the representation, as we saw before, could be these things here. And
specifically, what these people do is say: okay, the mean of the distribution that my encoder gives me, that's the representation of X. So this gives you a representation of X, from which you then might want to reconstruct X. But the important thing is: when is the representation disentangled?
The representation is disentangled, in the easiest sense, if the following holds. When I change Z_i, so I introduce a delta to any one of these three, then in the representation of X (if there are three dimensions of Z, let's assume we know that, and we also make the representation three-dimensional) exactly one factor is going to change. So if I change one factor of the true underlying distribution, in which all the latent factors are independent, then only one factor in my representation changes. If that's the case, I can be fairly sure that I've captured the true latent structure of the data.
Say I change Z3, and then only R3 changes. Let's say I have access to the true underlying distribution, and I ask the world to give me a picture of a cat where the fur color is different. I get a data point, I put it through my model, and I get a representation. If, compared with the cat I had before, only one of the factors of my representation changes, then I call it disentangled. Then I can be fairly sure that this dimension of my representation captures the fur color independently of the other factors.
All right, so that's disentanglement. It actually requires access to the true distribution of how the data is generated by the world, which is something you generally don't have. But it's a technical notion, so you can certainly postulate it, and it's a nice framework. And this paper
basically proves that, in general, learning a disentangled representation in that way is impossible if you don't make some a priori assumptions about your data and your model.
So this is the theorem here. We see that P is any generative model which admits this factorization, which is what we talked about: the true underlying generative process is independent in its constituents, meaning there's a bunch of latent variables that, independently of each other, produce a data point. X is the data, the observations. Then there exists an infinite family of bijective functions such that this and this and this hold.
What that means: this condition here basically just says that the distributions agree. It's not exactly this, but let's say the data looks the same; what comes out of the process looks the same. So there are functions that transform the latent distribution into some other distribution, but the two look the same cumulatively.
Then this part here: you'll see the derivative of f_i of u with respect to some u_j, where you'll notice i and j are different. This basically means that the dimensions are entangled. It means that if I take one entry of the function output and differentiate it with respect to a different entry of the input, I get a nonzero derivative, which means that this u_j influences f_i. So I can take the Z, in which everything is independent, meaning the i-th dimension has no influence on the j-th dimension, and transform it into something where that's no longer the case, where the i-th and the j-th dimensions very much are entangled and covary. So I can take the Z where everything is independent and transform it into something where
everything is dependent.
They give a nice example here. Let's say we have Gaussians in two dimensions: one Gaussian here, and, let me see if I can draw this, one Gaussian here. In two dimensions they're completely independent. What you'll find is that the overall distribution has iso-lines like this: it gives you a hump in the middle. Two-dimensionally, you can imagine a bit of a mountain in the middle. This is the output distribution: if you don't know about the underlying factors, you simply see the cumulative distribution, which would be the big P here.
Now we transform this with f, where f is simply a rotation by 45 degrees. So we get two new axes, this and that, and our two Gaussians are transformed onto these. These are not disentangled anymore (well, within the formal notion I can't really say it like this, but it's the easiest way to say it). Now that it's rotated, in terms of the original coordinate system, which goes like this, these very much depend on each other. The i-th dimension and the j-th dimension depend on each other, because if I sample from one of the Gaussians, I now need both coordinates to describe where it is. But the cumulative distribution is still going to look exactly the same. It's again a hump; it's basically an isotropic hump, the same in every direction, so if I rotate it, it looks exactly the same. This is the P here. But now the i-th dimension and the j-th dimension very much influence each other.
Interestingly, if you now look at disentanglement: if I produce data X1 here and data X2 here, and both go through my model and give me a representation of X1 and a representation of X2, then without seeing the underlying structure I have no idea which one
of those two it comes from. So I have basically zero chance, except by a lucky guess, of telling which one it comes from. And there's an infinite family, so I will never find the true underlying distribution here, and therefore I will never be able to satisfy the property that if one of the Z changes, then only one of the factors of my representation changes. If I say, well, obviously this is the case, then I'm going to make one model, and if I say, well, that is the case, I'm going to make a different model. I don't know which one, so I have to choose one, and it could be the other one. In this case I'm bound to be wrong 50% of the time, but with an infinite family I'm bound to be wrong basically every time.
So that's what the theorem basically says: I can't decide on the true underlying distribution. There's an infinite family that transforms every distribution into some other distribution with basically completely opposite entanglement properties. I have to choose one, and I will never choose the right one, because I'm not that lucky. Therefore I can't do representation learning that's disentangled. That's the main claim of the paper.
There are also a lot of experiments here. The paper takes a number of data sets and tests a lot of architectures. Basically they say: just because it's theoretically impossible doesn't mean it's impossible in practice, because we can actually make these underlying assumptions, some assumptions about the data, and then attempt to do disentanglement learning. So they take these data sets, test different VAE architectures on them, and establish where more work should go. That's the rest of the paper, and I encourage you to look at it. I just wanted to give a quick introduction to VAEs and to disentangled representation learning. I wasn't technically correct in every detail, but I
hope that it's enough. Have fun!
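The rotation example from the talk can be checked numerically. The sketch below is an illustration only, not code from the paper: it draws two independent standard Gaussian latents, rotates them by 45 degrees, and compares the empirical covariance before and after. The helper names (`rotate`, `jacobian`, `covariance`), the sample size, and the seed are assumptions chosen for this example. For a standard Gaussian the rotated variable has exactly the same distribution as the original; the covariance comparison is only a sanity check of that fact.

```python
import math
import random


def rotate(z, theta):
    """Rotate a 2-D point z = (z1, z2) by angle theta (radians)."""
    c, s = math.cos(theta), math.sin(theta)
    return (c * z[0] - s * z[1], s * z[0] + c * z[1])


def jacobian(theta):
    """Jacobian of the rotation: entry [i][j] is d f_i / d z_j."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


def covariance(points):
    """Empirical 2x2 covariance matrix of a list of 2-D points."""
    n = len(points)
    m0 = sum(p[0] for p in points) / n
    m1 = sum(p[1] for p in points) / n
    c00 = sum((p[0] - m0) ** 2 for p in points) / n
    c11 = sum((p[1] - m1) ** 2 for p in points) / n
    c01 = sum((p[0] - m0) * (p[1] - m1) for p in points) / n
    return [[c00, c01], [c01, c11]]


rng = random.Random(0)  # fixed seed, chosen arbitrarily for repeatability
theta = math.pi / 4     # 45 degrees, as in the talk's example

# "True" latents: two independent standard Gaussians.
z = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(50_000)]
# Alternative latents: the same samples expressed in rotated coordinates.
z_rot = [rotate(p, theta) for p in z]

cov_z = covariance(z)
cov_rot = covariance(z_rot)
jac = jacobian(theta)

print("cov(z)    =", cov_z)
print("cov(f(z)) =", cov_rot)
print("Jacobian  =", jac)
```

Both covariance matrices should come out close to the identity, while every entry of the Jacobian is nonzero. Each rotated coordinate therefore depends on both original latents even though the distribution an observer sees is unchanged, which is why the two sets of latents cannot be told apart from data alone.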
Original Description
https://arxiv.org/abs/1811.12359
Abstract:
In recent years, the interest in unsupervised learning of disentangled representations has significantly increased. The key assumption is that real-world data is generated by a few explanatory factors of variation and that these factors can be recovered by unsupervised learning algorithms. A large number of unsupervised learning approaches based on auto-encoding and quantitative evaluation metrics of disentanglement have been proposed; yet, the efficacy of the proposed approaches and utility of proposed notions of disentanglement has not been challenged in prior work. In this paper, we provide a sober look on recent progress in the field and challenge some common assumptions.
We first theoretically show that the unsupervised learning of disentangled representations is fundamentally impossible without inductive biases on both the models and the data. Then, we train more than 12000 models covering the six most prominent methods, and evaluate them across six disentanglement metrics in a reproducible large-scale experimental study on seven different data sets. On the positive side, we observe that different methods successfully enforce properties "encouraged" by the corresponding losses. On the negative side, we observe in our study that well-disentangled models seemingly cannot be identified without access to ground-truth labels even if we are allowed to transfer hyperparameters across data sets. Furthermore, increased disentanglement does not seem to lead to a decreased sample complexity of learning for downstream tasks.
These results suggest that future work on disentanglement learning should be explicit about the role of inductive biases and (implicit) supervision, investigate concrete benefits of enforcing disentanglement of the learned representations, and consider a reproducible experimental setup covering several data sets.
Authors:
Francesco Locatello, Stefan Bauer, Mario Lucic, Sylvain Gelly, Bernhard Schölkopf, O
Uploads from Yannic Kilcher · Yannic Kilcher · 8 of 60