Lec 19. Transfer Learning: Data
Key Takeaways
This video explores transfer learning with data, covering generative models, data augmentation, and counterfactual reasoning, with tools such as GANs, word2vec, and StyleGAN.
Full Transcript
We're going to finish up our short series on transfer learning today. One logistical thing first: I know project proposals are due this Friday. During our instructor meeting today, we put together a rubric for those proposals to give you a more structured understanding of what we'll be evaluating. That's up on Piazza. It's out of 10, with five different things that are each worth two points. The other thing I wanted to say is that this is, shockingly, already my last lecture of the semester. I don't know where the time goes. It has been really awesome getting to lecture you this semester, and I will probably see you towards the end of the semester. Anyway, time flies. Cool. So today we're continuing that thread on how we can transfer knowledge, specifically within the constraints of these deep learning models or systems. Today we're going to talk about how we can transfer knowledge about the inputs to the system. We're going to spend some time on generative models as another way to capture information about data inputs. Then we're also going to talk a little bit about models that learn to learn: this idea of meta-learning, and how you can build models that are almost designed to be good at transfer learning, designed to be good at sharing information or learning new things. Cool. So if we think about knowledge as knowledge about the inputs, one way to think about this is: if you know something about the distribution of inputs you expect, maybe there's a way to use that to direct or accelerate your learning, or even to directly target a specific type of learning.
So you have a data set, right? We spent quite a few lectures with Phil talking about the ability to learn how to generate data from that same distribution, this idea of generative modeling. That gives you access to a model that is able to sample from that distribution. At first glance this might not seem that interesting. You had data, and then you used it to train a model that mimics the data you already had. So how does it actually help? How are you going to get more information out of a system that's been trained to model a specific distribution? One of the interesting ways to think about this, assuming the generative model is very good, is that the model itself is a new mechanism to access data, one that goes beyond the capabilities of the initial training data set. So maybe you call it data++. What is the additive value you're getting from a data generator that you might not get from the data itself? Here's an interesting quote from the release of the first Stable Diffusion model. They said that this release is the culmination of many hours of collective effort to create a single file that compresses the visual information of humanity into a few gigabytes. That sounds pretty awesome, right? Essentially it's the idea that a generative model is a really cool form of data compression. You can take that generative model and go from some latent variable to, supposedly, any visual information of humanity, and that is a shocking compression of everything.
If you think about even just the storage capacity of all the images on the internet versus the storage capacity of a Stable Diffusion model, that's a significant difference in terms of what you're capturing. So how do we think about what data++ means, and how do we think about data as an object? We start with X, the set of the original training data points x; Z, the set of latent variables z that correspond to the original training data, as learned by our generative model; and the mapping from Z to X, which takes a latent variable to a generated data point. That's G, the generative model we've learned. If you also have G inverse, the inverse mapping that lets you go from data to its associated latent variable, then you can define operators over those objects that let us do things to the combined system of data plus generator that we couldn't easily do to the data itself. For example, you can think about interpolation over a data set, with a pretty standard definition of interpolation. But now suppose you have one of these objects (a data set X, a latent space, a generator from that latent space to data, and the inverse), and you have another one for a possibly different data set or a different generator. You can think about interpolating between the two, so that some interpolation between those two things generates a new one. You can think about manipulation: how you might add some bias term that would again give you a whole new kind of object for accessing and interacting with data. You can also think about composition.
How you compose two things is of course different from interpolating between two things. Once you're in this space of different data products or data objects, you can think about really combining them and manipulating them in interesting ways. You can also think about optimization: how you might explicitly find some set of generator, data, latent space, and inverse generator that's optimal along some given dimension of optimality. You can imagine bringing that to bear on many different research or problem domains that people are interested in, like graphics, visualization, data augmentation, and counterfactual reasoning. All of this work is still somewhat in progress. These are just some examples of papers that start getting at these mechanisms of interpolation, manipulation, composition, or optimization over generated data objects. So if we think about this space of generative models, this is now just visualizing what we were talking about. We have some set of latent variables. These are the controls, the way that you control what you want to generate, and maybe you've put reasonable constraints on your system so you can ensure that Z is normally distributed. Then you have some generative model G that will take any point in that latent space and synthesize an image. If you sampled a different point in latent space, you would get a different image. So given the relationship between movement in latent space and the image that's generated, you can think about this as a mechanism of control.
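To make the "latent variables as controls" picture a bit more concrete, here is a minimal sketch. It assumes a 2-D latent space with a standard normal prior, and `toy_generator` is a hypothetical stand-in for G (a fixed nonlinear map, not a trained model). The names `interpolate` and `manipulate` are illustrative only and mirror the interpolation and manipulation operators described above.

```python
import math
import random

def sample_latent(dim, rng):
    """Draw z ~ N(0, I), the prior assumed here over the latent space."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]

def toy_generator(z):
    """Hypothetical stand-in for G: maps a 2-D latent to a 3-D 'data point'."""
    return [math.tanh(z[0]), math.tanh(z[1]), math.tanh(z[0] + z[1])]

def interpolate(z_a, z_b, t):
    """Interpolation operator: the point a fraction t of the way from z_a to z_b."""
    return [(1.0 - t) * a + t * b for a, b in zip(z_a, z_b)]

def manipulate(z, direction, alpha):
    """Manipulation operator: shift z by alpha along a chosen latent direction."""
    return [zi + alpha * di for zi, di in zip(z, direction)]

rng = random.Random(0)
z_a = sample_latent(2, rng)
z_b = sample_latent(2, rng)

# Any latent point, including interpolated or shifted ones, decodes to a sample.
midpoint_sample = toy_generator(interpolate(z_a, z_b, 0.5))
shifted_sample = toy_generator(manipulate(z_a, [1.0, 0.0], 0.5))
```

With a real generator, the only change would be swapping `toy_generator` for the trained model; the operators themselves act purely on latent vectors.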
So maybe if you move along some controllable snippet of the manifold of natural images, you can explicitly do things like find some dimension in your latent space that corresponds to something like pose for this bird, or orientation. Maybe there are different dimensions of control in latent space that correspond specifically to disentangled factors of variation in the space of those generated images. And so if you can find... Yeah? >> Just on that slide, are the dimensions assumed to be independent? >> And is that usually the case? So your question is whether we're supposing that these dimensions in the latent variable space are linearly independent. I think the assumption here is that there are some decomposable directions, maybe under some projection, where you can find orthogonality, and that orthogonality then corresponds to some very clean, specific variables of variation in the real image space. In practice, particularly for certain categories of generative models, this is sometimes true, but it's not necessarily easy to find. People will do this cherry-picked thing where they say, hey, we found a dimension where if we move along it, you can see the image change from day to night. We'll talk more about some of these, but I think one of the challenges, particularly as these latent spaces get higher dimensional, is that explicitly disentangling these dimensions of variation in a clean way can be very difficult and highly heuristic. Cool. So if you have this kind of model that takes you from data to what we're calling data++, you can now map from your actual image into the latent space itself.
So this is that G inverse. Now you have the ability to ask interesting counterfactual questions of the form "what would it look like if...", assuming that you have these measurable and actionable dimensions of variation. You could ask what it would look like if we moved around in this latent space, and how these different dimensions correspond to different things within the manifold of real images. There's some work showing that you can improve your ability to categorize something by building an ensemble over different generated variations of that input. You get improved accuracy and robustness by taking a real input image, figuring out where it corresponds to in latent space, moving around nearby in the neighborhood of that point to find different poses, different orientations, or slightly different realistic manipulations of that input image, and then taking an ensemble. It's almost like you're building robustness into your categorizer based on interesting dimensions of data augmentation, almost like test-time augmentation within the generator itself. Which is interesting, right? There's some nice work showing that exploring dimensions of latent space can give you something more than the original training data. Though I think an important point here is that there's a lot of knowledge baked into this. The assumption that something nearby in that space should reasonably have the same category, and not break a category boundary in some significant way, seems important. I often try to point out some of the assumptions that are built in and some of the ways they might fail. If you are just trying to understand something like "cat", this is probably very reasonable.
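The ensembling idea described above can be sketched roughly as follows. Everything here is a toy assumption: `encode` and `generate` are identity maps standing in for G inverse and G, `classifier_prob_cat` is a fixed logistic rule, and the neighborhood `radius` is an arbitrary choice. The point is only the shape of the procedure: encode, perturb nearby in latent space, decode, classify, and average.

```python
import math
import random

def encode(x):
    """Hypothetical G inverse: in this toy, an input and its latent coincide."""
    return list(x)

def generate(z):
    """Hypothetical G: identity map standing in for a trained generator."""
    return list(z)

def classifier_prob_cat(x):
    """Toy classifier: probability of 'cat' from a fixed logistic rule."""
    return 1.0 / (1.0 + math.exp(-(2.0 * x[0] - x[1])))

def ensemble_prob_cat(x, n_views=32, radius=0.1, seed=0):
    """Average the classifier over the input and nearby generated variations."""
    rng = random.Random(seed)
    z = encode(x)
    probs = [classifier_prob_cat(x)]
    for _ in range(n_views):
        # Small Gaussian step around z: assumed to stay within the same category.
        z_near = [zi + rng.gauss(0.0, radius) for zi in z]
        probs.append(classifier_prob_cat(generate(z_near)))
    return sum(probs) / len(probs)

x = [0.3, -0.2]
single = classifier_prob_cat(x)
ensembled = ensemble_prob_cat(x)
```

The `radius` parameter is where the category assumption lives: if it is too large for the task (for example, fine-grained breeds rather than "cat"), the perturbed views may cross a class boundary.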
If you're trying to understand something about a breed of cat, these dimensions of variation might start to get more confusing in terms of breaking that class boundary. To me, as someone who spends a lot of time thinking about animals, the first image and the one in the middle actually look like different breeds. That difference in morphology, with that extra length of the ears: if that fine-grained level of categorization is what you're interested in, this might actually be more confusing and potentially break that robustness. So there are always some fundamental assumptions about what you're actually trying to do built into which dimensions of variation are good or bad for a given problem. Then there's this question of how you discover these dimensions in a latent space. One interesting thing is that there are ways you can do it experimentally. You take an image of something, and an image of the same thing at night, for example. You look at where they both map to in latent space, and you calculate the direction vector that goes from one to the other. Now you've calculated a direction that should correspond to this specific type of counterfactual. Then you can think about explicitly finding things like dimensions of variation that zoom out on an image, zoom in on an image, brighten it, or darken it. In this case they were able to find latent directions corresponding to specific shifts in the image, which can then be used to control images, and ones that are input agnostic, right?
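As a rough illustration of the experimental recipe just described, the sketch below averages the latent offset over a few paired examples and then applies it to a new latent. The latent values are made up for illustration, and `mean_direction` and `apply_direction` are hypothetical helper names. In practice the latents would come from G inverse applied to real image pairs.

```python
def mean_direction(latent_pairs):
    """Average the latent offset (target - source) over paired examples."""
    dim = len(latent_pairs[0][0])
    total = [0.0] * dim
    for z_src, z_tgt in latent_pairs:
        for i in range(dim):
            total[i] += z_tgt[i] - z_src[i]
    return [t / len(latent_pairs) for t in total]

def apply_direction(z, direction, alpha=1.0):
    """Move a latent point by alpha along the discovered direction."""
    return [zi + alpha * di for zi, di in zip(z, direction)]

# Hypothetical latents for (day, night) versions of three scenes.
pairs = [
    ([0.0, 1.0, 2.0], [0.0, 1.0, 0.0]),
    ([1.0, -1.0, 3.0], [1.0, -1.0, 1.0]),
    ([-2.0, 0.5, 1.0], [-2.0, 0.5, -1.0]),
]
night_direction = mean_direction(pairs)

# An input-agnostic direction can be applied to a scene that was never paired.
new_scene_at_night = apply_direction([0.5, 0.5, 2.5], night_direction)
```

Averaging over several pairs is one simple way to get a direction that does not depend on a single input. Whether such a direction is cleanly disentangled in a real model is an empirical question.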
So you can find this direction of variation experimentally by, for example, cropping images, and then taking an ensemble of the directions you get for each of those. Here you have these dimensions of zoom. You can find dimensions that correspond to shift, even across different input sizes. You have dimensions that correspond to brightening an image, and corresponding dimensions for darkening an image. This one is interesting because I would say it's really capturing a very specific bias in our data. Clearly people are not taking pictures of volcanoes in the dark that aren't erupting, so when we try to make it dark, the model thinks it needs to erupt. There's this clear bias that's just built into the training data. There are some other cool examples of this for other modalities as well. There's a really famous example from word2vec, which is an embedding model for the semantic meanings of words, where there's a direction in the latent space that corresponds to changing the tense of the word. You can move in that direction and go from swimming to swam, and go from walking to walked by moving in the same direction in that latent space for words. So these types of disentangled dimensions of variation in latent space can be, and have been, discovered for many modalities. And then there are these explicit latent space vectors that disentangle factors of variation. You have things like winter to spring, or turning on the lights, going from day to night.
The volcano eruption vector, though: again, you'll note that you're more likely to get more aggressive-looking eruptions at nighttime than during the day. There's been a lot of research into how to efficiently discover these disentangled dimensions of variation within a learned latent space, so let's look at that in a little more technical detail. You can take a GAN, a generative adversarial network, walk in a straight line in its latent space, and visualize what that looks like. It kind of works, but it can be really cherry-picked. Some of those directions correspond to stuff that just looks like trash, or that is not easily disentangled into something we can interpret. Then there are fancier versions of a GAN, like StyleGAN, where they explicitly noticed that any layer of the model, any of those intermediate representations, also represents a latent space for the model. They determined that there are earlier layers in the model that are better, quote unquote, latent representations than the Z representation, that initial input representation. So we'll talk about why that might be: why it might be better to think about the latent space from somewhere intermediate in the model rather than explicitly with that input latent vector. Say we have some natural image manifold X, some nonlinear data space, and we have a starting point. This is the point on that natural image manifold corresponding to this bird, this blue bird. Maybe we have another point here, corresponding to this fly. If you think about linear interpolation between these two points, arguably you're moving off the manifold of real images, right?
There's stuff at one end and at the other that is more realistic looking, but direct linear interpolation in image space just gives you stuff in the middle that's basically an obvious additive blend of the two images. It doesn't look anything like a realistic natural image. It just doesn't match the statistics of real natural images. So instead, say we've learned this data++ representation, so we have a latent space that's well behaved, and we can map from it directly to that natural image manifold. Then you can imagine that interpolating linearly in the latent space would correspond to a nonlinear interpolation in data space, but one that ensures you stay within the manifold of real natural images the whole way. That's because, by construction, any point in that latent space maps to the natural image manifold. If you do that same interpolation now (this was from a paper called BigGAN, back in 2018), you do get stuff that at least looks a bit more like natural images, but it might not really map to reality. You get some really weird things, like this half-bird, half-fly thing, where basically the model's just doing its best. It's doing its best to find some reasonable way to interpolate between these two things while staying within the constraints of that trained natural image manifold mapping. But that Z space, where we're mapping from one point to another in a linear way, might not be the best way to organize the data to interpolate along. So let's look at why that might be.
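The contrast between interpolating in data space and interpolating in latent space can be seen in a deliberately tiny toy. The assumption here is that the "data manifold" is the unit circle and the latent is a single angle, so `generate` is a hypothetical G with no learned parameters. This toy has none of the seams or distortions of a trained latent space; it only shows why decoding an interpolated latent stays on the manifold while averaging two samples does not.

```python
import math

def generate(z):
    """Toy G: a 1-D latent (an angle) mapped onto the unit circle."""
    return (math.cos(z), math.sin(z))

def lerp(a, b, t):
    """Linear interpolation between two scalars."""
    return (1.0 - t) * a + t * b

def distance_from_manifold(x):
    """How far a 2-D point lies from the unit circle."""
    return abs(math.hypot(x[0], x[1]) - 1.0)

z_a, z_b = 0.0, math.pi * 0.9
x_a, x_b = generate(z_a), generate(z_b)

# Interpolate directly in data space: the midpoint is an average of two samples.
x_mid_data = (lerp(x_a[0], x_b[0], 0.5), lerp(x_a[1], x_b[1], 0.5))

# Interpolate in latent space, then decode: the midpoint is itself a sample.
x_mid_latent = generate(lerp(z_a, z_b, 0.5))

off_manifold_data = distance_from_manifold(x_mid_data)
off_manifold_latent = distance_from_manifold(x_mid_latent)
```

Here `off_manifold_latent` is zero up to floating-point error, while `off_manifold_data` is large: the averaged point sits well inside the circle, which is the analogue of the blended, unrealistic image.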
This goes back to the lecture on VAEs, if you remember. You have a data distribution that maps to a latent distribution, and a latent distribution that maps to a data distribution. At the bottom we're coloring the correspondence between those two spaces, as if you were mapping these three things back and forth. So a given color on the natural image manifold, green for example, matches the same green in the latent space. Now let's visualize what happens as this gets trained. As we watch it train, you can see how it goes from the natural image manifold to that latent space. It's almost like you've taken something and tried to crumple it into a ball. Even though we've said explicitly that everything in that latent space should correspond to something on the natural image manifold, there's actually some weird stuff here, these danger zones. Here it's going to be closer to the natural image manifold, but if you're going from here to here, there may be seams along that linear interpolation that correspond to pretty jarring shifts, where things that are nearby in latent space could be quite far away from each other in data space. So with that Z representation, if you do linear interpolation across that seam, the result could be unnatural. That might be one way to motivate the discovery in the StyleGAN paper, which is essentially to use some intermediate representation, somewhere in between, that's not quite fully convex but maybe a little less distorted.
So some intermediate thing where you're less likely to jump across a seam or fall into some danger zone. Basically, somewhere in between might be the right place to interpolate to get things that look more realistic and behave in a more friendly way. Empirically and qualitatively, that's sometimes true. It's our favorite way to say things, right? Eh, sometimes. But it does actually correspond in some nice ways. This was a work called StyleSpace, where they took that W space, that intermediate space, and demonstrated that they could use it to very cleanly isolate and disentangle these dimensions of variation, and they got some really nice qualitative results. Here they're able to vary things at the level of granularity of hood styles, headlight types, body colors, or background. So it does seem to suggest that moving to this more intermediate space, and so removing some of those badly behaved portions of linear interpolation, is beneficial when you're trying to find these disentangled dimensions of variation. Cool. So now let's talk about how you would label this generated data. You have a label for all your real images. Well, maybe you don't, but assume you do. Assume that the data you trained this generative model on was labeled. Now you have some part of this space that corresponds to that real data, but there's a lot of other space that corresponds to purely generated data. The argument here is that, by construction, things that are close together on this manifold should be semantically related. So if you don't move too far on the manifold, the category should stay the same, or at least this is the assumption, right?
Then you can rely on the normal inductive bias of machine learning that assigns similar labels to nearby points, if the space is reasonably well semantically clustered. For example, if these are real data points where you know the true label, then you could assume that things sampled near those data points correspond to similar things. In this region, yes, there are dimensions of variation being captured, but they're not changing the underlying semantic meaning of the object itself. These small variations correspond more to variation in the context, maybe the scene or the visualization, but they're not changing the category that's being captured. This was explicitly explored in a paper called DatasetGAN, where they trained a GAN on one type of data and then wanted to solve a new task related to that data: specifically, labeling the parts of a car with semantic segmentation. What they did is use StyleGAN to efficiently label data with semantic segmentation labels, which can be quite expensive, by defining an additional labeling arm, essentially on the output of a StyleGAN, with just a few examples. Then you can use that model to generate a really large data set of images and corresponding labels. They're weak labels, but they're labels that are explicitly based on this well-structured, well-aligned latent space. Then you can train a part segmentation model on the synthetic data, test it on the real data, and show that it's actually beneficial. To be a little more specific, they train that labeling arm using the latent space and just a really small number of manually labeled examples.
This is valuable particularly because semantic segmentation labels are really expensive. How many of you have ever actually tried to label data for semantic segmentation? Yeah, it sucks, right? It's super slow, especially for the parts that are really fine grained, like the boundaries of objects. It just takes forever to get it really correct. So being able to benefit efficiently from just a few of those labels and then generalize well is pretty valuable. Essentially, you map human labels in pixel space through the network to some of these intermediate features that maintain spatial relationships and arrangement. You take the original data and run it through what they're calling a style interpreter, which is essentially something like a StyleGAN, and that gets you a representation in the form of per-pixel feature vectors. You can then train a really efficient predictive model to map between those features and part labels. It relies on the fact that training this generator forced the model to learn a really well-behaved, well-aligned representation space in terms of these dimensions of variation and similarity. They could show that you can efficiently train accurate predictive models because the features themselves are better organized and more semantically meaningful than the pixels you might be starting with. So instead of going from images and pixel labels to a trained model that generates pixel labels for an input image, you take the input image, run it through to some intermediate feature space, and then train a really lightweight predictive model to go from those features, learned by a generative model, to the correct semantic segmentation.
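The "lightweight model on top of generator features" idea can be sketched in a few lines. This is not the paper's labeling arm: a nearest-centroid classifier is swapped in purely for simplicity, and the 2-D feature vectors and part names below are invented for illustration. The assumption being illustrated is that if per-pixel generator features are already well organized, a handful of labeled pixels is enough to label pixels of newly generated images.

```python
def fit_centroids(features, labels):
    """Fit a nearest-centroid classifier: one mean feature vector per part label."""
    sums, counts = {}, {}
    for f, y in zip(features, labels):
        acc = sums.setdefault(y, [0.0] * len(f))
        for i, v in enumerate(f):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

def predict(centroids, feature):
    """Assign the part label whose centroid is closest to this pixel's feature."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(feature, c))
    return min(centroids, key=lambda y: sq_dist(centroids[y]))

# Hypothetical per-pixel generator features for a few hand-labeled pixels.
labeled_features = [[0.9, 0.1], [1.1, -0.1], [-1.0, 0.2],
                    [-0.8, 0.0], [0.0, 2.0], [0.1, 1.8]]
labeled_parts = ["wheel", "wheel", "window",
                 "window", "background", "background"]
centroids = fit_centroids(labeled_features, labeled_parts)

# Per-pixel features of a newly generated image: labels now come from the model.
new_pixel_features = [[1.0, 0.0], [-0.9, 0.1], [0.05, 1.9], [0.8, 0.3]]
synthetic_labels = [predict(centroids, f) for f in new_pixel_features]
```

Repeating the last step over many generated images is what would produce the large synthetic image-and-label data set used to train a downstream segmentation model.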
They had some pretty nice examples of how they were able to do this, and they were able to show qualitatively and quantitatively that it worked quite well. They got this nice result, which is that a single labeled GAN image was worth about a hundred labeled regular images, in terms of training efficiency for the same level of accuracy. These types of things are increasingly commonly explored, particularly for labels that are really expensive to collect. Interestingly, this question of how you efficiently train these types of things on top of well-learned representation spaces or foundation models, even ones that are not necessarily generative, has also been really well explored. These days, if you're trying to do semantic segmentation, most people start from something called SAM, the Segment Anything Model, which relied essentially on building a really large data set using some pretty cute hacks for self-supervision. You take an object and use copy-paste augmentation to guarantee that you have the correct mask, now in a bunch of different relationships and orientations. You use some similar ideas from something like DINOv2, which really captures fine-grained semantic meaning in those feature spaces, and then you're able to learn a really generalizable and robust segmentation model on top of that. Cool. So you can also think about how these generative models, this data++ approach, can start to teach you how to explain things, or increase the interpretability of some of these dimensions. In standard classification we take an image, run it through a classification model, and it tries to predict a category, say cat.
But now the question is: can we say something interesting about why the image was classified as a cat, using this ability to explore counterfactuals in the data++ space? You take the cat from data space and turn it into this quote unquote data++ space, where now you have some encoder and some generator. You're mapping through your G inverse and then through your G to get the same data back. What you can do now is look at the class that's predicted on the generator's output and explicitly perturb the latent space to understand how that would cause your prediction to change. This gives you a ranking of the changes in the latent space, the manipulations of these latent variables, that maximally lead to changing a concept. It helps you understand which features of this image are most identifiable as the category of interest. This approach, StylEx, which is almost like a style space explanation, tries to find the top-k style space directions: the latent space directions of variation that most affect which class the model predicts. There's some interesting stuff here. You can look at what happens if you open the mouth of the animal, and it turns out that is a really strong dimension of variation for moving the prediction from cat to dog, probably because dogs tend to pant a lot more than cats do. So it's a very cute interpolation. [laughter] You can also look at changing the size of the pupils relative to the eyes, and that also shifts the probability that you have something like a cat. Or you can think about changing how pointy the ears are. That also tends to correspond, from an explainability standpoint, to why the model thinks it's a cat versus something else, right?
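A stripped-down sketch of the ranking step described above might look like this. It is a simplification under stated assumptions, not the published method: `generate` is a toy elementwise map standing in for G, the classifier weights are invented, and only single-coordinate perturbations with a fixed `step` are tried. The output is a ranking of latent coordinates by how strongly they move the classifier's prediction on the generated output.

```python
import math

def generate(z):
    """Hypothetical G: a fixed elementwise squashing of the latent."""
    return [math.tanh(v) for v in z]

def classifier_prob_cat(x):
    """Toy classifier with made-up weights; returns the probability of 'cat'."""
    weights = [0.2, -3.0, 1.5, 0.0]
    logit = sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-logit))

def rank_directions(z, step=0.5, top_k=2):
    """Rank latent coordinates by how much perturbing each shifts the prediction."""
    base = classifier_prob_cat(generate(z))
    effects = []
    for i in range(len(z)):
        z_shifted = list(z)
        z_shifted[i] += step
        change = abs(classifier_prob_cat(generate(z_shifted)) - base)
        effects.append((change, i))
    effects.sort(reverse=True)
    return [i for _, i in effects[:top_k]]

z_image = [0.1, 0.0, 0.2, -0.4]  # latent for one image, e.g. obtained via G inverse
top_directions = rank_directions(z_image)
```

In this toy the second and third coordinates dominate because the classifier weights them most heavily. Turning each top coordinate into a human-readable concept (open mouth, pupil size, ear shape) still requires looking at the generated images.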
This is just another mechanism to try to probe the underlying things that the model has learned or understood. And you could imagine that this again tells you something about the underlying bias in our training data, because there are many cats that have shorter ears like that. British Shorthairs have short ears just like that. But more often than not, something that has long pointy ears is a cat, and something with shorter or folded-over ears might be a dog. So I think it's capturing some dimensions of bias in the data as well. You can think about this in terms of how we generate class-specific explanations. If you have something like a perceived age classifier, then you can start really probing these weird dimensions of bias and see how they affect the age the model perceives. So now the model is predicting an age. One of the things that's quite odd here is that thicker eyebrows correspond to a more youthful age. Lighter skin corresponds to a more youthful age, though you'll also note that it's lighter but also sometimes less textured, so it's not a super disentangled dimension of variation. Then there's something like adding glasses, or, very obviously, going to gray hair. These were all learned by finding the dimensions of variation that corresponded to specific types of prediction. And it maybe makes sense, right? In terms of your data bias, older people might be more likely to have white hair. That's something that's actually pretty ubiquitous.
But you could also imagine that some of these dimensions of variation would not be something you would want a model to learn, right? These might capture dimensions of bias that relate to the fairness or equity of a model that we deploy. So then what do you actually do with what you learn? Well, one way we could look at this is: instead of looking at these dimensions of variation on internet images, can we take it to scientific or medical data? So here, this is using that same type of explainability mechanism from a generative model to try to understand how to categorize different retinal fundus images. Here we're showing the top four examples of things that real doctors are looking for, and then how you might change the image maximally to correspond to something predictive of a specific disease, for example. And this is quite interesting, because if you show these to a doctor, it does correspond with the types of even very fine-grained features that a doctor might use to categorize a specific type of disease. So from an explainability standpoint, it's almost a way to build trust in a model, because an expert would say, "Okay, yeah, I do agree that that is a reasonable dimension of variation that I would also associate with a specific type of disease." It's almost a reassurance that the model isn't picking up too many odd correlative factors that we don't want a model to rely on. But of course, one of the really big limitations of all of these approaches is that you have to discover those latent-space variations and then manually analyze or define what they might correspond to, and that takes a lot of effort, right?
So getting to the point where they had these examples of specific dimensions of variation that corresponded really well to these real diseases, I'm sure it took a lot of handcrafting and a lot of time. And so recent work tries to do similar types of counterfactual reasoning or understanding, but here just using text conditioning for modern diffusion models, or other types of conditioning with diffusion models, which give us really simple and interpretable mechanisms for control. Though of course there's not always perfect alignment between your text control and what actually gets generated by the generator. But it does make testing these counterfactual hypotheses pretty efficient in many ways. So I actually showed this work before, but this is work where they very specifically looked at using almost human intelligence to generate possible counterfactuals, using text-conditioned image generation, and then explicitly testing quantitatively the performance of models given these different human-derived counterfactuals. And they showed that it's just a really nice way to essentially build interfaces to your data sets: using the generator trained on top of the data as a direct investigation or interaction mechanism with the underlying real data set itself, to understand its dimensions of bias. Cool. And then I think there's also this question: okay, the generated data is potentially useful for explainability or for data set exploration, but is it also useful for representation learning? Can you learn from generated data in the same way, or maybe even in a better way, than you can learn from real data? And this is where it gets kind of complicated.
So here's a work where they basically said: okay, you have some data set X, and you're going to train a generative model on top of that data set. The idea is that the generative model trained on the data set is going to capture either the same information as the data set, or arguably slightly more information than the underlying data set, and that might be useful for representation learning. So here, similar to that other example with the multi-view ensembling from GANs, they're explicitly looking at your ability to generate the same real image from different perspectives or different views, and then use those as the input to contrastive learning models. So you're trying to learn a representation space where you're saying: all right, here's two images from the same category, here's one from a different category, and then you're training with that contrastive-style loss. >> Yeah. >> So is this like training from scratch, or is this transfer learning? >> Yeah, that's an interesting question, right? I mean, arguably this is a mechanism of transfer learning, assuming that >> Well, I guess, to get to the original model, was that built on some pre-trained model for image analysis? >> Which original model? We're talking about kind of a system here. So this generative model, let's assume for the sake of simplicity that it's trained from scratch on that data set, because otherwise it gets even more complicated to understand, right? And I think this is actually one of the difficulties currently: there's this broader meta-argument going on in the community about whether it is possible to get more information out of a generator trained on data than out of the underlying data itself. It's basically from an information theory perspective.
It's like, how could you possibly create something from nothing, right? If you have the data and that's all the information you have, you couldn't get more information out of it. But then there's the counterargument, which is that our architecture design and the design of our training systems build in knowledge that would not otherwise have been built in. You could argue that the construction of things like convolutional neural networks is fundamentally an injection of additional knowledge beyond just the underlying data, right? We're building knowledge into the design of our architectures and our loss functions based on our understanding of, for example, what structure we expect to see. So convolutional neural networks basically say we expect there to be local structure in images. And then the argument is: isn't data augmentation such a mechanism too? You could argue that random flipping is an image generation algorithm, right? Depending on how you want to define that semantically, it is: you've taken an image and used some algorithm to generate a new image that you guarantee still has fidelity to the class you care about. And then things like this copy-paste, cut-and-paste style data augmentation are getting maybe even closer to something like a learned generative model, because now you're building in explicit assumptions about taking foreground objects and putting them on different backgrounds. This is a really useful mechanism for increasing the robustness of machine learning models, just these engineering hacks about how we manipulate the data during training. And then maybe the argument is: how is that any different from learning a generator that can interpolate in something closer to the image manifold?
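To make the "augmentation is image generation" framing concrete, here is a minimal sketch of the two hand-designed generators mentioned above. The tiny nested-list images, the binary `mask`, and the function names are illustrative assumptions; real pipelines would operate on image tensors and pick random paste locations.

```python
def horizontal_flip(image):
    """Generate a new image by mirroring each row left to right."""
    return [row[::-1] for row in image]

def copy_paste(foreground, mask, background):
    """Paste foreground pixels onto a background wherever mask is 1."""
    return [
        [f if m else b for f, m, b in zip(f_row, m_row, b_row)]
        for f_row, m_row, b_row in zip(foreground, mask, background)
    ]

# A 2x3 toy "image", an object mask, and a plain background.
image = [[1, 2, 3],
         [4, 5, 6]]
mask = [[0, 1, 1],
        [0, 1, 0]]
background = [[9, 9, 9],
              [9, 9, 9]]

flipped = horizontal_flip(image)              # same class, new pixels
pasted = copy_paste(image, mask, background)  # same object, new context
```

Both functions encode prior knowledge by hand: flipping assumes the class is invariant to mirroring, and copy-paste assumes the class is carried by the foreground rather than the background.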
So it's a bit semantic, but anyway, I've actually been thinking about this a lot. I have a grad student who's been looking at this. I really specifically want to be able to use generative models to do a better, more robust job of recognizing rare things. But now you've really gotten yourself into a chicken-and-egg problem: because the thing is rare, it's hard to train a generator that does a good job of generating it, and the reverse is also true. And there have been people who have shown that you can improve rare-category categorization using generative models, but they cheat: they train the generative models on a lot of data. So far we have been able to find ways where you can get some gains in very specific constrained scenarios, but it's not naive and simple. Anyway, the point here is that you can take the classical contrastive learning approach, where we're using our knowledge-based image generation, where we've applied specific types of warping, random cropping, flipping, or color jitter to the initial image in a way that we know doesn't break the category boundary. We still want it to be the same thing. And here, instead, we're going to move just a little bit in some latent space to create a mapping in data space through what should be the manifold of real images, and then use that as the transformation for your contrastive learning model. >> Yeah. >> I had a question about this, which relates back to earlier when we talked about the idea of small local movement in the latent space not crossing a boundary. I mean, intuitively, if you have just a 2D latent space and a 10-class classification problem, you could imagine some very naive pie chart that has 10 slices, and there has to exist a boundary somewhere between category one and category two. >> Yes.
>> So is this just a probabilistic argument, that in a high enough dimensional latent space the odds that you would sample a location where the boundary shifts are so low that you can get away with this kind of approach to training for contrastive learning? >> I would even go beyond that. I would say, as long as you define some dimension of movement that most of the time is not crossing a class boundary, then even if it does cross a class boundary a very small amount of the time, by the way we train these models the noise will average out. It's almost a mechanism for regularization. Of course there are counterarguments to that, right? It depends on how egregious those boundary crossings are and how frequently you're crossing them. But yeah, if you want to make a probabilistic argument, and you want to map that to something like a physical area argument: take the area of the space where moving a small amount maintains the category, relative to the areas where you are crossing category boundaries. As long as that ratio is quite large, I imagine you would still learn something useful even if sometimes it's wrong. It's just like how, in the standard contrastive learning objectives, we did see some improvement when you could explicitly remove these false-negative type things, where within your batch you would have maybe two images of dogs and they're being treated as different categories during your representation learning. You're explicitly saying those things should be far apart, though they actually represent the same category. If you could remove all of those, models did train better, right?
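The false-negative issue can be illustrated with a small InfoNCE-style loss. This is a sketch under stated assumptions: the 2D embeddings, the labels, the temperature of 0.1, and the helper `drop_false_negatives` are all made up for illustration, and label-based filtering is only available when you have supervision.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

def info_nce(anchor, positive, negatives, temperature=0.1):
    """Contrastive loss for one anchor: pull the positive in, push negatives away."""
    a = normalize(anchor)
    pos = math.exp(dot(a, normalize(positive)) / temperature)
    neg = sum(math.exp(dot(a, normalize(n)) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))

def drop_false_negatives(anchor_label, candidates, labels):
    """Keep only candidates whose label differs from the anchor's label."""
    return [c for c, y in zip(candidates, labels) if y != anchor_label]

anchor = [1.0, 0.0]     # embedding of a dog image
positive = [0.9, 0.1]   # another view of the same image
candidates = [[0.8, 0.2], [0.0, 1.0]]
labels = ["dog", "cat"]  # the first candidate is a false negative

loss_all = info_nce(anchor, positive, candidates)
clean = drop_false_negatives("dog", candidates, labels)
loss_clean = info_nce(anchor, positive, clean)
```

With the second dog left in the negatives, the loss pushes two dog embeddings apart; filtering it out removes that contradictory term, which is why `loss_clean` comes out lower than `loss_all` here.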
There were these nice examples where you could show that if you removed these false characterizations, by using supervision for example, you got better representation learning. But the model is still able to learn useful representations even when those do exist in there. Yeah. Cool. So now here it's just a different mechanism for constructing the positive pairs to train your contrastive learning algorithm. As opposed to your standard SimCLR-type views, where you have two views of the same thing that are maybe cropped from different areas or warped in terms of the color, now you're taking latent views, right? You're taking some ball around the real image in latent space, and any direction you move within that ball, you could argue, gives a different view of the same thing. And then, just as we might say in the standard setup that these are all different views of the same thing for the purpose of contrastive learning, here you can say these are all different views of the same thing. Now, one interesting dimension here, just generally, is this question of diversity versus fidelity when it comes to the usefulness of the training signal. This is also something I've been thinking about a lot. It's probably more useful to know that two things that are more different from each other are in fact the same than it is to know that two things that are almost identical to each other are in fact the same. So for example, if you look at this middle set here, where these are all American robins, the pose of the American robin in basically all of these is nearly identical. The construction of the image, the orientation, et cetera, is almost identical. There's not a lot of diversity here. And so, from a learning signal perspective, this might arguably be less useful than something that also does something like random flipping, right?
And so I think there is this interesting challenge with these generative models broadly, with GANs, with diffusion models, et cetera, even with some of these personalization-style models that are trying to do a really nice job of capturing something specific. Often there's almost a Pareto frontier with these models when it comes to the trade-off between diversity and fidelity. You can't get the model to generate things that are really diverse, maybe in terms of their context or their scene structure, without giving up fidelity, right? You end up in this space where you can generate a robin that's positioned differently in the scene, maybe flying, in a different pose, and you can try to force the model to do that and it will, but it probably won't be very robin-like anymore if this is the input image that you're working from. Yeah? >> In the latent space, when you're moving around an image's feature vector, you just say that the images around it belong to the same class. How do you know how far you can go, and is that distance the same for all classes? I can imagine classes, like one class, like over >> Yeah. Yeah. So the question is basically: how do you know how far you can go in the latent space and have it still map to the same category, and is the optimal distance you could go for any given category the same across all categories? I would say that experimentally, in a lot of these papers, they're using this as a hyperparameter, right? They're basically saying: all right, we're going to define some radius in something like cosine distance, we're going to sample within that radius, we're going to test different radii, and then we're going to pick the one that does the best on our test data.
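A rough sketch of the radius idea follows. Several things here are assumptions rather than anything from the papers discussed: it uses a Euclidean ball instead of cosine distance, a made-up 2D latent space, and a hypothetical `toy_label` rule standing in for the true class boundary. It only measures how often a sampled latent view crosses that boundary as the radius grows.

```python
import math
import random

def sample_in_ball(center, radius, rng):
    """Draw one point uniformly from the Euclidean ball around `center`."""
    d = len(center)
    direction = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in direction))
    # The d-th root makes the samples uniform in volume, not in radius.
    scale = radius * rng.random() ** (1.0 / d) / norm
    return [c + scale * x for c, x in zip(center, direction)]

def toy_label(z):
    """Hypothetical class boundary: the sign of the first latent coordinate."""
    return 1 if z[0] > 0 else 0

def crossing_rate(center, radius, n_samples=1000, seed=0):
    """Fraction of latent views whose label differs from the anchor's."""
    rng = random.Random(seed)
    anchor_label = toy_label(center)
    crossed = sum(
        toy_label(sample_in_ball(center, radius, rng)) != anchor_label
        for _ in range(n_samples)
    )
    return crossed / n_samples

anchor = [1.0, 0.0]  # sits at distance 1.0 from the toy boundary
rates = {r: crossing_rate(anchor, r) for r in (0.5, 1.5, 3.0)}
```

In an actual pipeline each sampled latent would be decoded by the generator into an image view, and the radius would be selected by downstream performance on held-out data, since the true class boundary is not available the way `toy_label` is here.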
That's a bit disappointing, because I actually think that's a really interesting question, right? How do you determine how far you can go before it crosses a class boundary? And you start getting into these really messy questions, particularly as these models are able to generate things that are realistic-looking but impossible. So Phil and I actually argue about this quite a lot. If you had a picture of your mom, but she had another eye in the middle of her forehead, is it still your mom? Right? What do you guys think? Is it still your mom if she has three eyes? [laughter] There's not a right answer here, right? And so some of the dimensions of interpolation you can move along are still realistic-looking but actually impossible. If you think about this from a species categorization perspective, you take this American robin and you make the chest gray instead of red: that's not a real bird. So now you have things that look like birds, but they don't actually correspond to any real category of bird. And then what should that be categorized as, right? And so a really important component of understanding the boundary of a class is having representative data from all the dimensions of real variation for that class, so you can define what the reasonable class boundary is. That's one of the reasons that few-shot learning is so hard: it often means we don't have a well-understood representation space of the boundaries of the real class. And so we don't know, right? If you've never seen a picture of a bird as a juvenile, we don't know what the real class boundary should be. >> Yeah. >> Isn't that
Original Description
MIT 6.7960 Deep Learning, Fall 2024
Instructor: Sara Beery
View the complete course: https://ocw.mit.edu/courses/6-7960-deep-learning-fall-2024/
YouTube Playlist: https://www.youtube.com/playlist?list=PLUl4u3cNGP63URZnh5iqBzDTDYPUTQT-8
This video explores transfer learning with data, covering generative models as data augmentation, domain adaptation, and prompting techniques.
License: Creative Commons BY-NC-SA
More information at https://ocw.mit.edu/terms
More courses at https://ocw.mit.edu
Support OCW at http://ow.ly/a1If50zVRlQ
We encourage constructive comments and discussion on OCW’s YouTube and other social media channels. Personal attacks, hate speech, trolling, and inappropriate comments are not allowed and may be removed. More details at https://ocw.mit.edu/comments.