Distribution Augmentation for Generative Modeling

Connor Shorten · Advanced ·🧬 Deep Learning ·5y ago

Skills: LLM Foundations80%Multimodal LLMs60%Advanced Prompting50%

Key Takeaways

The video discusses Distribution Augmentation for Generative Modeling, a technique developed by OpenAI that improves generative models using data augmentation, and explores its applications and benefits in various scenarios, including image generation and sequence modeling, utilizing tools like OpenAI's Image GPT model, GPT-3, and FAISS.

Full Transcript

this video will explore distribution augmentation for generative modeling developed by researchers at open AI in addition to excitement about GPT three and its ability to generate stories and write code open AI has also recently published their imaged GPT model this was a 6.8 billion parameter auto regressive generative model similar to how GPT three works but on image net images and modeling pixels rather than language tokens distribution augmentation is a study about how to use data augmentation to improve generative modeling particularly with these auto aggressive models data augmentation describes semantic label preserving transformations to data like rotating it or making it more blue or red that has been a massive workhorse for deep learning and computer vision dist aughh takes on a multi task approach to embed the transformation of the data into the start of sequence token making it so the model can distinguish the true P of X data distribution compared to P of T of X after the data has been augmented or transformed the paper ends with a plot showing the benefits of more augmentation and more scale hinting that this could be a huge contributor to an image GPT to sequel this video will explain the algorithm and go through the details of the experiments in distribution augmentation from open AI [Music] this video will explain distribution augmentation for generative modeling developed by researchers at open AI this work suggests continued interest in developing these large-scale autoregressive generative image bottles and perhaps developing an image GPT to model data augmentation has been one of the key workhorses in computer vision models and deep learning this is where we take images and then we augment them by rotating them or doing these color injections or blurring them with something like a Gaussian blur adding noise or all these things that we can do to images in order to either do these two different ways of looking at how exactly data augmentation is improving these models one view of it could be that it's preventing overfitting because we just have more data and we're not overfitting to this one view of the cat and we have these other views that kind of broaden out the generalization and kind of the region in space of this cat image is occupying in this high dimensional image manifold so then another way of looking at it would be that this is a way to inject inductive biases it isn't just that we're increasing the size of our dataset is that we're telling it to be invariant to rotating the cat it's an inductive bias to tell it that that an upright cat is still a cat when it's rotated or when it's pink it's still a cat so we can either see it as these two ways is preventing overfitting by giving us a larger training dataset and we're also this is a tool to put in these inductive biases of these priors into our deep learning system so a quote from the paper is that gains are not only due to the scale of augmentations applied but also due to the helpful inductive bias of certain transformations so data augmentation is not just about making a bigger dataset it's also about the ways that these data augmentations are providing priors to the deep learning system personally I've been really interested in data augmentation as a tool for enhancing these deep learning models and using this kind of data space regularization in this survey paper I basically just tried to find all the different image data augmentations that are being used and found things like these geometric transformations the rotations the translations random erasing where we're blurring out the cropping out a region like a rectangle in the image color space transformations making the cat pink mixing the images just smashing the pixels together taking some average of two images and then kernel filters like applying a Gaussian blur over the image and then some other more interesting things like adversarial training and when you produce adversarial examples and then you train the model on those adversarial examples kind of what happens as you do that at scale can we use neural style transfer this kind of funny thing of converting the style of like a van Gogh painting into a cat image could that be useful for data augmentation and congenital x' be used as data augmentation for image classifiers and then these meta learning approaches that have this like higher level of controlling the strength of data augmentations so if you're interested in looking at this survey paper it's just an overview of all these different kinds of data augmentations that are being used in miscellaneous computer vision papers so it's very obvious how to apply data augmentation to supervised learning algorithm you just augment the data and then just add it back to the training set and just treat it like any other training data point it's also been pretty easy to apply in contrastive self supervised learning where we augment to use of an image and then use those as the positive pairs and then the other images as the negative pairs do this self supervised representation learning but applying data augmentation for generative models like Ganz these auto regressive models or variational autoencoders or flow based models is a lot harder because these augmentations leak into the generated data distribution this generative model is learning this P of X the distribution of the data and as you start to apply P of T of X T being the transformation the generative model is not going to know this and it's not going to be able to invert the true P of X so it's going to start producing data that's been augmented so this is some pictures from training ganz with limited data where they show that if you rotate the images the Gann produces rotated images as well if you apply a blue histogram shift the Gann produces the generator produces blue images as well before we get into how this AAG is going to use data augmentation for auto regressive generative models here's some of the previous work that's been really successful with applying data augmentation in the generative adversarial Network framework on the left is balanced consistency regularization the idea here is to have a consistency loss on the discriminator between original images and then augmented views of that same image so you would have something like an l2 loss on the log the real fake logit prediction on the two images contract like a contrastive loss learning algorithm or your contrasting the difference between a real an augmented view of that image and then this latest approach training generative adversarial networks as limited data uses this discriminator goggles framework where the real and the generated data both go through an augmentation before they get to the discriminator and the idea there is that this augmentation has to be carefully designed such that the discriminator can invert the P of X from the strength and the you know the way that you do this sampling of the augmentations the idea behind dist aughh is inspired by it approaches to multitask learning where you'll condition on some task information to inform the model about what task it's currently performing so rather than just training the model on modeling the density of augmented images sister you're doing Pia theta of X in this image and then PF ADA of t1 of X T 2 of X and T 3 of X where t1 t2 and t3 our augmentations sampled from a family of transformations T capital T and so this is describing the family transformations it could be something like Rand augment from researchers at Google where they have this curriculum defined by n which is the number of data augmentations to apply in sequence and then M is the strength of those augmentations so you might apply three augmentations like rotation translation and then horizontal flipping and then they have a magnitude like rotated 30 degrees compared to 5 degrees or 45 degrees and then how strong we're going to zoom in translate all these different things so you're sampling these data augmentations from a family of transformations that would be denoted capital T although not in these images so the idea behind dis dawg is that we're not just going to model the density of the transformed images and treat it as if it's any other data point in the data set that's learning this autoregressive conditional probability model where you're modeling say you're at this pixel you're putting the probability mass on this pixel based on all these previous pixels that you've seen it is just kind of like left-to-right auto regressive generative model but the idea in dis talk is to condition on the transformation of the data so we'll have these embeddings where we map the different augmentations into some like dense vector representation and we're going to condition the auto regressive generative model by injecting this into the start of sequence token so it's going to be doing this multi task learning where it's not just going to model the probability density as if this rotated pink cat is any other image of the cat rather is conditioned on this transformation of the image in this slide we'll try to get a better sense of how they're going to embed the transformation information into the start of sequence token for auto regressive language modeling so this images from the image GPT paper generative pre-training from pixels but they show the difference between Auto regressive modeling and something like a denoising auto-encoder masked language modeling like was used to train burt in language modeling so we're gonna inject this task information the sampled transformations T sub 0 t sub 1 T sub 2 which is the parameters of how we augment the image so within this embedding it has to encode that the cat's been rotated it's been made pink or it's been blurred out and made green so there would be many ways of doing this multitask conditioning the most naive way to think about this would be to have something like a one hot encoded vector where you have like one zero zero zero zero one zero zero what that denotes the transformations that have been selected in this like discrete categorical set of different parameters of the augmentations but probably what they do and I'm not exactly sure about I'm certain that they don't do use a one hot encoder conditioning but they probably map that one hot encoded vector that represents the categorical variable of the transformation into like a dense vector lookup table kind of similar to how like word token embeddings go into this lookup table before they get into a language model so they probably use some kind of conditioning like that to get it into the startup sequence token so the model can make use of this dense vector that represents the task is trying to perform which is modeling the density of these pixels given this transformation of the original image so similar to GPT GPT 2 and GPT 3 open AI researchers are going all in on this Auto regressive generative modeling compared to something like a generative adversarial Network framework or a variational auto encoder and the difference here is that rather than just going from X into a high resolution image through a series of up sampling convolutional layers or up sampling convolutional layers interleaved with self attention layers or some kind of variational encoder encoder decoder framework they're going to be modeling the density of this pixel given these pixels and they're doing it as if this is just a sequence of sequence thing like what's used in language modeling so at the start of this sequence is going to be the embedding of the transformation of the image and then they're going to use that to condition as they continue the generation so these are some examples from the original image gbt paper where they use something like a six billion plus parameter model to model all these images and then probably some stochastic sampling in the output that leads to how they get these different completions but now we're going to be able to get different completions by giving a different conditioning with the transformation and to generate just a rigid data like this upright cat they're going to just condition on the identity function so if it's not augmented at all they're not just gonna not condition it at all they're still gonna have an embedding for an identity transformation in the conditioning on the sequence model so the math in this picture is showing how this dog is a data dependent regularizer and we have this Omega term that we can use to weight different transformations differently than others so say we want to have a curriculum of the loss function on how much we're going to penalize the model for incorrectly modeling different transformations so modeling this rotated and pink image of the cat is harder than just this slightly rotated version of the cat so we could have this curriculum of how much we penalize the model and we can do all these different things to focus on different kinds of transformations and this different data dependent regularizer meaning that we can increase the strength of the loss function based on these different subsets of the data and the different augmentations applied so as described previously Discogs inspired by these multitask learning algorithms that will condition a model on the task it's trying to perform so it's learning how to do all these tasks these different transformation PT of X given T sampled from this family of transformations capital T and then we see this difference between the images when we apply this conditioning information compared to if you just don't tell the model at all or don't give it that embedding in the start of sequence that is being modeling this certain transformed version of the image the author's test out dis dog with a one hundred and fifty-two million parameter auto regressive model on the sea far ten data set and they show some pretty good results you can see examples of the generated images here and on this side you see a comparison with nearest neighbors from the sea far ten dataset in the inception embedding space so you would take all the images from sea far ten and from the auto aggressive language our Auto regressive pixel model and you would pass them through the inception image classifier and then when you have these vector representations from the inception model you'll cluster them based on those higher dimensional vectors and then look for the nearest neighbors in that kind of clustering space or I guess you don't have to actually cluster them you can just do the nearest neighbor lookup directly from the betting vectors but so this shows the nearest neighbors and you see in most cases it doesn't look like there's much overfitting maybe these frogs definitely these cars and then you see you know pretty unique images based on this kind of nearest neighbor in the inceptions feature space comparison so this table is one of the more exciting things about this paper that's hinting at a potential image GPG to with even more parameters and scaling up this dist aughh technique to make image gbt work even better for say image net modeling and also for representation learning in the process so this is showing that as you increase the number of parameters and the strength of the data augmentation this is rotation maybe translation colorization and the jigsaw augmentation you see this trend in more data augmentation and more model parameters leads to better performance and this is the bits per dimension evaluation metric where lower is better and then you see the baseline model is interesting as well you see without any augmentation the as you increase the number of parameters in c far 10 it goes up maybe it's over fitted to the 32 by 32 c far 10 images but so this is a promising trend and showing that this technique may be really important for scaling this up further the original image GPT paper takes on the image net data set down sample to 64 by 64 resolution images but unfortunately the dist OGG augmentation doesn't really result in massive performance gain on this data set it only really works well on the much simpler c far 10 32 by 32 data set so this hints that I may need a massive scale to make this work well on complex datasets like image that even the downsampled 64 by 64 version of image net although there's still so much diversity in that data set C 410 is something like 50,000 images compared to like a million images and image nets so the diversity is also a huge factor that would make generative modeling more difficult but you see this slight improvement from going from 152 million to 303 million parameters but image gbt is something like 6 or 7 billion parameter so it's like we have to scale it up like that as well to see this work on image net so here's another interesting finding from the paper that's interesting for deep learning models in general they find that with regularizing these deep generative models you get more out more data augmentation and then less drop out so drop out is another way of if we're thinking of data augmentation as a way of regularizing our deep learning models and preventing overfitting rivaling technique would be something like dropout or stochastic path dropout where we take a neuron the network and we just exit out as we're going through the forward pass or we X out a whole path in the network so that would be one way of looking at regularizing deep learning models another would be something like weight decay or l2 regularization x' on the magnitude of the parameters in the network but what they show is that instead of doing this you get better results from increasing the amount of data augmentation as you see this metric going down and I think this is interesting as well because especially things like the lottery ticket hypothesis and showing that these sparse and neural networks are what is the goal of this network it's not good it's not like dropping this out and then making it use the full forward pass of an entire width or depth of the network is going to make it a better network because it seems to be biased towards a sparse activation anyways so this drop out approach maybe isn't is kind of at least it also is an intuitive compared to data augmentation where we also are injecting these inductive priors through this interface with our deep learning models so I just think it's an interesting plot and I personally also think that more data augmentation is a more interesting way of regularizing these models its interpretive all compared to something like drop out and I just think it makes more sense with these models these are the results of trying dist aughh with a 15 million parameter auto aggressive model and see far 10 compared to the scaled up 152 million parameter version so this is one of the interesting findings behind this table on the smaller scale 15 million parameter model they find that the rotation augmentation works better than the jigsaw augmentation so this is an example of the jigsaw annotation you take an image and you partition it up so you have these different like tiles or puzzle pieces of the image and then you just scramble them to get another data point so you would replace this ground image with the horses back and just flip it and then add this one to the data set as well but they find that the rotation where you just rotate the images works better than the jigsaw augmentation at this 15 million parameter scale so it kind of comes down to the core view of how data augmentation is it just about amplifying the data and training on a data set of size 2n or 4n compared to n and is that what's preventing overfitting and then connecting this path along this natural image manifold that's super high dimensional so it makes sense that we would want to have some neighbors in this high dimensional space to connect the path or is it a way to inject priors in the model akin to convolutional layers so it's definitely doing both of these things it's definitely also you know amplifying the data and then also injecting priors but the Prioress thing might be the more interesting way of thinking about it because the rotating the image it creates a more semantically similar image in the data set compared to this kind of rearrangement of the image that really so especially this one doesn't really mean much so it's kind of interesting to think of exactly what is the benefit of doing this data augmentation thanks for watching this overview of dis table for generative modeling developed by researchers at open AI is going to be really exciting to see what they do with this algorithm as they scale up the number of transformations maybe integrate some new Transformer architecture and scale up to a massive amount of parameters for maybe the subsequent image GPT to this paper also covers using this technique with a pixel CNN model and a flow based generative model as well as using text data augmentations that I recommend checking out the paper to get more details about thanks for watching and please subscribe to Henry AI labs for more deep learning and AI videos [Music]

Original Description

This video explains a recent paper from OpenAI exploring how to improve generative models with data augmentation. DistAug conditions models on the transformation in a multi-task learning way. This results in improved performance, particularly with more parameters, more augmentations, and less dropout! Thanks for watching! Please Subscribe! Paper Links: DistAug: https://proceedings.icml.cc/static/paper_files/icml/2020/6095-Paper.pdf ImageGPT: https://openai.com/blog/image-gpt/ A Survey on Image Data Augmentation: https://link.springer.com/article/10.1186/s40537-019-0197-0 Training GANs with Limited Data: https://arxiv.org/abs/2006.06676 Chapters 0:00 Beginning 1:36 Data Augmentation in Computer Vision 4:20 Challenge of Data Aug in Generative Modeling 6:24 DistAug, condition on augmentation embedding 8:20 Start of Sequence token embedding 11:20 Data-Dependent Regularization and Multi-Task Learning 12:32 Examples of Generated Images and Nearest Neighbors in the original dataset 13:34 Benefits of Scale - ImageGPT-2? 15:23 Regularizing Deep Learning, Data Augmentation vs. Dropout / Weight Decay / L2 Regularization 16:53 Why does Rotation augmentation work better than Jigsaw?

Watch on YouTube ↗ (saves to browser)

Sign in to unlock AI tutor explanation · ⚡30

Playlist

Uploads from Connor Shorten · Connor Shorten · 0 of 60

← Previous Next →

DeepWalk Explained

DeepWalk Explained

Inception Network Explained

Inception Network Explained

Progressive Growing of GANs Explained

Progressive Growing of GANs Explained

Improved Techniques for Training GANs

Improved Techniques for Training GANs

Word2Vec Explained

Word2Vec Explained

Must Read Papers on GANs

Must Read Papers on GANs

Unsupervised Feature Learning

Unsupervised Feature Learning

Self-Supervised GANs

Self-Supervised GANs

Embedding Graphs with Deep Learning

Embedding Graphs with Deep Learning

Transfer Learning in GANs

Transfer Learning in GANs

ReLU Activation Function

ReLU Activation Function

AC-GAN Explained

AC-GAN Explained

SimGAN Explained

SimGAN Explained

DC-GAN Explained!

DC-GAN Explained!

ResNet Explained!

ResNet Explained!

Graph Convolutional Networks

Graph Convolutional Networks

Neural Architecture Search

Neural Architecture Search

Video Classification with Deep Learning

Video Classification with Deep Learning

BigGANs in Data Augmentation

BigGANs in Data Augmentation

Introduction to Deep Learning

Introduction to Deep Learning

EfficientNet Explained!

EfficientNet Explained!

Self-Attention GAN

Self-Attention GAN

Curriculum Learning in Deep Neural Networks

Curriculum Learning in Deep Neural Networks

Deep Learning Podcast #1 | Edward Dixon | Stochastic Weight Averaging

Deep Learning Podcast #1 | Edward Dixon | Stochastic Weight Averaging

Deep Compression

Deep Compression

Skin Cancer Classification with Deep Learning

Skin Cancer Classification with Deep Learning

Deep Learning Podcast #2 | Edward Peake | Deep Learning in Medical Imaging

Deep Learning Podcast #2 | Edward Peake | Deep Learning in Medical Imaging

The Lottery Ticket Hypothesis Explained!

The Lottery Ticket Hypothesis Explained!

GauGAN Explained!

GauGAN Explained!

AutoML with Hyperband

AutoML with Hyperband

DL Podcast #3 | Yannic Kilcher | Population-Based Search

DL Podcast #3 | Yannic Kilcher | Population-Based Search

Weakly Supervised Pretraining

Weakly Supervised Pretraining

Image Data Augmentation for Deep Learning

Image Data Augmentation for Deep Learning

Unsupervised Data Augmentation

Unsupervised Data Augmentation

Wide ResNet Explained!

Wide ResNet Explained!

RevNet: Backpropagation without Storing Activations

RevNet: Backpropagation without Storing Activations

GANs with Fewer Labels

GANs with Fewer Labels

BigBiGAN Unsupervised Learning!

BigBiGAN Unsupervised Learning!

Self-Supervised Learning

Self-Supervised Learning

Multi-Task Self-Supervised Learning

Multi-Task Self-Supervised Learning

Self-Supervised GANs

Self-Supervised GANs

Population Based Training

Population Based Training

Show, Attend and Tell

Show, Attend and Tell

Siamese Neural Networks

Siamese Neural Networks

WaveGAN Explained!

WaveGAN Explained!

VAE-GAN Explained!

VAE-GAN Explained!

Evolution in Neural Architecture Search!

Evolution in Neural Architecture Search!

AI Research Weekly Update August 18th, 2019

AI Research Weekly Update August 18th, 2019

Weight Agnostic Neural Networks Explained!

Weight Agnostic Neural Networks Explained!

AI Research Weekly Update August 25th, 2019

AI Research Weekly Update August 25th, 2019

Neuroevolution of Augmenting Topologies (NEAT)

Neuroevolution of Augmenting Topologies (NEAT)

AI Research Weekly Update September 1st, 2019

AI Research Weekly Update September 1st, 2019

Randomly Wired Neural Networks

Randomly Wired Neural Networks

The video teaches how to improve generative models using Distribution Augmentation, a technique that conditions models on transformations in a multi-task learning way, and explores its benefits and applications in various scenarios, including image generation and sequence modeling. This technique has been shown to improve performance, particularly with more parameters, more augmentations, and less dropout. By applying data augmentation, models can learn to generate more realistic and diverse dat

Key Takeaways

Apply data augmentation to generative models using Distribution Augmentation
Condition models on transformations in a multi-task learning way
Use auto-regressive generative modeling for high-resolution image generation
Regularize deep generative models with more data augmentation
Increase the number of parameters and the strength of the data augmentation for better performance

💡 Data augmentation injects inductive priors into deep learning models, and more data augmentation is a more interesting way of regularizing models compared to dropout

🔒 Pro feature: Ask AI to explain this lesson →

More on: LLM Foundations

View skill →

Getting Started with Vertex AI Gemini 1.5 Flash

I TRAINED AN AI TO SOLVE 2+2 (w/ Live Coding)

I TRAINED AN AI TO SOLVE 2+2 (w/ Live Coding)

How to use the ChatGPT API with Python!!

How to use the ChatGPT API with Python!!

Nicholas Renotte

Gemini 2.5: Create an interactive plot of economic data

Gemini 2.5: Create an interactive plot of economic data

Google DeepMind

LangChain Chatbots: Building a Personalized AI Assistant

LangChain Chatbots: Building a Personalized AI Assistant

Analytics Vidhya

Auto-generating meeting notes with Python

Auto-generating meeting notes with Python

Related Reads

Want to get started with deep learning

Get started with deep learning by leveraging resources like Andrew Karpathy's playlist and frameworks such as TensorFlow or PyTorch

Reddit r/deeplearning

Building a Deepfake Detector From Scratch — What Nobody Tells You

Learn to build a deepfake detector from scratch and understand the challenges involved in detecting AI-generated fake media

Medium · Deep Learning

Unfolding the Meandering Path: High-Dimensional Invariance and the Flat 2D Plane of Neural…

Learn about high-dimensional invariance and its relation to the flat 2D plane of neural networks, and how to apply these concepts to improve model performance

Medium · Deep Learning

Implementing Neural Style Transfer from Scratch: The Project That Started It All

Learn to implement Neural Style Transfer from scratch and understand its significance in deep learning

Medium · Deep Learning

Chapters (10)

Beginning

1:36 Data Augmentation in Computer Vision

4:20 Challenge of Data Aug in Generative Modeling

6:24 DistAug, condition on augmentation embedding

8:20 Start of Sequence token embedding

11:20 Data-Dependent Regularization and Multi-Task Learning

12:32 Examples of Generated Images and Nearest Neighbors in the original dataset

13:34 Benefits of Scale - ImageGPT-2?

15:23 Regularizing Deep Learning, Data Augmentation vs. Dropout / Weight Decay / L2 Re

16:53 Why does Rotation augmentation work better than Jigsaw?

Image Classification with ml5.js

The Coding Train