Diffusion Models for Generative Arts | DataHour | Analytics Vidhya

Analytics Vidhya · Intermediate · 🎨 Image & Video AI · 3y ago

Skills: Multimodal LLMs 90% · CV Basics 80% · Modern CV Models 70%

Key Takeaways

The video discusses diffusion models for generative art, covering background topics such as GANs, VAEs, and Markov chains, and demonstrates how to build small diffusion models using tools like TensorFlow and PyTorch.

Full Transcript

Good morning and good afternoon, everyone. I hope you all know what this session is going to be about. First of all, I would like to thank AV, Analytics Vidhya, for organizing this, and also the people who are here. We are going to cover a lot of concepts today related to diffusion, so let's get started. I hope you can see the screen.

We are going to cover diffusion, but before that we want to know what diffusion is and why we need to study it. The reason is that we have seen the advent of different generative art forms, mainly images and videos that seem surreal, all generated using artificial intelligence. But when we look in depth at how these are created, we find that they are nothing but simple models working on, or trying to minimize an objective over, certain latent spaces.

I did not want to create a PowerPoint presentation because that is very cliché, so I wanted to dive deep into the concepts with something like a notepad or a simple Paint canvas, so that you can understand the topics well.

From what we know about GANs, they are nothing but a kind of minimax game with two components, a generator and a discriminator. In adversarial training the generator and the discriminator play a zero-sum game in which the generator is always trying to outsmart the discriminator. We have seen that GANs produce many kinds of beautiful images: generative art from CycleGAN, DCGAN, different kinds of convolutional backends with GANs, and also the different StyleGANs that we
have in place as of now. But there is a drawback: GANs are heavy for any hardware to compute. The second thing is that convergence takes a lot of time.

Now I can ask you a question that is quite popular in interviews: how can you say that GANs converge much slower than a discriminative model, for example a transfer-learning-based model? You can say that since GANs always play a minimax game, the generator is always trying to take advantage of the log loss of the discriminator and vice versa, and these conflicting objectives of the generator and the discriminator create a lag in convergence. So, first of all, GANs are hardware heavy. It doesn't matter if you use TPUs or GPUs, they are still hardware heavy, and for large image sizes and large batch sizes you are going to have quite a long training time. The generalization is quite fair when we use modern versions of GANs, say StyleGAN, but in the initial stages, if you train for only 50 or 60 iterations, it may not give you good results.

So we come to another part, where we think about how to extend this GAN concept to build something better, and the concept of diffusion actually comes from there. I think many of you have heard of autoencoders, or AEs. What are autoencoders? They are a kind of dual-sided funnel model, like this, where you pass in an input latent space, or an input image, in the hope of reconstructing the same input image. Let me see the chat. Let me zoom my screen; this is
quite visible now, I guess, so that you can see. Thanks for the heads-up.

In the case of autoencoders, we are always trying to reconstruct something given an input latent space. But there is a difficulty: how do we reconstruct something when all we have is a distribution? Say we have an input image. Machines do not understand it as an image; machines understand it as a distribution. So how do we sample an output distribution given an input distribution? That is a very complicated mathematical problem, because you may not know what the other distribution, the noise distribution, is. That is where variational autoencoders come into the picture. Variational autoencoders introduce some amount of noise into the dual-sided funnel model so that the model can perform backpropagation more efficiently to reconstruct the input latent space. It is like adding a catalyst to a chemical reaction: you want to regenerate something, so instead of adding only positive samples you add some negative samples too. That is the concept behind variational autoencoders, let's say single variational autoencoders.

I found a good blog explaining this. First of all, this is Lilian's blog. It is quite mathematical and I won't go into those details in depth; if you have time you can go through it. I will be sharing all the resources in the GitHub repo, and I will be sharing another blog which is quite good for grasping the concepts.

What happens is that when you have a variational autoencoder, you are provided with an input sample. That input sample is, let's say, any image that you are getting as input.
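The "dual-sided funnel with added noise" idea described above can be sketched in a few lines. This is a minimal illustration, not the speaker's model: the layer sizes, the random linear weights and the function names (`encode`, `sample_latent`, `decode`) are all assumptions made for the example, and nothing is trained.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x, w_mu, w_logvar):
    """Narrow end of the funnel: map an input to the mean and
    log-variance of a Gaussian over the latent code."""
    return x @ w_mu, x @ w_logvar

def sample_latent(mu, logvar, rng):
    """Inject noise the VAE way: z = mu + sigma * eps, eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def decode(z, w_dec):
    """Wide end of the funnel: map the latent code back to input space."""
    return z @ w_dec

# Hypothetical sizes: a flattened 8x8 image squeezed into 4 latent dimensions.
input_dim, latent_dim = 64, 4
w_mu = rng.standard_normal((input_dim, latent_dim)) * 0.1
w_logvar = rng.standard_normal((input_dim, latent_dim)) * 0.1
w_dec = rng.standard_normal((latent_dim, input_dim)) * 0.1

x = rng.random((2, input_dim))        # a batch of two fake "images"
mu, logvar = encode(x, w_mu, w_logvar)
z = sample_latent(mu, logvar, rng)    # noisy latent code
x_hat = decode(z, w_dec)              # reconstruction, same shape as x
```

A plain autoencoder would pass `mu` straight to `decode`; the only change a VAE makes at this point is the sampled noise in `sample_latent`.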
That is generally modelled as a probability distribution conditioned on your hyperparameters, and mathematically these are considered as noise. So this p_θ(x | z) is the latent space of your input, which we are trying to synthesize or reconstruct using variational autoencoders. Then we have p_θ(z), which is our noise. Generating this distribution for the noise is a little tricky because mathematically we cannot evaluate it directly: this is your input sample space conditioned on your noise, and this is the noise distribution, and selecting this part is one of the trickiest problems in statistics.

So what do variational autoencoders do? The interesting part is that they select another conditional model, known as q_φ, which conditions the noise variable on your input latent space. Why is this required? Since this is difficult to generate, I can multiply and divide by this particular value, which is nothing but the distribution of your noise conditioned on your input latent space, q_φ(z | x). This is the entire trick that is driving diffusion. It is a kind of workaround, an alternate pathway. Say that in a chemical reaction you do not have the catalyst. What do you do? You take another, reversible reaction and take the agent from there to use as the catalyst in your forward reaction. So this is a kind of reversible workaround, and this particular equation is actually the driving force behind any diffusion model that you see. Nowadays you have so many complex models where you take a prompt and you generate an image, you generate an audio clip, you
generate a video and whatnot. All of these are driven by this particular equation; this is the fundamental equation of any kind of diffusion network.

Moving ahead, always remember that when we are trying to reconstruct an input latent space, the output latent space, in other words the output image, has to be as close as possible, semantically, to the input image. Otherwise they would be two different images, which is not what we want. Whenever you want two statistical distributions to resemble one another, you measure something called a divergence between the two.

Let me summarize what we have learned so far; let me write it in a notepad so that it is easier for you to follow the flow. First, GANs are primarily a zero-sum game, which can be used to generate good images and whatnot, but at the expense of your hardware and your training time. That is why, although GANs are popular, you have to train them for an extensively long period of time. So what are the alternatives? Second, autoencoders are nothing but a dual-sided funnel structure used for reconstructing your images: reconstruction of input latent spaces, which I'm writing as LS. To train autoencoders efficiently, we come to the third part, the VAE, where we plug some noise into the conditional probability of your input latent space to generate an output sample which is as close as possible to the input
sample. In other words, we add some noise, generated from a conditional probability on an input sample, to generate your output. One of the important conditions is that the output should resemble the input, which means the divergence between your output and your input should be as small as possible. I'm sure some of you, if not many of you, have heard about one of the most popular divergence measures used in statistical training, the KL divergence. When you have two statistical distributions, say p and q, you can measure the KL divergence using this equation. As I said, single-sided variational autoencoders work by introducing noise, conditioned as q of your noise given your input sample, by multiplying and dividing by it, which we saw. That is the reason why this equation is very important, and when you apply the log, this term is known as your reconstruction loss.

Then come iterated autoencoders, iterated VAEs. There are other terms, like hierarchical variational autoencoders, which are nothing but iterating over different latent spaces. This is quite interesting. By different latent spaces I mean that if you have an image of, say, three channels, RGB, all three channels are different latent spaces. When you apply a hierarchical, or I would say looping, variational autoencoder, you get a joint probability distribution which follows this pattern, just as I told you, multiplying and dividing by your q. This is actually the central diffusion equation for your looped variational autoencoders: p by q. Always remember that it is q_φ.
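As a small illustration of the KL divergence mentioned above, the sketch below computes it for two discrete distributions, plus the closed form commonly used in VAE losses for a diagonal Gaussian against a standard normal. The function names and the example distributions are made up for this illustration, and the discrete version assumes `q` is non-zero wherever `p` is non-zero.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) = sum_i p_i * log(p_i / q_i) for discrete distributions.
    Terms with p_i == 0 contribute nothing and are skipped."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def gaussian_kl_to_standard_normal(mu, logvar):
    """Closed form of D_KL(N(mu, sigma^2) || N(0, 1)), summed over dimensions,
    with logvar = log(sigma^2)."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, logvar))

p = [0.5, 0.5]
q = [0.9, 0.1]

same = kl_divergence(p, p)      # zero: the distributions are identical
forward = kl_divergence(p, q)   # positive: p and q differ
reverse = kl_divergence(q, p)   # generally differs from `forward`
```

Note that `forward` and `reverse` are not equal: KL divergence is not symmetric, which is why it is called a divergence rather than a distance.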
The ratio is p_θ over q_φ. We are trying to sample your output in such a way that it resembles your input. Now you might wonder: if we are generating outputs to resemble the inputs, where does the generative ability come from? When I prompt with a text, how does the new generation come into play? The answer is that whenever you write a text, you need to encode that text to generate an embedding space for it. Then you take a vision model, say a very simple ResNet-50 or something like that, trained on CIFAR or whatever dataset you have, and you have the embedding spaces for your different images. Whenever you type in a text, say "an image of a puppy", it will first generate the embedding space for that text. It will then compare that embedding with the visual embeddings generated from the trained ResNet-50 and take the one with the smallest distance. That distance can be anything: KL divergence, a sinusoidal loss, or any kind of inequality. That comparison of your different embedding spaces gives you what you actually see on the screen as a generative art form. So this is the concept behind using diffusion to blend in your textual embeddings and your image embeddings, and then reducing the distance, or finding the closest match, under some distance measure, whether a divergence measure or a standard distance measure.
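The embedding-comparison step described above can be sketched as a nearest-neighbour lookup. This covers only the comparison step as the talk describes it, not a diffusion sampler. The three-dimensional vectors, the labels and the choice of cosine similarity are assumptions for illustration; real text and image encoders produce much larger embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical image embeddings, as if produced by a trained vision model.
image_embeddings = {
    "puppy": [1.0, 0.0, 0.0],
    "car": [0.0, 1.0, 0.0],
    "kitten": [0.7, 0.7, 0.0],
}

# Hypothetical embedding of the prompt "an image of a puppy".
text_embedding = [0.9, 0.1, 0.0]

scores = {
    label: cosine_similarity(text_embedding, emb)
    for label, emb in image_embeddings.items()
}
best_label = max(scores, key=scores.get)  # closest image embedding to the text
```

Any other distance or divergence could be swapped in for `cosine_similarity`; with a distance, the best match would be the minimum rather than the maximum.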
Anything can be used to generate your final output.

You might wonder when this equation is used. The ResNet-50 which you trained on CIFAR or whatever dataset will try to generate latent spaces at certain intervals of time. Why? Whenever you write a text, that text generates an embedding space. What the image embedding will do is take each word at a time, one per time step, and try to generate embedding spaces. As you know, generating embedding spaces from any kind of vision model is quite expensive: it is time consuming and hardware intensive. So you need to find an efficient way, and that is where this diffusion process comes into the picture. The images that you see on the internet whenever you type in a text, which I think are taking social media by storm, are effectively produced with the help of diffusion.

So that is the concept, and that is most of the theory you have to learn: looping VAEs are nothing but diffusion. There are two things regarding looping VAEs. The first important thing is the forward path: diffusion has a forward pathway. It is like this image over here: you are trying to generate your images for successive time steps. Where do these successive time steps come from? They come from any kind of temporal input sample. Temporal input samples are input samples separated by time; this can be your text, your video, your audio and whatnot. So now you know that, given an input text sample, the vision model is actually generating different, I would say,
output latent spaces for every word that you see, for every time step. This is a diagrammatic representation of your diffusion models, diffusion VAEs. Your p_θ is your reconstruction; q is your inference. Inference means what the generated output is going to be. As you know, we are trying to get the inference, the output, from the input samples. But since this is a standard training process, you will be backpropagating, and during backpropagation we take the samples in reverse. We take samples from x_0, the first time step of the input latent sample space, and we try to generate samples for successive time steps: x_{t-1}, x_t, x_{t+1}, x_{t+2} and so on. This successive way of generating something, taking the previous sample and generating a new one, is also known as autoregression. So you can say that diffusion variational autoencoders, that is, generalized diffusion models, are autoregressive models for generating new samples. This is the entire concept behind a probabilistic model. These are linked inside my repo.

All right, diffusion models. Whenever we try to build diffusion models, always remember that it is recommended to use a Gaussian sample space to get a good reconstruction. This Gaussian sample space can be introduced in your Torch or TensorFlow layers: when you build dense or similar layers in Torch or TensorFlow, you initialize them with a simple Gaussian, or normal, distribution. This only adds to how your diffusion process works. There is a full walkthrough, a detailed derivation of that, which you can read at
your leisure, but this is the idea.

I also mentioned this conditional flow, which you have seen over here. This conditional flow is also known as a Markov chain. So what have we learned? This is an autoregressive model, which takes conditioned inputs to generate conditioned outputs, and it is also a Markov chain, a posterior Markov chain: all of these are posterior probabilities, and these are known as priors. So this is nothing but a simple Markov chain which depends on your latent input space to produce the output space. That is what a diffusion model is.

There is a question about using Hamming loss when an embedding z_φ is an inverse. Very good question. Jensen's inequality is for the lower bound. Hamming loss is, I guess, a simple point-to-point difference, so between z_φ and the inverse function of z_θ it would not have any effect, because in the standard VAE Jensen's inequality is just for keeping a kind of ELBO, a lower bound, as your constraint. When you apply the log on both sides of Jensen's inequality, you are effectively just introducing another concave function, because the log is always concave; always remember that. So Jensen's inequality is nothing but keeping a lower bound on your input samples given the noise that you add. The VAEs will be able to reconstruct properly; there won't be any issues. As I said, it does not matter what kind of loss or distance measure you use. You can use any kind of distance you know of; even sinusoidal will work. I'm assuming you are talking about the real space.
In the real space there won't be much of a difference; the VAEs will function as normal. I hope that answers your question. Very good question.

Now let's go into the coding part. This is a sample which is in my repo. This is a very critical concept and topic, and I know you will have a lot of questions and doubts after this session, because one hour is honestly very little for understanding diffusion. But I hope you got a global view of what diffusion is. Always remember diffusion and osmosis from chemistry. What does diffusion do? A fluid at a higher concentration passes through an orifice, at a constant velocity, to a space of lower concentration. The same thing happens here, except that you are diffusing your latent space at each time step of your input sample to produce the corresponding output. Think of it in that manner and it will be easier to understand.

This is a repo which I created, DataHour diffusion. It has been forked from François Chollet's TensorFlow diffusion implementation, with certain changes that I made so that you can understand it better. Before I go into the code: you have understood the concepts, but let me go through the flow of what is happening. The example will work, but let me go through it first.

There are different kinds of diffusion models. Let me create a chart so that it is easier to understand. Say this is the diffusion concept, D. Inside the diffusion concept you can have two broadly classified sections: one is unimodal and one is multimodal models. Both of
them use the diffusion concept. What comes under unimodal is any kind of unimodal autoencoder or VAE, any kind of quantized GAN, that is, any variational form of quantization, VQ-VAEs in other words, and StyleGANs. These are all your unimodal diffusions. When you come to multimodal models, you already know there are many: you have CLIP, you have DALL·E from OpenAI, you have Imagen from Google. All of these are multimodal diffusion models. What are multimodal models? You have two or more input samples, or latent spaces, and you are trying to produce two or more output sample spaces. In general this includes any kind of multimodal Transformer, MMT. I hope you know what Transformers are; they are quite popular, and I think by default everyone should read the "Attention Is All You Need" paper. That is actually what is driving all of today's models. So any kind of multimodal Transformer that you see, for speech, audio, video, vision and whatnot, CLIP, DALL·E and the models I mentioned, GLIDE from OpenAI, Imagen, the text-to-image model that we are going to see: all of these multimodal models are part of the diffusion process. That is the broad idea.

We have already had a high-level overview of what diffusion is, and now we are going to understand this entire flow. We are going to look into text-to-image. What is text-to-image? Text-to-image is a popular model by Soul, who said that if you can diffuse
your output latent space by taking samples from your input text at each interval of time, you can generate any kind of art form. And what is going to be the backbone of your image network? It can be anything Transformer based: a Vision Transformer, a residual block with cross-attention, any variation of an image model with Transformers. He went ahead with a U-Net, which is a vision model, or a Transformer model, which uses something called a Swin Transformer. We'll be going into that code in two or three minutes. It uses the Transformer to generate the embedding space for each time step. What text-to-image does is start from the input embedding space: you have p_θ, and you have to generate that noise term, q_φ, in the numerator and the denominator. It generates that for every time step and gives you an output sample.

Now, this is the interface. I've kept it the same as François Chollet's so that it is easier for you to understand; you can go through his repository as well, and I have a link to it. But let me go through the code. Let's start with what a simple autoencoder looks like. As I said, an autoencoder has an input and an output sample. What you do with the input in between is up to you: you can keep a simple point net and say that is fine, or you can add some simple attention mechanisms on top of it, as you choose. I added an attention block to the autoencoder model. Attention is nothing but this: always remember you have three matrices
which you want to multiply. You multiply your query and key matrices, apply a scaling to the result, typically dividing by a square root, take the softmax of that, and multiply it by your values. The reason we do attention is the same as in text: so that we can focus efficiently on small parts of the latent space. These query, key and value matrices are very important for generating a good latent representation compared with a standard convolutional model. That is what this call function does; it is just a basic attention block. You see Q multiplied by K, your query and key matrices; you do the square-root scaling; you apply the softmax; and then you multiply by V. That is generally how it happens in attention.

What I'm trying to do here is create an autoencoder which uses attention, and I'm going to use this autoencoder to generate my reconstruction sample. Let me go back a little so that you understand where this comes from. Let me see if they have this big equation. Yes: this is the big loss of what we are studying now, diffusion models. This part, log p_θ(x_0 | x_1), is your reconstruction loss. These two are known as your KL divergence terms, because, as we know, we are trying to generate successive time steps using the equation we have seen. Where was that double-integral equation? Yes, this one. When you add this over successive time steps, you
are going to get to that equation when we apply the log. Most of the log terms are KL divergence terms, which keep your output statistically close to your input, a variant of it without a very large difference. But your input sample must still be recovered; that is the reconstruction loss, which is very important. So this ResNet block with attention is what generates your reconstruction. As you can see, this is the model I have developed: nothing but a stack of intermediate attention blocks and ResNet blocks, one after the other. It is quite a big model, but because I have written it in this manner it looks quite small.

The next part is the CLIP model. CLIP is a multimodal model which uses diffusion. Being multimodal means it needs both your text embedding space and your image embedding space, or any other embedding space that you want: video, audio and whatnot. I'm going to build it as a layer; I think François Chollet has done similar things. Here the CLIP attention is a little different. It is the same attention, but there are many tensor reshaping operations, which are done because tensors can have different orderings. Generally this is done for NCHW formatting. NCHW is a channels-first image layout; note that TensorFlow's default is actually channels-last (NHWC), while PyTorch uses NCHW. Some orderings are less optimized than others, depending on the backend. Keeping that in mind, the CLIP attention takes the number of attention heads, which means it is multi-head attention.
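The attention block and the head-wise tensor reshaping described above can be sketched with NumPy. This is a simplified illustration rather than the code in the repository: it leaves out the learned query, key and value projections, and the function names and sizes (768-dimensional embeddings, 12 heads) are assumptions chosen for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def split_heads(x, n_heads):
    """(batch, seq_len, embed_dim) -> (batch, n_heads, seq_len, head_dim)."""
    batch, seq_len, embed_dim = x.shape
    if embed_dim % n_heads != 0:
        raise ValueError("embed_dim must be divisible by n_heads")
    head_dim = embed_dim // n_heads
    return x.reshape(batch, seq_len, n_heads, head_dim).transpose(0, 2, 1, 3)

def multi_head_attention(q, k, v, n_heads):
    """softmax(Q K^T / sqrt(head_dim)) V, computed independently per head."""
    q, k, v = (split_heads(t, n_heads) for t in (q, k, v))
    head_dim = q.shape[-1]
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(head_dim)
    weights = softmax(scores, axis=-1)   # each row sums to 1
    out = weights @ v                    # (batch, n_heads, seq_len, head_dim)
    batch, _, seq_len, _ = out.shape
    # Merge the heads back into a single embedding dimension.
    return out.transpose(0, 2, 1, 3).reshape(batch, seq_len, n_heads * head_dim)

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 5, 768))     # batch of 2, sequence of 5 tokens
y = multi_head_attention(x, x, x, n_heads=12)
```

The final step multiplies the softmax weights by the value matrix, and the intermediate tensors follow the (batch, heads, sequence length, head dimension) ordering.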
How many heads are there in the multi-head attention? Generally it is 12; you can use 24, 48, whatever, and your model complexity will keep increasing. You have your sequence length: when you encode your text, you need a sequence length, which is how much of it you are going to encode. And you have your head dimension: for your input images you need the dimension of your attention heads, which can be 64 by 64, 128 by 128 and so on.

This particular line 13 has been taken from the Vision Transformer. When you build a Vision Transformer, your embedding dimension should be exactly divisible by your number of heads. It's not that you cannot use 767 or something like that, but it is recommended that you do not. Whenever you apply attention in a Vision Transformer, or any kind of visual Transformer model, it happens patch by patch. If you do a modulo division and something is left over, that patch will not be attended to; it will waste GPU time during computation and you will get an inefficient outcome. So always take a value which you can divide exactly by your number of heads. This is just a recommendation; you can look through any Vision Transformer code online, in Torch or TensorFlow, it doesn't matter which package. This is the ordering which is generally followed: you have your batch size, your number of heads, your sequence length and the dimension of your heads.

Let me see if you have any questions. So, BERT is
not multimodal but unimodal. When we come to large language models, BERT is a discriminative model, so it is not essentially a diffusion model. GPT-3 and the GPT variations, on the other hand, can loosely be thought of as unimodal language diffusion models. BERT is not a diffusion model because it is a discriminative model: it is not autoregressive and it is not generating anything. I hope you understand what I'm trying to say. This is a very big field, and I want to make sure that everyone understands just the basics of it, so that it will be easier whenever you go through other sources. There are some good sources that I will be adding to the repo, with notebooks where you can work through these concepts. Hugging Face is always there; they have created a separate module for diffusion, called diffusers or something like that, so you can go through that as well.

As I was saying, in CLIP the attention is a little different, because it is not a basic attention that we are using. We have to adhere to the ordering: batch size, then the number of heads, then your sequence length, and then the dimensions of your heads, the patch actually. Then, as in any kind of Transformer layer, you need an encoder module, and you need your text Transformer model. So let's understand what the encoder module is. You might have heard about the "Attention Is All You Need" paper, and you have seen the BERT architecture and a few of the Transformer models. In those diagrams you always see that there is an encoder on the left-hand side of the image and the decoder on the right-hand side of the
image. So in the case of BERT and your discriminative models, any kind of BERT variation, you generally have encoder stacks and no decoder stacks. In the case of your GPT networks, any kind of GPT language model, you have decoder stacks and no encoder stacks. Always remember that the difference lies in how the tokens are processed, from left to right or from right to left. For BERT you have only encoder stacks. For CLIP we are going to focus on this part only. Its architecture is similar; by analogy, you can think of it as a discriminative BERT variation for multimodal models. So here we are going to have layer normalization, followed by the CLIP attention, then layer normalization again, and then we add feed-forward dense layers on top of it. That is just the CLIP model.

Always remember that CLIP has two facets: one for your image, as you saw, and one for your text. When you have something for your text, you need an encoder for your text as well, and this is where your text Transformer comes into the picture. Before that, you need text embeddings to pass into your text Transformer, and for that you need an embedding layer. Keras already has this embedding layer. You generate that embedding layer, you add your positional embeddings, and you pass the result through an encoder module. This encoder module is common to both your text and your image.

If I summarize, the CLIP model looks like this. You have a CLIP Transformer, a centralized module with attention enabled. You are passing in, and I'm writing V for vision, so that means image
samples, image embedding spaces, and you are also passing in your text embedding space. When you combine them you can generate a combined space for both. You will actually return two different logits: logits for your text and logits for your images. What you do with these logits is up to you. You can do text generation given an input image, or you can write some sample text and generate an image. This is a simple CLIP kind of model, and the same thing is written over here in a simple manner.

Now we come to the diffusion part. The diffusion model is where everything is centralized. Here again you have a kind of attention mechanism, and let me show you the reason. This is the CLIP model. Before that we saw your autoencoder model; I'm writing it as A over here, the autoencoder with attention enabled for reconstruction, and I'm writing the reconstruction model as P of 0. So I have these two separate models, and I need to combine them for each successive time step. That is where the diffusion model comes into play. Take the output from here, take the output from here, and sample them for T iterations: T minus 1, T minus 2, T minus 3, and so on. You can keep a small neural net here, like a simple dense net, or you can use any kind of variation on the network. Here I have used a simple ResNet with modified attention; you can keep a simple Transformer, that is not an issue. This will be your driver, which will coordinate your dual embedding spaces from here and your autoencoder
reconstruction embedding space. So here is your reconstruction part, which is coming into play, and these are your text and image latent spaces. You plug all of them in, and it will try to optimize the big loss function that we were seeing: this one, or in other words this one. This part is from the autoencoder, and these two parts come jointly from your dual embedding spaces, which are being squashed or processed by your diffusion process. That is what diffusion is doing, in a nutshell.

This is just a simple attention block, the same as what I have done before. In this case, following the original paper by Sohl-Dickstein, which is mentioned here and over here as well, a U-Net architecture is used for the diffusion process. The U-Net model is nothing but this kind of architecture, where you have your input blocks, your ResNet input blocks, and a spatial Transformer. This spatial Transformer is very simple, nothing complex: it is just a basic cross-attention Transformer block, which has the query, key and value, with GEGLU. GEGLU is a different kind of activation mechanism; it is also there in the layers. It is a somewhat complex activation. François Chollet has written this one, and it is a little complicated to understand because you are trying to get not only your GELU outputs but also your other input samples. But if we go back: this uses the GEGLU activation for your basic Transformer block, and this basic Transformer is nothing but your cross-attention mechanism. So in this case the spatial Transformer is nothing but this basic Transformer block iterated
over, I think, 12 times. So for the U-Net they have these spatial Transformer blocks one after the other, and they have a middle block which is nothing but a ResNet block, a spatial Transformer, and again a ResNet block, and this goes on. This is also sometimes described in terms of swish, because they use a swish activation function for the U-Net, but I'm going to keep that aside for now.

This U-Net model is the diffusion model that I was drawing. You can actually use a simple ResNet as well; it won't be an issue. If you remove your spatial Transformer blocks (you can keep the padded convolutions), it will still be a simple ResNet variant of your diffusion model. Or you can use a Vision Transformer, any kind of visual Transformer. You can plug in anything. Don't think it has to be U-Net only; U-Net is what was used, so that is why I'm using U-Net, but you can do anything.

From the entire combined loss that you get, you try to optimize this big loss. I'm not writing it here because it is quite big to write down, but this is the loss that you are optimizing. Now think of it like this. When you are optimizing this loss you already have the reconstruction. The text I am writing is my input latent space. Your autoencoder should keep in mind that my output image should resemble my input, the image implied by the input text; it should not diverge from it. The CLIP Transformer has the responsibility to generate coherent embedding spaces for your text and images, and the diffusion model, that is your U-Net model, has the responsibility to merge them in such a way that this function can be
optimized in real time. The reconstruction loss should be taken into account, and also the CLIP embedding spaces. Diffusion, the U-Net, does this for every successive time step. So that is just the diffusion model.

Stable Diffusion is the interface where this is actually happening. It calls this generate function for the number of time steps that you want: how many time steps you want to iterate over, how many steps, and what your generating principle should be. If you want to generate in a branching way (I forget the term for it; it's coming to my mind but not to my mouth), the kind of branching that you see in generative GPT or any kind of generative sampling, you can do that here as well. This is done for successive iterations, for however many iterations or time steps you want to run your diffusion model, the U-Net model. This decoder is nothing but your U-Net decoder, and it will generate the final output logits. Then you add your noise and perform that kind of optimization. The U-Net model is itself capable of generating a good output latent space, provided it has your CLIP embeddings and your input sample from your autoencoder. That is what is happening, in a nutshell.

This part is the actual interface: text-to-image, or image-to-image. I'm calling generator.generate, which comes from StableDiffusion. You can take a look at this generate function; it is there on GitHub. I'm just giving an overview of the flow. First you
encode your prompts (the prompt means your text), and you take your image tensor and pass it into the decoder. The self.decoder is effectively your U-Net decoder. If I go back to this image, you can think of it like this: this is actually an encoder Transformer, which I'm writing as E. It takes the reconstruction loss from your autoencoder, and the reconstruction, or decoding, is done over here in the U-Net, or in whatever model the diffusion uses. So this diffusion model is doing your decoding to give you a final representation. That is the idea behind how multimodal diffusion models work.

If you change the number of steps, the T in this equation, the summation over T will reduce, and if you reduce it the model will take less time to run. That is quite obvious.

So this is text-to-image. Here I wrote the prompt "Bin Laden dancing with Fitness", just for fun. I'm not sure what it has created; it seems quite odd to me. If you run more than 50 or 60 iterations, maybe 150 iterations, you will get something more believable. So this is one text-to-image model. There are several other models which I have added over here, which are from OpenAI, and they all have the same code base and the same concept that I have covered. The text-to-image that I have talked about is nothing but GLIDE. They have given it a separate name because they have changed the reconstruction loss to make it more photorealistic, or rather more realistic, but they are using more or less the same architecture. They have a CLIP-guided variant, which is nothing but using the CLIP architecture that
they have written, so you can go through that part of the code as well, written in PyTorch. This is the GLIDE paper, which is the source of your text-to-image: GLIDE is the source of the text-to-image CLIP generation, the multimodal models that you see.

If you get time, try to understand why Gaussian noise is used, because that is a very important concept mathematically. Why is Gaussian noise good to have? Why not any other distribution? Why can I not have a Poisson distribution, and why only a simple Gaussian distribution? It is there in the READMEs and the linked material, and if you have the mathematical aptitude you can understand it.

That is what I mainly wanted to cover. Since time is running out, I also wanted to take some questions, because it is very important that you understand. First of all, this diffusion model is not at all beginner friendly. The reason, as you might have understood, is that diffusion is based on a probabilistic model, a probabilistic Markov chain model. You are taking that part from Markov chain theory into your code, and you are not only building a statistical model, you are actually building a very complex neural network for multimodal models, so it becomes difficult to understand. The first way to tackle this is, I would say, to go through my code or go through François Chollet's code and understand the flow, step by step. First, text-to-image is called, which calls my StableDiffusion. StableDiffusion is a driver interface which calls your diffusion model. Whenever you learn about diffusion models, try to think about them from a wide angle. So this is my driver, the diffusion model, U-Net or something like that, and it takes your
embedding spaces from any multimodal model. I have used CLIP, but it can be any multimodal model; it can be DALL-E, it can be anything. If you replace this multimodal model with a single Transformer model, say a language Transformer such as BERT, GPT, ALBERT or T5, then you will have only text-to-text, a textual kind of generation using diffusion. So you can do anything with it. These are all pluggable parts: whatever modalities you choose, you can plug them in. I took multimodal models as the example because they are trending right now, and I wanted you to understand this broad concept, because the code is available everywhere.

There are also certain blocks of code which I have omitted because they are very complex; you can take time to understand them. The first step is to go through the blocks and get the broader view, and then go through the code one function at a time. You will be able to understand it in a much easier way. I hope that makes sense.

"Could you show that paper again?" Yes, Sohl-Dickstein, I think. It is there in the link, so you can go through them. There is one paper by Sohl-Dickstein et al. and there is one by Ho et al.; Ho et al. use a denoising factor, but that is beyond the scope here.

Are there any other questions? I hope I answered your question on this part, because I know that it is a little difficult to understand. The workflow I would suggest is: if you are new to attention and Transformers, first understand attention and Transformers, then understand CLIP, then your autoencoders, and then diffusion models.

Question: is Stable Diffusion equally capable?
Yes. The one-word answer is yes. The reason is the initialization function that you use: as I said, you use a Gaussian. It does not matter what optimization you use; say you use Fisher optimization in the input space, it will still run fine. That, as I see it, is why "stable" has been added to the name: you are modulating a Markov chain with a Gaussian space as your simulator. So it will run for all of them; it is equally capable in every real space. I hope that clears up this question. Thank you, have a great weekend. Bye.
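The head ordering described in the talk (batch size, number of heads, sequence length, head dimension) and the advice to keep the embedding dimension divisible by the number of heads can be illustrated with a small NumPy sketch. This is not the speaker's code: the function names `split_heads` and `attention`, and the sizes 77 and 768, are assumptions chosen only for the example.

```python
import numpy as np

def split_heads(x, n_heads):
    """Reshape (batch, seq_len, embed_dim) into (batch, n_heads, seq_len, head_dim)."""
    batch, seq_len, embed_dim = x.shape
    # The divisibility rule from the talk: no leftover dimensions per head.
    if embed_dim % n_heads != 0:
        raise ValueError(
            f"embed_dim={embed_dim} is not divisible by n_heads={n_heads}"
        )
    head_dim = embed_dim // n_heads
    x = x.reshape(batch, seq_len, n_heads, head_dim)
    # Move the head axis in front of the sequence axis.
    return x.transpose(0, 2, 1, 3)

def attention(q, k, v):
    """Scaled dot-product attention applied independently to every head."""
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Hypothetical sizes: 2 prompts, 77 tokens each, 768-dimensional embeddings.
x = np.random.default_rng(0).normal(size=(2, 77, 768))
heads = split_heads(x, n_heads=12)    # 768 / 12 = 64 dimensions per head
out = attention(heads, heads, heads)  # self-attention over the token axis
print(heads.shape, out.shape)
```

With 12 heads each head sees 64 dimensions. Calling `split_heads` with an embedding size such as 770 raises an error instead of silently dropping the remainder, which is one way to enforce the recommendation.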

Original Description

In this DataHour, Abhilash explains the flow of generative latent state representation from GANs and VAEs to diffusion methods. Since diffusion models are based on the Markov model, he builds small diffusion models to help you understand latent space representation from images. He also demonstrates how to analyze contemporary multimodal models such as DALL-E, CLIP, unCLIP, GLIDE and Imagen in the context of diffusion models, in order to replicate and create the "Generative AI" that is taking the NFT world by storm.

This video teaches how to build diffusion models for generative arts, covering topics such as GANs, VAEs, and Markov models, and demonstrates how to use tools like TensorFlow and PyTorch. The video provides a comprehensive understanding of diffusion models and their applications in generative arts.

Key Takeaways

Implement an autoencoder with an attention block for image reconstruction
Build a ResNet block with attention for intermediate reconstruction
Use the diffusion equation for successive time steps
Build a CLIP model as a multimodal model for text and image embeddings
Optimize the combined loss over the reconstruction and CLIP embedding spaces
Call the generate function for successive time steps
Use the U-Net decoder to generate the final output image
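
The takeaways about successive time steps can be sketched as a toy denoising loop. This is a minimal sketch under stated assumptions, not the code from the talk: the names `make_schedule`, `add_noise`, `generate` and `oracle` are hypothetical, the linear noise schedule and its end points are assumed values, and a trained U-Net is replaced by an "oracle" that knows the clean sample, so that the loop can be checked without any training.

```python
import numpy as np

def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule (assumed values): betas, alphas, cumulative alphas."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars

def add_noise(x0, t, alpha_bars, rng):
    """Forward process: jump from the clean x0 straight to the noisy x_t."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return x_t, eps

def generate(predict_noise, shape, num_steps, rng):
    """Reverse process: start from Gaussian noise and denoise for T steps."""
    betas, alphas, alpha_bars = make_schedule(num_steps)
    x = rng.standard_normal(shape)
    for t in range(num_steps - 1, -1, -1):  # T-1, T-2, ..., 0
        eps_hat = predict_noise(x, t, alpha_bars)
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        # Fresh Gaussian noise is added at every step except the last one.
        noise = rng.standard_normal(shape) if t > 0 else 0.0
        x = mean + np.sqrt(betas[t]) * noise
    return x

# Stand-in for a trained network: it recovers the exact noise for a known x0.
x0 = np.array([1.0, -2.0, 0.5])

def oracle(x_t, t, alpha_bars):
    return (x_t - np.sqrt(alpha_bars[t]) * x0) / np.sqrt(1.0 - alpha_bars[t])

rng = np.random.default_rng(0)
sample = generate(oracle, x0.shape, num_steps=50, rng=rng)
print(sample)
```

Because the oracle predicts the noise exactly, the final step lands back on `x0`. Lowering `num_steps` shortens the loop, which is why fewer time steps run faster, as the talk notes.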

💡 Diffusion models can be used for generative arts by combining text and image embedding spaces to generate new content
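
The idea of combining text and image embedding spaces and returning two sets of logits, one per image and one per text, can be sketched as follows. This is an illustrative sketch only: the function name `clip_logits`, the embedding width of 512 and the scale of 100 are assumptions, and random vectors stand in for the outputs of real image and text encoders.

```python
import numpy as np

def clip_logits(image_emb, text_emb, logit_scale=100.0):
    """Cosine-similarity logits between every image and every text."""
    # L2-normalize so that the dot product becomes a cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=-1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=-1, keepdims=True)
    logits_per_image = logit_scale * image_emb @ text_emb.T  # (n_images, n_texts)
    logits_per_text = logits_per_image.T                     # (n_texts, n_images)
    return logits_per_image, logits_per_text

# Random stand-ins for encoder outputs: 4 images and 3 prompts.
rng = np.random.default_rng(0)
image_emb = rng.standard_normal((4, 512))
text_emb = rng.standard_normal((3, 512))
logits_per_image, logits_per_text = clip_logits(image_emb, text_emb)
print(logits_per_image.shape, logits_per_text.shape)
```

Each row of `logits_per_image` scores one image against every prompt, and `logits_per_text` is the same matrix viewed from the text side.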
