Xiuyu Li - Q-Diffusion: Quantizing Diffusion Models
Key Takeaways
Xiuyu Li presents Q-Diffusion, a framework for quantizing diffusion models to make image generation more efficient. He discusses post-training quantization, time-step-aware calibration data sampling, and split quantization as ways to mitigate the challenges of quantizing diffusion models. The presentation also covers applying Q-Diffusion to Stable Diffusion and its ability to retain perceptual quality at 4-bit weight precision.
Full Transcript
[Music] Hi everyone, thank you for joining us today. We have a special guest speaker with us. Xiuyu is a PhD student at Berkeley AI Research, and his work is in efficient deep learning algorithms for large language models and generative models. Today he will be presenting his latest research on quantizing diffusion models, which is a very interesting piece of work. So, over to you. Thank you so much for joining us.

Thanks for the introduction, and let me share my screen now. Today I'm happy to present our recent work, Q-Diffusion: Quantizing Diffusion Models. This is joint work with collaborators from Nanjing University, UIUC, and Peking University.

In the past years, diffusion models have shown great success in generating images with both high diversity and fidelity. We all know that Stable Diffusion, Midjourney, and DALL-E 3 are all very popular and can create some amazing pictures. However, the generation process can be pretty slow right now, which becomes a primary obstacle to broadening the applications of diffusion models. Prior work has been solving this problem mostly by reducing the number of steps required in the denoising process. The figures below are a very simple demonstration of how diffusion models work: iteratively adding noise to the image in the diffusion process, and iteratively removing that noise, starting from Gaussians, step by step in the denoising process to generate samples at inference time. You can see that this iterative process is quite redundant, and those prior works have leveraged various approaches to make it more efficient: progressive step distillation, which distills many steps into only one or two steps, or a few steps; or finding better
shorter solutions of the ordinary differential equation in the diffusion model, such as DPM-Solver++; or using other mathematical methods, like rectified flow in InstaFlow, or the recent consistency models proposed by OpenAI, which is the technique that DALL-E 3 is based on.

However, there is another important factor that has been largely overlooked, which is the fact that the noise estimation used in each iteration of the diffusion model is itself compute- and memory-intensive. Let's first look at the example of Stable Diffusion. The figure on the right is a simple illustration of how Stable Diffusion is formed. It has a U-Net backbone with about 860 million parameters, while the whole model, combining the U-Net with the text encoder and the VAE autoencoder, has about 1,000 million parameters in total. So you can see that the U-Net occupies the majority of the parameters of Stable Diffusion, and as long as we make this U-Net efficient, we can solve the major bottleneck in Stable Diffusion and make the whole diffusion model efficient.

And 800 million parameters is not a small number. It not only slows down inference but also poses crucial challenges in terms of high memory footprint. Moreover, a more recent and more advanced open-source model like Stable Diffusion XL has about three billion parameters, which is close to half the size of LLaMA 7B, one of the LLMs we usually talk about, where making the model efficient is considered very crucial. From this perspective, we can have a more specific understanding of why efficiency is so important in the diffusion model space.

If people have been paying attention to the progress of LLMs in the past year, you may have
noticed that quantization has become quite a popular technique to speed up LLMs and to reduce their memory requirements. Similarly, we can also leverage this technique in diffusion models.

In case people are not familiar with what quantization is, I'll use a very brief slide to go over the basic concept. In quantization, we basically convert weights and activations to lower-bit formats. The weights and activations of models are usually stored in FP32, or in FP16, which is used more commonly at inference time, and FP16 takes 16 bits of memory to store each floating-point value. However, we can actually use lower-bit formats like INT8 or even INT4, using just integers with eight or four bits to represent those weights and activations. We do this with a mapping from the range of the original values, represented in floating point, into a smaller range of values represented by integers. Obviously this can reduce the memory requirements, and if we use efficient matrix multiplication kernels for integers, it can also reduce and accelerate the compute. And this is just a pretty animation of what quantization is at a high level.

Okay. In this paper, we propose to use post-training quantization (PTQ) to directly quantize well-trained diffusion models without retraining, and we want to apply this PTQ approach to compress the noise estimation model. So what is special about diffusion models when it comes to quantization? There are actually some key distinctions between traditional PTQ use cases and PTQ in diffusion models, and these present both unique challenges and valuable opportunities. We want to leverage those opportunities and efficiently mitigate those
associated challenges with our proposed Q-Diffusion framework.

First, let's take a look at the conventional scenario in which people have usually used post-training quantization in the past. This usually involves a calibration process that employs either the original training data or synthetic data to enhance the accuracy of the quantized model. We need to calibrate the quantized model, adjusting it a little bit using those data, which is shown on the left of the figures. An obvious problem with this is that the data may be either unavailable due to privacy concerns, if we want to use the original training data, or poorly aligned with the target data distribution, if we just use synthetic data. Also, in the conventional case, the quantized model only needs to be run once at inference time to get the result of a single inference.

But in diffusion models, an interesting property is that we actually don't need synthetic data or the original training data when creating the calibration dataset, since at inference time a diffusion model does not take something from a dataset as input; it just samples a Gaussian. So we can adopt a unique calibration process: simply sampling Gaussians to generate calibration data with the full-precision model. This process is always data-free and aligns with the target data distribution, given that the model is well trained.

So this is the opportunity, but there are also challenges. The iterative computation in the diffusion process can lead to accumulation of quantization errors over multiple time steps, and this makes applying
conventional PTQ methods, especially with naively chosen calibration datasets, prone to a significant performance drop. As shown in the bottom right of the figures, if we don't deliberately sample our calibration dataset to calibrate the diffusion model, the model won't be calibrated into a good state, and the quantization errors will accumulate across all the time steps of the denoising process used at inference time. The bottom left shows the calibration data sampling approach that we use in the Q-Diffusion framework, which I will introduce in the following slides.

Let's take a look at this issue in greater detail. These figures show that solely relying on linear quantization can lead to error accumulation across steps, for which we can get some numbers, more quantitatively, in the right figure. If we just do round-to-nearest quantization without any calibration, this leads to a rapid increase in the mean squared error with respect to the full-precision outputs as we progress through the time steps. This effect is particularly pronounced at lower bit precisions: for INT5 and INT4, the mean squared error is much higher than in the INT8 case if we don't do any calibration. If we do some calibration, using the calibration dataset chosen by the Q-Diffusion framework, we can manage to keep the errors within an acceptable range: the INT4 errors become similar to the INT8 errors of naive quantization, which is pretty acceptable, as I will show in the following slides. This can be seen in the dotted lines in the figure.

So what's the reason
behind this? Why do we get these accumulated errors, and why can't we simply choose some random Gaussians and calibrate the model effectively? One underlying factor is that there is significant variation in the intermediate activations across different time steps. Given that the denoising occurs incrementally throughout the process, we observe that inputs at adjacent, or consecutive, time steps have relatively similar distributions, while inputs at distant time steps are distributed quite differently, as shown in this figure. These are the activation ranges across the different time steps. You can see from this figure that the changes are pretty gradual, and the numbers at the beginning are very different from the numbers at the end.

The previous figure is for the final output, and this pattern actually holds for the intermediate output at each layer: at neighboring time steps the difference is not that large, but over a wider range, at distant time steps, you can have pretty different activation distributions. This is actually quite intuitive, as the denoising process starts from a sample taken from an isotropic Gaussian distribution and the noise is gradually removed to form the image. So intuitively, the samples at neighboring steps won't be that different, but the samples at distant steps will be pretty different. This process is also quite uniform; the change won't be very abrupt across steps, as long as you are not using too few steps.

This led us to design a time-step-aware calibration strategy, which uniformly segments the time steps into fixed intervals
and draws an equal number of samples within each interval, as shown in the figure. This ensures that our calibration dataset has comprehensive coverage of the activation modes across the entire temporal spectrum, which creates a high-quality calibration dataset that ideally makes the model adapt to the different activation distributions across all the time steps.

The figure on the right is a high-level overview of our algorithm. I won't go over the specifics at this moment; if you're interested, more details can be found in our paper. But in essence, we partition the noise estimation network into distinct blocks, here blocks 1, 2, through N, and we conduct the calibration using a block-wise adaptive rounding strategy with the calibration dataset sampled uniformly across the segmented time-step intervals, as I just mentioned. This is done for both the weights and the activations.

So this is from the calibration dataset perspective: in diffusion models, we need to create the calibration dataset with the iterative sampling process in mind. A natural question here could be why we just do this uniformly, and whether there are better ways, since this sounds pretty naive. We actually did a lot of exploration from this perspective, and I will talk about it in the ablation studies later in the presentation.

Now let's move to the other part. Other than the challenges introduced by the iterative sampling inherent in diffusion models, the architecture of the noise estimation network used in the diffusion model presents its own
challenges. Usually a U-Net is used in the diffusion model, and we found that there are abnormal activation and weight distributions in the shortcut layers of the U-Net, where the shallow features and deep features are concatenated through those shortcuts. This creates very abnormal activation and weight distributions, and if you just quantize such a layer as a whole, you can hurt the quantization performance. As demonstrated in this figure, the first three shortcut layers in the U-Net can have very abnormal input activation ranges.

This is a further illustration that this pattern can be observed across different models and different datasets (for Stable Diffusion, it is simply observed at the model's inference time). In DDIM trained on CIFAR-10, Stable Diffusion trained on LAION, and latent diffusion trained on LSUN, we can all observe similar abnormal distributions in some of the shortcut layers of the noise estimation model. Sometimes the gap is large: in CIFAR-10 and in Stable Diffusion you can find that those activations are very abnormal, for example 1,000 while the others are below 200, or 300 while the others are below 50. But for models trained on other datasets it can be less severe; in LSUN-Bedrooms or LSUN-Churches it is not that abnormal. So we don't always need to address this issue, but once those abnormal patterns become significant, not addressing them becomes a big problem.

Because those shortcut layers merge the deep
and shallow features through concatenation, this abnormal distribution has a very clear pattern by nature: it is a bimodal distribution over the corresponding channels. Since it is just a concatenation, we propose a very straightforward technique to address this issue, which we call split quantization, which performs the quantization prior to the concatenation. As shown in the figure on the right, we split both the inputs and the weights and apply quantization to the split parts separately: X1 and X2 are activations, so we apply tensor-wise quantization to each, and we also quantize W1 and W2 separately, and since they are weights, we use channel-wise quantization. We only concatenate them after the quantization is performed. In this way the operation is mathematically equivalent to handling them as a whole, and it introduces only negligible additional memory or compute: just one additional group of scaling factors and zero points, which is basically negligible compared to the memory saving brought by quantization. But this significantly alleviates the issue.

So, up to this point, do people have any questions so far? Okay, then let's take a look at the results.

First we did some experiments in unconditional generation. I just want to remind you that this work is actually quite early; most of it was done last year, around this time last year. At that time Stable Diffusion hadn't been out for too long, so we were still using conventional tasks to evaluate models, and the space was not as amazing as what we see right now. I just
want to give you a heads-up about how fast this field has been evolving over the past year.

So first, in the unconditional generation case, we use the conventional datasets LSUN-Bedrooms and LSUN-Churches, which have a resolution of 256 × 256. Qualitatively, you can see from the figures that while the quality of images generated with linear quantization, by which I mean naive round-to-nearest quantization, degrades sharply under 4-bit precision, Q-Diffusion can largely retain the perceptual quality, introducing only imperceptible distortions. In the W4A8 cases for bedroom and church, the images are basically identical to the FP32 model's results.

Quantitatively, the results also reveal that Q-Diffusion outperforms traditional uniform PTQ approaches by a substantial margin across all tested resolutions and types of diffusion models. Other than round-to-nearest linear quantization, we also compare with another state-of-the-art PTQ method at the time, which is SQuant. We can see that at lower precision, such as 4-bit weights, Q-Diffusion has a very big advantage over those traditional PTQ approaches. They incur a significant loss in generation quality if applied naively, while with the Q-Diffusion framework there is only very small degradation in terms of FID and Inception scores.

So this is a brief overview of the qualitative and quantitative results for unconditional generation. Let's take a look at a more realistic task,
a task that people care about more, which is text-to-image generation. We used Stable Diffusion 1.4 at that time as the model to test on. In this figure, we use the classic prompt "a photograph of an astronaut riding a horse", and you can see that the Q-Diffusion quantized models generate high-quality images that faithfully represent the semantic information in both the W4A32 case and the W4A8 case, that is, 4-bit weights with FP32 activations, or 4-bit weights with 8-bit activations. I just realized that I forgot to cover the notation, but it's pretty straightforward, and hopefully there's no confusion about what W4A32 and W4A8 mean. In the linear quantization case, the images basically become very corrupted and the quality becomes very bad.

So this is a pretty interesting and promising result. It demonstrates that although we did not specifically design Q-Diffusion for classifier-free guidance in conditional text-to-image generation, it still works for text-to-image generation in Stable Diffusion. To the best of our knowledge, at that time this was the first study to successfully run Stable Diffusion at 4-bit precision; we didn't see anyone else achieving 4-bit at that time, and even now I don't see many works that can get to this low-bit precision. I will briefly go over other works in the later slides.

In addition to the generation quality, I want to point out a very interesting property that we observed during the experiments. If you look at the figures in the red circles, you can see that there are some interesting differences in the semantics of those
generated pictures. They can differ in different ways: it can be color, it can be environment, it can be alignment, it can be posture. In the first row, the person on the horse wears a different color of astronaut suit. In the second row, the picture is put into a very different environment. In the third row, the FP32 model actually isn't faithfully aligned with the prompt, while our 4-bit quantized models have the correct semantics aligned with the prompt. And in the last row, the posture of the horse is different.

So this is a pretty interesting property we observed when doing the quantization experiments. It isn't really the focus of this work, and we didn't investigate this property in detail, but I think it may shed some light on interesting research directions like controllable generation, or improving the quantization, or maybe even using quantization not only as a way to improve efficiency but also as a way to improve generation quality. This kind of makes some sense, as it's not uncommon to use general quantization techniques in computer vision; we have VQGANs and things like that. So maybe quantization can sometimes be helpful, but this is just something I wanted to mention; we didn't really go into the details.

And here are more examples with another prompt, "a puppy wearing a hat". We observe similar results to the other case that I just showed, and we can also see that there are some changes in the semantics of the images generated by the quantized
models. Sometimes the differences are large and sometimes they are small. It's hard to tell which one is better: both have high quality and both align with the prompt, but they are semantically different, which is pretty interesting.

Here are some experiments evaluating our quantized Stable Diffusion quantitatively, using FID and CLIP scores computed on the MS-COCO dataset, following the approach of the official SD 1.5 report. Q-Diffusion maintains similar CLIP scores and has negligible increases in FID, so also similar FID scores, compared with the FP32 case, which is represented by the dashed line in both figures, where the FID is 20 and the CLIP score is about 0.2865. The interesting thing about FID is that Q-Diffusion is actually achieving better results than FP32, which is quite weird. We attribute this to the fact that FID is not a good metric when the model becomes as good as Stable Diffusion and we want to evaluate the quality of its text-to-image generation. I think this has been pointed out by many other people in the diffusion model field as well. Right now evaluation is always a difficult thing, and those quantitative metrics can only be used as a reference. When the models are already pretty good, small variations in the numbers cannot reflect whether one model is better than the other. That said, a very high FID score or a very low CLIP score definitely indicates that
the model is not good, which is the case for round-to-nearest linear quantization. So this provides more evidence that our methods are pretty effective when applied to Stable Diffusion.

We also performed some ablation studies of the effectiveness of each of the techniques we proposed that I mentioned in the previous slides: the calibration with uniform sampling, and the shortcut splitting. When we only quantize the weights and keep the activations in full precision, the calibration is very effective and shortcut splitting also helps a bit; both help, but shortcut splitting is not that crucial. But when we also quantize the activations, we have the abnormal distributions in both weights and activations, and we find that calibration alone is not enough to recover the performance of the quantized model, while simply applying the shortcut-splitting quantization recovers the performance quite well. So this shows the effectiveness of both proposed approaches, and that they are useful in different cases.

Qualitatively, we also have these interesting figures. If we don't consider calibration and only do the shortcut-splitting quantization, in the 6-bit precision case, we already get some improvement: the teeth in the figure before applying the split are not in good shape, but after applying shortcut-splitting quantization the teeth of this person go back to normal. Then if we also apply the calibration, the other technique of Q-Diffusion, in the 4-bit precision case, the generated image is almost identical to the FP32 result; it aligns with the FP32 result very well. I don't think the eyes in the third figure actually have better quality than the eyes in the second figure, but they do align better with the full-precision result, showing that our quantized model is more faithful to the full-precision model.

As I mentioned earlier when we talked about calibration dataset sampling, we explored whether we could find more advanced sampling strategies than uniform sampling, since it is so intuitive and simple; can we actually do better? We tried many different sampling criteria. We tried STD, which assigns the weight coefficient of each time step by the pixel-wise standard deviation of the activations; the intuition is that we want to sample more data from the time steps with larger variance in their distributions, to better cover the whole temporal spectrum. We tried a variant of this STD method that uses the norm of the activations instead of the average of the pixel values. We also tried the semi-supervised learning method called Unsupervised Selective Labeling (USL). The original USL paper aims to select both representative and diverse samples for semi-supervised learning, and we thought this principle could also be applied to calibration dataset creation in the quantization case. The last method is normally distributed time-step calibration collection (NDTC), which is the method used by another concurrent work called PTQ4DM, another work that explores quantization in diffusion models. We didn't run experiments as extensive as for the
previous three that we proposed when we tried this NDTC method, but we did try reproducing the results described in the PTQ4DM paper. The results are a bit disappointing: we didn't get better results with any of the techniques we tried that are more sophisticated than uniform sampling. We actually just get slightly worse FID scores on CIFAR-10, and we fail to observe significant differences between those different sampling strategies.

We assume the reason behind this might be that uniform sampling is already good enough. We think a key to creating this kind of calibration dataset is to cover the entire temporal spectrum. Once we cover it, the calibration aligns with how diffusion is done: as long as we sample all the time steps uniformly, we already address the gradually changing activation modes across the steps. There might be some more fine-grained considerations, but they may not matter that much to the quantization quality. If we want to further improve the quantization quality, using a different calibration dataset might not be helpful; you need to do something else. As for calibration itself, doing it uniformly is already good enough.

As shown in the two figures on this slide, when we sample only part of the temporal spectrum, the results degrade significantly, as with the first 25, first 50, last 50, and last 25. The mid 50 has some coverage of the early time steps and some coverage of the later time steps, so it gets better results than
the other cases, but it is still worse than sampling the full range of time steps. This is illustrated in the figures on the right. In addition to exploring coverage of all the time steps, the figure on the right also explores some hyperparameter choices for the uniform sampling. The figures show that as long as we cover all the time steps, the results won't be that different, so we just choose one configuration that balances the number of samples required and the generation quality in terms of FID, which is denoted by the green star label.
We also did some exploration of combining Q-Diffusion with fast sampling approaches. Remember that at the beginning of this talk I mentioned there are two lines of work, two factors that we can tackle to make diffusion models more efficient, and previous work mostly focused on fast sampling. We wanted to see if we can combine Q-Diffusion, the method that tackles the noise estimation model, with those fast sampling approaches to join forces and boost the efficiency of diffusion models. We follow the instructions of the DPM-Solver++ paper and use the third-order DPM-Solver with either 20 or 50 denoising time steps. The 20 is basically for Stable Diffusion, since Stable Diffusion originally already uses just 50 denoising time steps, and after applying DPM-Solver++ we want to reduce this to 20. From the table on the left, we see that Q-Diffusion works decently when only quantizing the weights: the FID doesn't get much worse. But once we also apply activation quantization, the
results actually degrade significantly. We think this is because the activation distribution at each step changes significantly once the sampling trajectory is changed by applying DPM-Solver++, and this probably needs further calibration to align the quantized model with the changed activation distributions. The figures on the right illustrate some samples generated from Stable Diffusion with weight-only quantization and DPM-Solver++ applied to accelerate sampling. From the figures, most of the quality is preserved after applying DPM-Solver++ and Q-Diffusion. This could be an interesting direction to explore in the future, when people want an end-to-end solution that exploits the efficiency of diffusion models to the fullest extent.
So those are all the experiments we have conducted in terms of generation quality, the theoretical memory savings, and the precision we can theoretically reach without losing quality when applying quantization; these are all the results in the paper.
I would also like to discuss the implementation a bit, since if we really want to make this work, we need to implement it in an end-to-end fashion to get real speedup and memory savings. All the previous results were obtained by simulated quantization, which people sometimes call fake quantization. It doesn't really give any speedup or memory savings; it just applies the scaling factor, the zero point, and the integer casting to the inference process to mimic what real quantization would be like. We can get the same
generation quality as when we apply end-to-end quantization, but it's just a simulation. There are already some existing implementations of quantized Stable Diffusion, but they all have some pretty large limitations. There are two open-source ones: Apple open-sourced a quantized Stable Diffusion implementation in Core ML, and there's also an Intel OpenVINO implementation. Many people have asked me what the speedup and memory savings will be when this quantization is really implemented, and I think these two figures can give people some sense that there will be real speedup when applying quantization. The memory savings will also be very significant, as shown in this OpenVINO figure, which is for a Pokémon variant of Stable Diffusion: it reduces the memory footprint by four times when doing 8-bit quantization, which is just the theoretical limit, and we can go even further when doing 4-bit. So these are some existing implementations of quantized Stable Diffusion, which can give people some sense of how Q-Diffusion will accelerate diffusion models when it's implemented in an end-to-end fashion.
For future work, there are many directions that people can pursue. First, we need more comprehensive and dedicated studies on conditional generation with Stable Diffusion. I want to emphasize again that this paper was finished pretty early; at that time we didn't really consider conditional generation that much, and we just used Stable Diffusion as something like a proof of concept. But in order to make quantization really useful in real life, we need to test on more advanced models like Stable Diffusion
XL, and we need to try a wider variety of tasks like image editing, inpainting, outpainting, all those things. We also need to address different potential issues in those more realistic use cases, like how quantization will respond to non-trivial prompt engineering. If you have played with DALL-E 3 or Midjourney, which is a more suitable example, you may have the feeling that you usually need a very long prompt that adds many constraints before you finally get the results you expect. There is also combining the quantized model with ControlNet and with different customized LoRA modules, and seeing how the performance is affected by quantization. So this is the most important future direction.
Also, if you paid close attention to the figures I displayed in the previous slides, it can actually be observed that there are still some artifacts when activation quantization is applied in addition to weight quantization. This can be further improved: there could be more dedicated quantization techniques for diffusion models that address those issues and further boost the generation quality.
There are already many follow-ups. For the sake of time I won't talk about each of them, but the key idea is that each of these methods, like PTQD, ADPD, TDQ, and TFMQ-DM, addresses some different shortcomings of Q-Diffusion to better study diffusion model quantization and boost the performance. I think more could be done here, since I don't think any of these methods really addresses the previous point I made. They mostly just follow our paper's evaluation protocol. They have some improvements, but we still don't know
how they will perform on those more realistic use cases, like non-trivial prompt engineering or with ControlNet added.
We can also try leveraging other advancements in quantization in general, such as using SmoothQuant to alleviate the activation outliers in the linear layers of Stable Diffusion. In Stable Diffusion XL the linear layers are actually quite large, so using SmoothQuant could potentially be helpful. Another option is using other quantization formats that have recently been attracting more attention after the Hopper architecture was proposed by NVIDIA, like the FP8 or FP4 formats, rather than INT8 and INT4. This can potentially also boost the generation quality or make quantization easier. Those are all valuable future directions to explore.
In the end, I just want to bring up the topic of end-to-end quantization implementation again, as I talked about in these slides. This is not a research problem, but it's very important in terms of adoption, in terms of making this really useful. Right now I think people all have the feeling that a proposed method will only be useful when it's adopted by the community, which means we need to get a working implementation for the community. There is no open-source end-to-end quantized diffusion model implementation for CUDA devices. Previously I brought up the Apple Core ML library and OpenVINO, but Apple Core ML only did weight quantization, and they are actually using non-uniform quantization, so this cannot be done for activations, and it also only works on Apple devices like MacBooks or iPhones. OpenVINO is also only for Intel CPUs. For diffusion models, people will still prefer to use GPUs for inference, and maybe some lower-end GPUs,
and the purpose of quantization is to make those diffusion models work on lower-end GPUs. Many people have brought this up in the issues on our GitHub, and implementing this was actually on our roadmap, but we later found that it is very non-trivial to implement. The figure here is something I put in an issue on our GitHub. It's non-trivial to implement this end-to-end quantization for both weights and activations, since a diffusion model is not like LLMs: it doesn't only have linear layers, it also has conv layers. Surprisingly, there's no good open-source implementation of even INT8 conv layers; even for native INT8 conv layers we need to put in some painful effort to make it work. One path I was considering is to use CUTLASS, which is a customized CUDA library created by NVIDIA, to implement the INT8 conv kernels and also the INT8 linear kernels. But this requires non-trivial effort, and right now we are very overloaded with other pressing commitments. So this is an important topic, but we don't know when it will happen. Once we have time we definitely want to do this, but other than that I just want to bring this up and let people know that this is a very important thing that, surprisingly, is still lacking in the community. If anyone is interested, feel free to contact me to discuss how this can be done.
So that is all of my presentation. Thank you for listening, and please feel free to ask any questions about my talk.
Great talk. I had a small question. You mentioned that there are abnormal activations in the diffusion model, so can you shed some
more light on why we have these abnormal activations? Can we fix it, and will that improve the quantization process?
Yeah, sure. I just want to make sure I understand the question clearly: you're asking whether we can fix these abnormal activations without changing the quantization process. That's a good question. We haven't thought about this too much, since the shortcut splitting technique is pretty simple to do. One thing I can immediately think of, which may not be useful in this case but solves a similar problem, is the SmoothQuant technique, which can be applied to linear layers and mitigates abnormal values in either the weights or the activations by smoothing these abnormal channels across both the weight and activation matrices. But I think the diffusion model case is not the same. These abnormal distributions can also be categorized as part of the general problem of outliers in quantization research. Outliers are always a problem; quantization research is basically all about how to deal with outliers, and dealing with them is usually not easy. I think shortcut splitting is already an easy way of addressing these outliers, or these abnormal activations, compared to other approaches to dealing with outliers in the previous quantization literature, since we have this explicit pattern of concatenation. Does this answer your question?
Yeah, thanks.
Any other questions? There's a question in the chat.
That's a good question. Currently my research is actually focusing on some other things, and I think right now we don't
have a plan to further explore this difference in the lower-level semantics between Q-Diffusion and the full-precision (FP) models. But I do think this is an interesting research topic, so feel free to discuss it with me if you are interested; right now we just don't have a plan to work on it ourselves.
All right, I think we can end the session. Thank you so much.
Yeah, thanks Amad for organizing this event.
My pleasure. Take care, everyone. Bye bye.
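The simulated ("fake") quantization mentioned in the talk, where a scaling factor, a zero point, and integer casting are applied during inference to mimic real quantization, can be sketched roughly as follows. This is a minimal pure-Python illustration under assumptions, not the Q-Diffusion code: the names `fake_quantize` and `compression_ratio`, the simple min/max range estimate, and the 32-bit baseline are all hypothetical choices made for this example.

```python
def fake_quantize(values, num_bits=8):
    """Quantize floats to unsigned integers, then map them back to floats.

    Returns (dequantized_values, scale, zero_point). No integer kernels are
    used, so this only mimics the numerical effect of quantization.
    """
    qmin, qmax = 0, 2 ** num_bits - 1
    # Assumption: plain min/max calibration, extended to include zero so that
    # 0.0 is exactly representable.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # fall back to 1.0 for all-zero input
    zero_point = max(qmin, min(qmax, round(-lo / scale)))

    dequantized = []
    for v in values:
        q = round(v / scale) + zero_point      # integer casting
        q = max(qmin, min(qmax, q))            # clip to the integer range
        dequantized.append((q - zero_point) * scale)
    return dequantized, scale, zero_point


def compression_ratio(num_bits, baseline_bits=32):
    """Theoretical storage reduction relative to an assumed 32-bit baseline."""
    return baseline_bits / num_bits


if __name__ == "__main__":
    activations = [-1.0, -0.5, 0.0, 0.25, 1.0]  # made-up example values
    for bits in (8, 4):
        deq, scale, zp = fake_quantize(activations, bits)
        worst = max(abs(a - b) for a, b in zip(activations, deq))
        print(bits, "bits: scale", scale, "zero point", zp, "max error", worst)
        print("theoretical compression:", compression_ratio(bits))
```

Because the values are still stored and multiplied as floats here, the sketch reproduces the accuracy impact but none of the speedup or memory savings, which is the gap the talk points out between simulated and end-to-end quantization.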
Original Description
Xiuyu Li presents Q-Diffusion: Quantizing Diffusion Models
Xiuyu Li is a Research Intern at Meta and a Ph.D. student at Berkeley AI Research (BAIR) at UC Berkeley, advised by Prof. Kurt Keutzer. Previously, he received a B.A. in Computer Science and Math from Cornell University.
Details: "Diffusion models have achieved great success in image synthesis through iterative noise estimation using deep neural networks. However, the slow inference, high memory consumption, and computation intensity of the noise estimation model hinder the efficient adoption of diffusion models. Although post-training quantization (PTQ) is considered a go-to compression method for other tasks, it does not work out-of-the-box on diffusion models. We propose a novel PTQ method specifically tailored towards the unique multi-timestep pipeline and model architecture of the diffusion models, which compresses the noise estimation network to accelerate the generation process. We identify the key difficulty of diffusion model quantization as the changing output distributions of noise estimation networks over multiple time steps and the bimodal activation distribution of the shortcut layers within the noise estimation network. We tackle these challenges with timestep-aware calibration and split shortcut quantization in this work. Experimental results show that our proposed method is able to quantize full-precision unconditional diffusion models into 4-bit while maintaining comparable performance (small FID change of at most 2.34 compared to more than 100 for traditional PTQ) in a training-free manner. Our approach can also be applied to text-guided image generation, where we can run stable diffusion in 4-bit weights with high generation quality for the first time." abs: https://arxiv.org/abs/2302.04304
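The split shortcut quantization idea named in the abstract can be illustrated with a small sketch. It assumes, for illustration only, that the two tensors concatenated at a shortcut have very different value ranges, and compares quantizing the concatenation with one shared scale against quantizing each part separately. The helper names (`quantize_dequantize`, `joint_quantize`, `split_quantize`) and the synthetic `shallow` and `deep` values are hypothetical and not taken from the paper's code.

```python
def quantize_dequantize(values, num_bits=4):
    """Uniform asymmetric quantization followed by dequantization (simulated)."""
    qmax = 2 ** num_bits - 1
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / qmax or 1.0
    zero_point = max(0, min(qmax, round(-lo / scale)))
    out = []
    for v in values:
        q = max(0, min(qmax, round(v / scale) + zero_point))
        out.append((q - zero_point) * scale)
    return out


def mse(a, b):
    """Mean squared error between two equally long lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def joint_quantize(part_a, part_b, num_bits=4):
    """Quantize the concatenated tensor with one shared scale and zero point."""
    return quantize_dequantize(part_a + part_b, num_bits)


def split_quantize(part_a, part_b, num_bits=4):
    """Quantize each part with its own scale and zero point, then concatenate."""
    return quantize_dequantize(part_a, num_bits) + quantize_dequantize(part_b, num_bits)


# Synthetic stand-ins for the two concatenated feature groups (assumed ranges):
# one group stays within about +/-0.09, the other within about +/-9.
shallow = [(-1) ** i * 0.01 * (i % 10) for i in range(64)]
deep = [(-1) ** i * 1.0 * (i % 10) for i in range(64)]
original = shallow + deep

if __name__ == "__main__":
    print("joint MSE:", mse(original, joint_quantize(shallow, deep)))
    print("split MSE:", mse(original, split_quantize(shallow, deep)))
```

With a shared scale, the wide-range group dictates the step size and the narrow-range group collapses toward zero; giving each group its own scale avoids that, which is the intuition behind splitting at the concatenation.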
This session is brought to you by the Cohere For AI Open Science Community, a space where ML researchers, engineers, linguists, social scientists, and lifelong learners connect and collaborate.