Inside TensorFlow: TF Model Optimization Toolkit (Quantization and Pruning)

TensorFlow · Advanced · MLOps & LLMOps · 6y ago

Skills: ML Maths Basics 90% · ML Pipelines 80% · Supervised Learning 70%

Key Takeaways

The video discusses the TensorFlow Model Optimization Toolkit, focusing on quantization and pruning techniques that reduce model size and improve inference performance. It covers affine quantization, hybrid quantization, and post-training and during-training integer quantization, as well as pruning and its compatibility with quantization.

Full Transcript

Let's get started. Hi everyone, I'm Suharsh. I'm going to talk about the TensorFlow Model Optimization Toolkit, in which we have techniques for quantization and pruning. Feel free to ask questions or interrupt along the way; I want this to be super interactive. What we're going to talk about today is the high level of what quantization is, the challenges it poses, and why it matters, and then more of the specifics on what we are doing in TensorFlow to work on quantization and pruning.

So, overall, the idea of quantization is that you have your floating-point network with your inference graph, which is a floating-point program, and we're going to make modifications to this program, in the general sense that we take these floating-point calculations and make them lower precision. The goal is to get as close in accuracy as possible while providing some performance improvements. Usually, and this is very general, there's some function from the floating-point value to the integer value, there's a process to do the conversion to make it valid for a particular hardware, and then there are various algorithms to get the parameters needed for this function in the most efficient way. This is really general and may not make sense now, but we'll make it more specific later.

Do the same conversion functions work for mobile devices as well as specialized hardware?

No, and that's one of the challenges, and we'll get to all the challenges. That's a really good question.

Will you also be motivating soon why this is not as simple as a downcast from float to int?

Obviously. So why does this matter? The first thing is that ML programs have lots of parameters, and by using lower precision we can instantly get these models a lot smaller, which can help with memory bandwidth and the network cost of downloading models. Second, if you have all your calculations in integers,
you can have lots of optimizations that make the execution a lot faster. Third, integers are super power efficient, so on mobile this is really important. And then finally, this lets us explore a whole new avenue of hardware design where we can make custom chips, like the Edge TPU, that have integer operations, and this can get us cheap, power-efficient, fast hardware.

OK, so could you stop saying integer operations and say fixed-point operations?

I avoid it because it's only kind of fixed point. It is fixed point, but when I've said fixed point in the past, folks always say it's not truly fixed point, because fixed point implies a rescale every time you combine two values, and sometimes I get pushback. So I'm going to avoid it. I used to say quantization, and then people say there are a hundred types of quantization. The integers are the key here, because that's what's providing the acceleration, and that's specific to what we're doing in the TensorFlow stack. The specifics, I guess, will make sense after we go into the equations.

So why is quantization hard? This was your point: we have different chips, and each chip has its own specific trade-offs it chose to make. Some may only support int8, some may support int16, some may want power-of-two rescales. All these really one-off decisions make the deployment story, of how you take a general TensorFlow program and put it on one of these chips, really hard. For float we're starting to get to a world where we can just say float can run anywhere, but for these things there's not a lot of standardization on how to do this. The second reason it's hard is that it often requires custom tooling, because you need extra metadata that often can only be gathered by running inferences to know how to quantize values, and we'll get more into that in detail. So there's often an extra
step in the process. And then finally, for every specific ML problem we don't have a good answer for how quantization will affect it. You can use the same architecture but do something else for your particular task with the outputs of that architecture, and quantization may help or hurt. It's pretty empirical right now: we just try it and see, and we're still in the process of gathering a lot of examples. One of the things we need to work on in ML research is understanding these models better, to determine how quantization error will impact things.

So now, more into the details. Currently, what most hardware implements, and what the TensorFlow and TensorFlow Lite stack implements, is affine quantization. This is like us milking y = mx + b since seventh grade for the rest of our lives. Basically, you uniformly divide your range into fewer chunks than you had before and then bucketize the values, and this is effectively what all quantization is currently. We have different ways of gathering statistics to determine how to quantize. Going back to this picture for a second, we need some sort of min and max value to know how to quantize, so this implies that we need tooling to get this information. We have two types of tooling right now. There's during-training tooling, where you incorporate this as part of your training pipeline, and at the end of the day you have a trained model that has information on how to quantize it. You can also do this post-training, and we'll talk about the trade-offs later.

Actually, why are you not at the boundary of your possible values? Do you purposefully choose to leave some values out?

So yeah, the min and max: it's kind of an open question what the optimal min/max is, given a tensor. If I understood the question right, you could choose to put your min/max much smaller than the actual values seen, and you'll get some clipping.
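To make the rounding-versus-clipping trade-off concrete, here is a minimal pure-Python sketch of 8-bit affine quantization for a chosen min/max range. The helper names (`affine_params`, `quantize`, `dequantize`) are illustrative and are not TensorFlow APIs, and the sketch assumes the range has nonzero width.

```python
def affine_params(rmin, rmax, qmin=0, qmax=255):
    """Return (scale, zero_point) mapping the real range onto [qmin, qmax]."""
    # Widen the range if needed so that real 0.0 is always inside it.
    rmin, rmax = min(rmin, 0.0), max(rmax, 0.0)
    scale = (rmax - rmin) / (qmax - qmin)         # real-valued size of one bucket
    zero_point = int(round(qmin - rmin / scale))  # integer that represents 0.0
    return scale, zero_point


def quantize(x, scale, zero_point, qmin=0, qmax=255):
    """Map a real value to its bucket; values outside the range are clipped."""
    q = int(round(x / scale)) + zero_point
    return max(qmin, min(qmax, q))


def dequantize(q, scale, zero_point):
    """Map a bucket index back to the real value it stands for."""
    return scale * (q - zero_point)


# A wide range keeps the outlier 4.0 but uses coarse buckets; a narrow range
# has fine buckets but clips the outlier to the top of the range.
values = [0.0, 0.004, 0.3, 1.0, 4.0]
for rmin, rmax in [(0.0, 4.0), (0.0, 1.0)]:
    scale, zp = affine_params(rmin, rmax)
    roundtrip = [dequantize(quantize(v, scale, zp), scale, zp) for v in values]
    print((rmin, rmax), [round(r, 4) for r in roundtrip])
```

Which of the two ranges is better depends on whether the model cares more about the outlier or about the small values, which is the empirical question raised in the talk.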
But depending on the model and the problem, we wouldn't really know if it's useful or not, because sometimes models don't care about those extraneous values and sometimes they're the most important thing in the whole model.

I mean, the tricky thing is that when you set your min/max and you use something like int8, you only have 255 values between the min and max. All the information has to be kept in one of those 255 values. If your min and max are minus infinity and infinity, that's really useless. But if your min is zero and your max is 0.01, you can represent computations with a lot of precision. So it's a trade-off.

Yeah. And we do different types of these depending on the model, and we've seen weird things. It's always this battle between how much the network cares about these extreme values versus how much it cares about the average rounding error along the way. So it's always this rounding versus clipping, and we just play with this a lot.

You mentioned min and max being primarily inferred from training, right? But would you ideally also do this at inference, or keep it as a constant fixed during training or post-training?

OK, so it's during training, or at something like model compilation time, where there's a step.

So it wouldn't be considered quantization if you just reduced float32 to float16, for example, or float8 or whatever, where you still have a separate exponent and you just have fewer bits?

Technically it is; by the textbook definition it is quantization. But the quantization we're talking about here is this integer quantization where you have a shared min/max.

And you really don't want to have that per value, so the sharing is the thing that's super useful?

Exactly. In other literature, like DSP literature, it's sometimes called block floating point, where you have the exponent shared across all values of, say, a tensor, rather than one exponent per element. So in a way,
float is just like per-element quantization.

So, during training. The idea of during-training quantization is that you want to somehow get this network to be robust to the error that quantization introduces, so you emulate the effect of quantization in the forward pass. If you ever see these TensorFlow fake-quant operations or the contrib quantize rewriter tool, this is its goal. It's saying: given a graph, we'll rewrite the forward pass to emulate the error due to quantization, and then in the backward pass we'll do some tricks to skip over the non-differentiable parts that quantization introduces. The goal is that backprop will magically make the weights better for quantization. This can often get the best accuracy for a particular scheme of quantization, but it's also really hard to train sometimes. Machine learning, as we all know, is the art of making as few changes to your training as possible to get it to converge, and the second you do more, oftentimes you won't even converge if you go to too low a precision. Then you just have to play around a lot if you try the training route. Additionally, the error introduced during training is specific to a particular target, so if you want the result of your training to be portable and work across many different chips, you're kind of in trouble if they have different characteristics.

So by emulating quantization in the forward pass, after every op you just apply the quantization?

Yeah, and it's a bit trickier than after every op, because it's after every rescale that the hardware expects. A specific example: in TensorFlow you have conv, bias add, ReLU. In most of these inference backends those are fused into one fat conv-bias-ReLU, and your rescales are only at the inputs of the conv and the outputs of the ReLU, so you're only supposed to emulate quantization there. So you kind of need knowledge of what the target's expectations are to decide where to
put it, so it's not just before and after every op.

And you just use the current running max and min?

Oh yeah, so right now we do a moving average. For certain models we played with absolute min and absolute max, and sometimes we use schedules to slowly, manually constrain it. This is the art part, and it's not really well understood how to do that generally. Right now, for all the vision models we do a moving average, and it seems to work pretty well, but we don't know if that's optimal at all. It turns out backprop is kind of magical.

And in backprop you don't apply this at all?

In backprop we use this thing called the straight-through estimator. The main problem with this quantization is that it's like a step function, so it's not differentiable. So we pretend it's an identity and we just pass the gradient right through, and this gets it to train.

And there's never a case where quantization is used in training just to speed up the training? It's only used in training because of the idea that it would speed up inference?

Yes. There is some work, I don't know if it's ever used in practice, but there have been a few papers over the years that do quantization for speeding up training as well. But for everything in this talk, the goal is inference, so this is purely to emulate what's happening at inference, and oftentimes it'll be slower to train these models than to just train a float one.

And just to be sure: more than actually speeding up inference, the goal with quantization-aware training is to actually reduce errors, right?

Yeah, to reduce the accuracy that you lose when you eventually go to inference. But the ultimate goal of this whole tooling is to enable inference performance for some particular hardware. That being said, we've been trying to work really hard to avoid the need for this in most general cases.
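The fake-quantization forward pass, the straight-through backward pass, and the moving-average range tracking described above can be sketched in a few lines of pure Python. This is an illustrative sketch under simplifying assumptions, not the TensorFlow implementation: the names `fake_quant`, `fake_quant_grad`, and `MovingAverageRange` are hypothetical, the decay value is arbitrary, and the sketch does not nudge the range so that zero is exactly representable.

```python
def fake_quant(x, rmin, rmax, num_bits=8):
    """Forward pass: quantize to num_bits and immediately dequantize.

    The result is still a float, but it carries the rounding and clipping
    error that a real integer kernel would introduce.
    """
    levels = 2 ** num_bits - 1
    scale = (rmax - rmin) / levels
    clipped = max(rmin, min(rmax, x))
    return rmin + round((clipped - rmin) / scale) * scale


def fake_quant_grad(upstream_grad):
    """Backward pass with a straight-through estimator.

    Rounding is a step function with zero gradient almost everywhere, so it
    is treated as the identity and the incoming gradient passes unchanged.
    """
    return upstream_grad


class MovingAverageRange:
    """Tracks an exponential moving average of per-batch min and max."""

    def __init__(self, decay=0.99):
        self.decay = decay
        self.rmin = None
        self.rmax = None

    def update(self, batch):
        bmin, bmax = min(batch), max(batch)
        if self.rmin is None:
            # First batch initializes the range directly.
            self.rmin, self.rmax = bmin, bmax
        else:
            d = self.decay
            self.rmin = d * self.rmin + (1 - d) * bmin
            self.rmax = d * self.rmax + (1 - d) * bmax
        return self.rmin, self.rmax
```

In a training loop, the tracker would be updated with each batch of activations and its current range passed to `fake_quant`, while the optimizer sees gradients as if the rounding were not there.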
Doing this during training will always be the most accurate, because you're letting backprop make up for it, but we think we can get pretty far with post-training techniques. After training, the trade-offs are that you can't rely on this magical huge hammer of backpropagation to fix all your accuracy problems, but you can do some things. The main benefit is that the user doesn't have to retrain, which is a pain, because oftentimes it won't converge, you have to mess with hyperparameters, and your portability is gone. So here there's a compile step, or sometimes, like you were saying, even at runtime there's a step to collect these statistics for min/max.

We'll get back to quantization for the majority of the talk, but I just want to mention pruning. The other technique we have is pruning, where the goal is to end up with tensors in your model that have many zeros. If you do arbitrary pruning, where your resulting model has many zeros, it's much more compressible. Additionally, if you have a certain structure to your pruning or a certain percentage of sparsity, you can have optimized kernels that accelerate things. So the benefit is that you have so many repeated values that you can just zip your file and you're good to go, and if you actually have hardware support for sparsity, you can get faster kernels. One more point on pruning, which I think is cool: since you have so many repeated zeros, and zero is something that we represent exactly in quantization, it actually works really well with quantization and often helps quantization, since they're compressing in two orthogonal ways. So that's kind of neat.

So now we'll talk about all the tools. Last year we released this Model Optimization Toolkit, which is a suite of TensorFlow and TensorFlow Lite tools that aim to make all these techniques doable and let us
play around with trying out new things with quantization and pruning. You can check it out here. Here's my world-famous hand; this one is on Twitter, and it's my hand. Yeah, that's true, we've been reusing these pictures way too much.

So we have quantization and sparsity. First we'll deep dive into all the tools in quantization, with a bit more detail on how we actually do quantization. The first thing we've done in TensorFlow Lite is try to understand, for many of the canonical models, all the operations that are in there and what some standard recipes are for implementing these fixed-point quantized kernels. The goal here is that we want some sort of endorsement for new hardware that comes in. We know this is going to be a work in progress, because new chips are coming all the time, they have different constraints, and they don't want to listen to one standard, but we want to be some reference point against which we can compare: this new quantization scheme, how does it compare to this? So the goal with this is a bunch of CPU reference ops that have been tried on many models and that we understand to some extent.

This is a bit more detail on how we actually do the quantization. The bottom number line is the floating-point scale, and that histogram is a pretend distribution of values in a particular tensor. The idea of quantization is that instead of wasting all our bits representing a range that we don't even use, let's figure out only the part that the histogram lies in and represent only that with a smaller number of bits. The top number line is the integer equivalent, where we took that histogram and we just use these 255 buckets to represent the numbers. This is just that same affine equation. At inference time we actually change this min/max into two different things called scale and zero point. Scale is the floating-point size of every bucket, and zero point is the integer value
that corresponds exactly to floating-point zero, and this turns out to be really important. We started out not doing this, and it resulted in a lot of bias issues: for every accumulation, if you don't represent zero exactly, you just push in this bias. It also has a convenient side: oftentimes in models we do padding, so zero is this special number that we have to represent. But the main thing is the accumulation thing. This is just to give some insight into what these tools are actually doing and why we need the information.

We won't go too much into depth here, but here's the summary of our quantization spec. We have per-axis symmetric weights and per-layer asymmetric activations, and all these things are signed integer values. I'll explain each of these, because right now they won't make any sense.

The first part of the specification is symmetry. The idea here is: do you want your scale to be able to represent values that are really not centered around zero? This often means (I'll go back to the equation real quick) that zero point here: do we want to pay the cost of that addition? Depending on where this happens in your math, it can be really expensive or not too expensive. For symmetry, we've decided to make weights symmetric. The reason is that, while weights are constants, their zero point is multiplied by the activations, and the activations are dynamic. So this is a cost that depends on the input and has to be paid every time: with asymmetric weights, every inference has an additional cost. Weights being symmetric avoids this whole zero-point multiplication with activations. We can answer more later, but I won't go too much in depth here. So it's faster if we make weights symmetric. As for activations, they're only multiplied by a constant value, so having this zero point is not too expensive, so we leave them
asymmetric. And activations are often ReLUs and such, which are super asymmetric, so we'd be throwing away a bit if we didn't do that. Cool.

The second thing we can play around with in quantization is the granularity at which we decide to have these min/maxes or scales. Traditionally we were doing per-layer or per-tensor quantization: for a given tensor you only have one min/max. But it turns out that for convolutions and depthwise convolutions, each channel of the convolution often has a really different distribution, and when you only have one scale or one min/max for the entire tensor, you're doing a really poor job on each of these distributions. So the idea of per-channel quantization is that you have a min/max per channel. Since this is not in the inner loop of your kernels, it's really not too expensive, and it gives a huge benefit in accuracy, effectively like an extra bit.

So now to the tools. The tool fragmentation is all about how we get these min/max values that we need to do the quantization. For weights it's super easy: weights are static, so at any time we can just look at the weights, read the min/max, and quantize using those. The problem always comes with dynamic values, the activations, where you can only get an idea of the distribution by actually running realistic inputs.

The first, most naive, simplest idea for how to do quantization is: let's read the min/max at the second we know it, which is right at inference. So during runtime our graph is actually different. Before our expensive multiplies, our matmuls, we take the float input value, measure the min/max, and use those to quantize on the fly. This is an O(n) operation of quantizing on the fly; then you get the speed-up of doing an int8 times int8 multiply in your matmul, and then you go back to float at the end. The idea here is that you get the most realistic min/max range for your activations, because you're using the one for this particular inference. The flaw is that you can only
really do this on chips that have float support.

The second time we could do this is at compile time. If we want the whole graph to be integer, we want to avoid this runtime cost of measuring the min/max, because we don't want any float on any edge of this graph. So what we can do is simply move that to compile time. You have your float model, and we want to do some post-training figuring out of what the min/max values are for all these dynamic values. To do this we need some representative data that we can run through the model, collect ranges, and then fix those min/max values for the activations. This means that we're not using the perfect min/max like we were for hybrid quantization before, but we are working on getting a representative one, and we never have to have float in our inference graph. So this can go to all those integer accelerators.

I had a question, if you could flip to the previous slide. The choice of whether to do hybrid or not, is that multifaceted? It's based on improving accuracy, because now you get better min/maxes, but also the hardware needs to support float, right?

Yeah, so it's really problem specific, and we'll get a little bit into that later as well, but the short answer is yes, it's multifaceted. It's usually a good choice if you're going to CPU. It's a bad choice if you have models that have large activations: image models don't get a huge benefit from hybrid because your cost of doing this on-the-fly quantization is pretty big. And accuracy really improves for models with small activations, because you're getting a more representative range for that small tensor.

And also if you want truly low-latency inference, maybe it's very good?

Yeah, it can be, and it really depends on the model. I think we have some specific numbers, but it really shines in models that are kind of memory bound, because the main cost of this quantize thing, your activations, may not be
too big, but you're getting this huge benefit on that matmul.

So then the third tool is during-training integer-only quantization. This results in the same compatible graph as the post-training integer quantization on the previous slide, but the difference is that we're introducing the quantization into the training, as we talked about before. We're working on Keras APIs to do this. The way this will look is: you build your model as before, and you just wrap it in this quantize wrapper, and there'll be these parameters too. We won't go into much detail.

For hybrid quantization, the way it looks is: you train your normal graph in TensorFlow, and then you just enable a flag in the TF Lite converter. Right now we have hybrid and post-training only enabled in TF Lite. We want to make it general, but right now we only have specifics on the hardware capabilities of TF Lite, and we need to know these to be able to do this. So the way this looks is your normal TF Lite converter invocation, and you just add this optimizations default flag. Under the hood this is just doing this hybrid quantization: quantizing all the weights and leaving activations in float.

So, performance. First off, all these approaches get a similar model size reduction, in that you're simply taking 32 bits and going to 8 bits, so you're getting a 4x reduction in size. For latency, here we do get a speed-up in these image models, but for a lot of them we don't see as much of a speed-up as we would expect from quantization, and it's because that on-the-fly cost is actually pretty high.

And the hardware, is this all CPU?

Yeah. On accelerators, custom accelerators, the integer ones will really shine.

So, accuracy: we do see an accuracy drop in a lot of these models. For a lot of this, we are working on ways to, like, nudge
weights at different times during compilation to fix these accuracy issues. So this is not the gold standard of what quantization can get with these techniques; it's just a starting point. So yeah: 4x reduction, and you see a 10 to 50 percent speed increase in convolution models on CPU, and for memory-bound models you really see a lot more. You often get most of the bang for the buck of quantization from hybrid in those models, versus needing the full integer. That being said, for accelerators you still need to go the full integer route.

Post-training integer quantization: this is also enabled in TF Lite. You train in TensorFlow the normal way you would a float graph, and then you provide one more option to the converter. The way that looks is: you use the same flags as before, optimizations default, but now we need some data to figure out those dynamic ranges at compile time rather than at runtime. The data generator you provide needs to yield examples that you would expect to see in practice. For image models we just grab a few images from ImageNet, and usually we see that a couple hundred works well enough, but it's probably very problem specific. Under the hood this is doing that post-training quantization, where we measure the absolute min and absolute max we see for the particular activations used.

I mean, ultimately you still have inferences coming in, so even if maybe the first ones, like the first thousand, are slow, after a thousand you definitely have those statistics. Why would you ever not just fix them at that point?

That's a good question, and you could do that. Oftentimes it turns out that for RNN models you actually get an accuracy benefit from hybrid, even if you had a bunch of data, because each activation is actually getting a really unique range.

Because it's float.

Yeah, and also because, like, you can imagine
an RNN where that same matmul is actually going to change its distribution based on which time step you're on. So it really ends up being problem specific there, but you're right, for image models we absolutely could be doing that.

So, the example representative dataset: it's just how you would normally load data, and you just yield examples of these images.

Now some numbers. Before, we had released the contrib quantize rewriter, which I'm not talking about in this talk because it's deprecated in favor of a more friendly, 2.0-capable API. But those are kind of the gold-standard quantization accuracy numbers for these image classification problems. What we've seen is that with these changes of per-channel in our quantization scheme, post-training integer quantization, which is the right column, gets pretty comparable on all these models that matter, and this is without anything fancy. Denali's been looking into a lot of cool tricks for figuring out where the accuracy is going in post-training, so these numbers should be improving as well. The takeaway here is that for most things 8-bit, maybe we're good enough with post-training, and only the experts really need to use quantization-aware training.

An example of quantization not working well is the first column, SSD. It's the same base structure as MobileNet, but what you're doing with your logits is a lot more, so quantizing actually introduces a lot more error here, and we see over a percent jump between post-training and quantization-aware training.

And this "higher is better" is wrong?

Yes. The other two columns are new models, and no one ever went about doing quantization-aware training here because it was just too much work. These were released after post-training was released, they tried post-training, and post-training did really well accuracy-wise, so they just didn't bother with quantization-aware training. More models: style transfer, we got good results in quantization,
although there's not really a good metric for style transfer. The metric is, like, look at it, and it looks good enough. And then some speech models do really well. Everything's great.

So the benefit of post-training integer quantization is similar size reduction, similar speed-up on CPU for RNNs, and even better for convolutions because you don't have this cost. But the main thing this enables is that we can now run on all these integer microcontrollers and all these integer accelerators.

Here's the summary of the three tools. The flow should usually look like this: you try hybrid and see how you do on CPU. If you want to go to an accelerator, or you want more on CPU, you do post-training, where you just add a representative dataset. Then, only as a last resort, once you see post-training not getting good accuracy for you, try quantization-aware training.

Similarly, we have tools for pruning, which are during-training techniques, and they have a similar API to the quantization-aware training API. The flow usually looks like this: you build your Keras model, you apply pruning with the API, and you train. Often these pruning APIs are doing a lot less: they're very localized to your weights, so they're not really tearing apart your graph like quantization is. Pulkit can attest to this: pruning was a lot simpler implementation-wise than quantization-aware training, because for quantization training you have to understand all the fusions of your backend, whereas pruning is local to the weights. So the flow here is that you train like normal, and your resulting graph has many tensors that have lots of zeros. Right now the flow is that you can compress your file and it's smaller. In the future we're working on TensorFlow Lite runtime support for these sparse tensors, and kernel support, so you'll additionally get out-of-the-box size reduction instead of having to do
this manual compression, and you'll get a speed-up from the sparsity in the future.

When you say that it might be faster, is that in the case of structured sparsity, for some kind of block processing, or is it for arbitrary sparsity?

Yeah, so this is something where we're trying to figure out two things, those particular questions: for a given hardware, what do we want, and how do we expose this in a way that makes sense when there's so much fragmentation across hardware and problems? For certain problems, if you do arbitrary sparsity, you probably need something like 99.9% sparsity to get a speed-up on a particular hardware. For CPUs and particular speech models, we've already been doing structured sparsity with certain block sizes, like you're saying. This training tool has the ability to set your block size, and what we need to work on in the future is: for a given hardware, what is the standard block size you need? So yeah, you're absolutely right, there's fragmentation too: will the problem allow the levels of sparsity that you desire, and is the hardware you target going to support that? For CPU, usually we need a block.

The API here is similar to the quantization API. You provide parameters for your schedule for how you want to prune, and here that final sparsity is an important number. It's basically saying, at the end of training, how many of the values in all your weights you want to be zero. So yeah, easy to use, try it. The coverage, we've found, is very good: it works on a lot of models, it seems to be a very general technique, and as I said before, it works really well with quantization as well.

Here's a graph, kind of a confusing graph, but it shows how your accuracy is affected on MobileNet, as an example, based on how much pruning you do. What we notice is that there's often a lot of pruning you get for free, and then there's a
sudden cliff. So the goal here is, for your problem, to play with the parameters and figure out where that sudden cliff is, or where you want to lie on this curve. Here we see that up to around 75-ish percent you're doing pretty good.

What technique was used?

Yeah, so here we do pruning based on low-magnitude values. There's a mask, and you update it occasionally; the mask is updated based on which values in your tensor are closest to zero. So, more numbers: works great. Skip.

And then, yeah, in summary: quantization is hard because it is problem specific and hardware specific, and the tools have lots of trade-offs depending on which problem and which hardware. And then pruning: we're starting to get into the space of accelerating pruning. Right now it's a great technique for reducing model size, and we need to explore how that's going to look for various hardware and how we're going to expose this in a general way. So otherwise, any questions I can answer?

How do you actually do the matrix multiplication, given two int inputs with a min and max for each?

Yeah, so the way it looks... I don't know if I have anything to look at. Maybe just open the code? All right, yeah, I'll try that. I'll say the words first, and if it doesn't make sense I can try to find something. So the way it actually works at inference: let's ignore zero points for now, because they just get in the way. OK, so say we're just doing a matrix multiplication. Your input has a certain range, which corresponds to a particular scale. Your second input, your weight, has a certain range, which corresponds to a certain scale. So you have one scale, another scale, and then your output has a third, independent range, so a third scale. So what we do is: your int8 matrix multiplication actually gets accumulated into int32 values. OK, so if you imagine all those int32 values in the accumulator, they have an
implicit scale, because you just multiplied: they have an implicit scale of these two scales multiplied, right? If you wanted to recover the float value from these int32 values, you'd just multiply by these two scales. Wait, so, this is not how it actually works, but I'm just explaining the math. So then our goal is to eventually output int8 values that lie on the output scale. So what we do in practice is, we want to get from this int32 value, which has an implicit scale of s1 times s2, and go to s3. So we just multiply by s3 and divide by s1 and s2; we make a new scale that's those three values as that fraction. So in practice, inference just looks like: int8 times int8, into int32, do this one rescale, which is this s3 over s1 s2, and then you're at your int8 value. Does that make sense?

Yes. So do I have to do an integer division?

Yeah. So that rescale is a floating-point value, right, and we don't want to do that. So what we do is decompose it into two integers, and sometimes just a shift; sometimes your target only supports power-of-two scales, because it just wants to implement that as a shift. So that's a whole other thing: there are lots of ways to implement that rescale, with trade-offs. What TFLite does by default is decompose it into two integers, so we almost emulate float.

In training, is that consistent, I mean, with the in-training scheme that you described? There is something [inaudible].

No. So yeah, there are a lot of techniques that we need to start including. Right now these techniques have been these kind of get-something-working type techniques, where first there was no training [inaudible], so we went with quantization-aware training. But yeah, more and more... I think you're talking about things like "Westie" [name unclear], where the idea is, if you're
given that particular min/max, what is the perfect range, the perfect distribution of values, such that you decrease quantization error? And the answer is a uniform distribution. So "Westie" [name unclear] tries to do this by introducing a loss into your training. And yeah, these things are all compatible with training the float model, but we're not offering them out of the box, because in some of my experiments I noticed that it only works well for a particular model after you've trained for a bit; we still don't have general knowledge on when exactly to use it. So we should be offering all of these, and we plan to, in this toolkit, as choices for users.

OK, that's great. You mentioned there are in-training and post-training kinds of techniques, and then also you can do hybrid or you can do pure int. So in this quadrant, one option was missing: you showed three examples, but you kind of implied that you wouldn't be doing in-training quantization combined with hybrid. Why is that?

You're absolutely right, and it's just that the tooling is not doing that right now. But that's exactly the direction we want to go: get some metrics on what error the quantization is introducing, and use that to drive things like, should we be doing one or the other? Should we be doing eight bits? For this one tensor, does it make sense to leave it in float? Does it make sense to bump it up to sixteen bits? But that's absolutely right, there's nothing [inaudible] wrong with it.

It's just another option that you will eventually add? OK.

And for context, what we've added now is that if you have ops in your graph that don't support quantization, we just leave them in float. So we're already starting to get in the direction of partial quantization. But that's exactly the direction, and the piece that's kind of missing is these two information hooks, where one is: what does
quantization do to your problem task, like your error for your actual problem? We can get things like signal-to-noise ratio, but oftentimes that's not too representative of what it's doing to your task. So one thing we need is, for this op, what is it doing to the problem; then we can make decisions like this. And the other thing is some pluggable specification of hardware that says, for this hardware, does it even support hybrid? Because if not, then it's not an option. But yeah, that's exactly what we need to do. [Music]
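
To make the rescale step from the Q&A concrete, here is a minimal pure-Python sketch of an int8 dot product that accumulates in a wide integer and then rescales using only an integer multiplier and a shift. It is an illustration under stated assumptions, not TFLite's actual kernel: all function names are hypothetical, zero points are ignored as in the talk, and scales follow the convention real ≈ scale × q, under which the combined factor is s_in × s_w / s_out (if scales are instead defined as quantized units per real unit, the same factor reads s3 / (s1 × s2), as spoken).

```python
import math

INT8_MIN, INT8_MAX = -128, 127

def clamp_int8(q):
    return max(INT8_MIN, min(INT8_MAX, q))

def quantize(values, scale):
    """Quantize reals to int8, assuming real ~= scale * q (zero point ignored)."""
    return [clamp_int8(round(v / scale)) for v in values]

def quantize_multiplier(real_multiplier):
    """Express a positive float as (m, exp) with real_multiplier ~= m * 2**(exp - 31)."""
    frac, exp = math.frexp(real_multiplier)  # real_multiplier = frac * 2**exp, 0.5 <= frac < 1
    m = round(frac * (1 << 31))
    if m == (1 << 31):                       # rounding pushed frac up to 1.0
        m //= 2
        exp += 1
    return m, exp

def rescale(acc, m, exp):
    """Multiply the accumulator by m * 2**(exp - 31) using integer ops only."""
    shift = 31 - exp
    if shift <= 0:
        return (acc * m) << -shift
    # Add half of the divisor before shifting so the result rounds to nearest.
    return (acc * m + (1 << (shift - 1))) >> shift

def int8_dot(qx, qw, s_in, s_w, s_out):
    """int8 x int8 products summed in a wide accumulator, then one rescale to int8."""
    acc = sum(a * b for a, b in zip(qx, qw))  # would be an int32 accumulator on hardware
    m, exp = quantize_multiplier(s_in * s_w / s_out)
    return clamp_int8(rescale(acc, m, exp))

# Illustrative scales: input range [-1, 1], weight range [-2, 2], output range [-4, 4].
s_in, s_w, s_out = 1 / 127, 2 / 127, 4 / 127
x = [0.5, -1.0, 0.25]
w = [1.0, 0.5, -2.0]
q_out = int8_dot(quantize(x, s_in), quantize(w, s_w), s_in, s_w, s_out)
approx = q_out * s_out  # dequantized result, close to the float dot product of x and w
```

A target that only supports power-of-two scales would drop the multiplier `m` and keep just the shift, trading some accuracy for a cheaper rescale.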

Original Description

Take an inside look into the TensorFlow team’s own internal training sessions--technical deep dives into TensorFlow by the very people who are building it! In this episode of Inside TensorFlow, Software Engineer Suharsh Sivakumar discusses the TensorFlow Model Optimization Toolkit, with a concentration in quantization and pruning. Let us know what you think about this presentation in the comments below! Watch more from Inside TensorFlow Playlist → https://goo.gle/Inside-TensorFlow Subscribe to the TensorFlow channel → https://goo.gle/TensorFlow



This video provides an in-depth look at the TensorFlow Model Optimization Toolkit, covering quantization and pruning techniques to improve model performance and reduce size. It discusses various aspects of quantization and pruning, including fine quantization, integer quantization, and hybrid quantization.

Key Takeaways

Build your model as before and wrap it in a quantized wrapper
Train your normal TensorFlow graph and enable a flag for hybrid quantization
Invoke the TFLite converter with the default optimizations flag
Provide data to measure dynamic ranges at compile time for post-training integer quantization
Use the TF Model Optimization Toolkit to reduce model size and improve inference speed
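
As a rough illustration of what "measuring dynamic ranges" from representative data involves, the sketch below tracks a tensor's min/max over a few sample batches and derives an int8 scale and zero point from them. The function names and sample values are hypothetical; the real converter does this internally, and its exact calibration method is not described in the transcript.

```python
def calibrate_range(batches):
    """Track the running min/max of a tensor over representative batches."""
    lo, hi = 0.0, 0.0  # start at 0 so the range always contains zero
    for batch in batches:
        lo = min(lo, min(batch))
        hi = max(hi, max(batch))
    return lo, hi

def affine_params(lo, hi, qmin=-128, qmax=127):
    """Derive (scale, zero_point) so that real ~= scale * (q - zero_point)."""
    if hi == lo:
        return 1.0, 0  # degenerate all-zero tensor; any scale works
    scale = (hi - lo) / (qmax - qmin)
    zero_point = round(qmin - lo / scale)
    zero_point = max(qmin, min(qmax, zero_point))
    return scale, zero_point

def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))  # values outside the calibrated range saturate

def dequantize(q, scale, zero_point):
    return (q - zero_point) * scale

# Hypothetical activations observed while running representative inputs.
sample_batches = [[0.0, 1.5, 3.0], [-1.0, 2.0]]
lo, hi = calibrate_range(sample_batches)
scale, zero_point = affine_params(lo, hi)
```

Because the range is forced to include zero, the real value 0.0 maps exactly onto the zero point, which matters for padding and ReLU outputs.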

💡 Quantization and pruning are compatible techniques that can be used together to improve model performance and reduce size, and the TF Model Optimization Toolkit provides a suite of tools to achieve this.
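
The magnitude-based pruning described in the talk (a mask, updated occasionally, that zeroes the values closest to zero until a final sparsity is reached) can be sketched in plain Python as follows. This is a simplified illustration, not the toolkit's implementation: the function names are hypothetical, the polynomial ramp is an assumed schedule shape, and the 0.75 default only echoes the MobileNet example from the talk.

```python
def sparsity_at(step, begin_step, end_step, initial=0.0, final=0.75, power=3):
    """Target sparsity at a training step, ramping from `initial` to `final`."""
    if step <= begin_step:
        return initial
    if step >= end_step:
        return final
    progress = (step - begin_step) / (end_step - begin_step)
    return final + (initial - final) * (1.0 - progress) ** power

def magnitude_mask(weights, sparsity):
    """Return a 0/1 mask that drops the `sparsity` fraction of smallest-magnitude weights."""
    n_prune = round(sparsity * len(weights))
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    mask = [1] * len(weights)
    for i in order[:n_prune]:
        mask[i] = 0
    return mask

def apply_mask(weights, mask):
    return [w if m else 0.0 for w, m in zip(weights, mask)]

# Hypothetical flattened weight tensor.
weights = [0.05, -0.9, 0.3, -0.01, 0.7, 0.2, -0.4, 0.02]
mask = magnitude_mask(weights, 0.5)
pruned = apply_mask(weights, mask)
```

In a training loop the mask would be recomputed every so many steps with `sparsity_at(step, ...)` as the target, and the surviving weights could then be quantized as usual.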

