Deep Compression
Key Takeaways
The Deep Compression technique reduces the file size of deep neural networks through a three-stage pipeline of Pruning, Quantization, and Huffman Encoding, as demonstrated on the AlexNet and VGG-16 models.
Full Transcript
[Music] This video will explain Deep Compression. Deep Compression is a technique to dramatically reduce the file size of deep neural networks. The results in this paper shrink AlexNet by 35 times, from 240 megabytes to 6.9 megabytes. They also shrink VGG-16 by a factor of 49, from 552 megabytes to 11.3 megabytes. This is really interesting for mobile developers: if your app is over 100 megabytes, the user gets an extra notification and may decide not to download it. In addition, large model files make your app consume more energy and take up more space, which matters when people are looking to clear space on their phones.
Deep Compression works as a three-stage compression pipeline: pruning, quantization, and Huffman coding.
Pruning reduces the number of connections, or weights, by 9 to 13 times in this paper (9 times for AlexNet, 13 times for VGG-16). They go through the weights in the network and remove the ones whose magnitude is below a certain threshold. For example, a weight of 0, or maybe 0.1 or -0.1, would be completely masked out and no longer used in the matrix multiplications that compute the network's output. After pruning, they store the weights in a compressed sparse row (CSR) format to save space, and instead of storing the absolute position of each remaining weight, they store the index difference between consecutive unpruned weights.
After pruning, they quantize the weights, and this is the most interesting idea to me. Quantization here means weight sharing, and it greatly reduces the storage requirement. They use the k-means algorithm to cluster the weights into groups of similar values. For example, in the figure, all the weights labeled green map to 1, the weights labeled orange map to 0, and so on. Each weight can then be stored as a short index into a codebook of shared values: with 3 bits you can index 2^3 = 8 shared weights, so every weight in the network is set to one of these eight quantized values from the codebook.
Then they retrain the network. They still compute the partial derivative with respect to each individual weight, even though many weights share the same value, but they then aggregate (sum) the gradients of all the weights in each cluster. That aggregated gradient tells you how much each codebook entry contributes to the loss, and it is used to train the codebook itself. So the initial centroids are derived after the full training, and they are fine-tuned after quantization.
When quantizing AlexNet, they use 8 bits for the convolutional layers and 5 bits for the fully connected layers. Five bits means there are 32 different values the weights can take on, and remarkably this causes no loss of accuracy in their experiments. So for the fully connected layers, quantization reduces each weight from 32 bits (up to 2^32 possible values) to 5 bits (32 possible values). This equation shows the compression rate you get from quantization. This plot shows how the weights are distributed relative to the codebook, and it's interesting to see that the distribution is bimodal, basically two Gaussians next to each other.
Once the weights are quantized, they apply Huffman encoding. Huffman encoding is a data-compression technique that takes advantage of a biased distribution of values. In this bimodal distribution, most of the values lie at the peaks, so you would use a short code such as 0 or 10 for the most frequently occurring values and longer bit strings for the rare ones. Because no code word is a prefix of another, the bit stream can still be decoded unambiguously. This is a classical technique in data compression that really reduces file size, and it is used in formats like JPEG and PNG.
These are the results across the different networks they compress with the three-stage pipeline of pruning, quantization, and Huffman coding. The compression rates are all impressive, especially when you look at the number of parameters and the file size. Here are the results for the AlexNet model: the table shows the percentage of weights that remain after pruning, the weight bits after pruning and quantization, and how pruning, quantization, and Huffman coding combine to achieve a massive compression rate.
Thanks for watching this video on Deep Compression. The paper link is provided in the description. Please subscribe to Henry AI Labs for more deep learning videos. [Music]
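The pruning and sparse-storage step from the transcript can be sketched in NumPy as follows: weights whose magnitude falls below a threshold are zeroed, and the survivors are stored as values plus index differences (gaps) instead of absolute positions. This is a minimal illustration, not the paper's code. The threshold, the 3-bit gap width, and the filler-zero convention for gaps that do not fit are assumptions, and the function names are hypothetical.

```python
import numpy as np

def magnitude_prune(weights, threshold):
    """Zero out every weight whose magnitude is below threshold (hypothetical helper)."""
    mask = np.abs(weights) >= threshold
    return weights * mask, mask

def relative_index_encode(flat, index_bits=3):
    """Store nonzero values with the index difference from the previous stored entry.

    Assumed convention: a gap must fit in index_bits; larger gaps are bridged
    by inserting filler entries that hold the value 0.0.
    """
    max_gap = 2 ** index_bits - 1
    values, gaps = [], []
    last = -1
    for i in np.flatnonzero(flat):
        gap = int(i - last)
        while gap > max_gap:          # gap too large for index_bits: add a filler zero
            values.append(0.0)
            gaps.append(max_gap)
            gap -= max_gap
        values.append(float(flat[i]))
        gaps.append(gap)
        last = i
    return values, gaps

def relative_index_decode(values, gaps, length):
    """Rebuild the dense vector by accumulating the stored gaps."""
    out = np.zeros(length)
    pos = -1
    for v, g in zip(values, gaps):
        pos += g
        out[pos] = v
    return out

# Illustrative layer slice: only -0.8 and 1.2 survive a threshold of 0.1.
w = np.array([0.05, -0.8, 0.0, 0.02] + [0.0] * 9 + [1.2])
pruned, mask = magnitude_prune(w, threshold=0.1)
vals, gaps = relative_index_encode(pruned, index_bits=3)
restored = relative_index_decode(vals, gaps, len(pruned))
```

Here the gap of 12 between the two surviving weights does not fit in 3 bits, so the sketch stores a filler zero with gap 7 followed by the real weight with gap 5.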
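The k-means weight sharing and codebook fine-tuning described in the transcript can be sketched like this, assuming a one-dimensional Lloyd's k-means whose centroids start linearly spaced between the smallest and largest weight (an initialization choice of this sketch). `codebook_gradient` sums the gradients of all weights that share a centroid, which is the aggregation the transcript describes for training the codebook. The weight matrix, gradient, and learning rate are illustrative values, and all names are hypothetical.

```python
import numpy as np

def kmeans_1d(weights, k, iters=20):
    """Cluster weights into k shared values with Lloyd's k-means.

    Returns (codebook, labels); labels has the same shape as weights.
    """
    w = weights.ravel()
    centroids = np.linspace(w.min(), w.max(), k)   # assumed linear initialization
    for _ in range(iters):
        # Assign each weight to its nearest centroid.
        labels = np.argmin(np.abs(w[:, None] - centroids[None, :]), axis=1)
        # Move each centroid to the mean of its members (skip empty clusters).
        for c in range(k):
            members = w[labels == c]
            if members.size:
                centroids[c] = members.mean()
    labels = np.argmin(np.abs(w[:, None] - centroids[None, :]), axis=1)
    return centroids, labels.reshape(weights.shape)

def codebook_gradient(grad, labels, k):
    """Sum the per-weight gradients of every weight assigned to the same centroid."""
    return np.bincount(labels.ravel(), weights=grad.ravel(), minlength=k)

# Illustrative 4x4 weight matrix shared among k = 4 values (2-bit indices).
W = np.array([[ 2.09, -0.98,  1.48,  0.09],
              [ 0.05, -0.14, -1.08,  2.12],
              [-0.91,  1.92,  0.00, -1.03],
              [ 1.87,  0.00,  1.53,  1.49]])
centroids, labels = kmeans_1d(W, k=4)
W_shared = centroids[labels]          # every weight replaced by its codebook value

# Hypothetical gradient dL/dW from one backward pass and an assumed learning rate.
grad = np.full(W.shape, 0.1)
lr = 0.01
centroids_ft = centroids - lr * codebook_gradient(grad, labels, 4)
```

For this matrix the codebook settles at roughly -1.0, 0.0, 1.5, and 2.0, so only the four shared values plus a 2-bit index per weight need to be stored. During fine-tuning, only those four values are updated.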
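The transcript mentions an on-screen equation for the compression rate of quantization. That equation is presumably the paper's r = nb / (n log2(k) + kb): n weights of b bits each are replaced by n cluster indices of log2(k) bits plus a codebook of k values of b bits. The sketch below evaluates it and also checks the overall ratios quoted in the video. The function name and the 1M-weight layer are hypothetical examples.

```python
import math

def quantization_compression_rate(n, b, k):
    """r = n*b / (n*log2(k) + k*b) for n weights of b bits shared among k values."""
    return (n * b) / (n * math.log2(k) + k * b)

# 4x4 matrix of 32-bit floats shared among 4 values (2-bit indices): 512 / 160.
small = quantization_compression_rate(n=16, b=32, k=4)

# Hypothetical fully connected layer of 1M weights with 5-bit indices (k = 32).
fc = quantization_compression_rate(n=1_000_000, b=32, k=32)

# Overall file-size ratios quoted in the video (MB before / MB after).
alexnet_ratio = 240 / 6.9
vgg16_ratio = 552 / 11.3
```

For large layers, the codebook term kb becomes negligible, so the rate approaches b / log2(k), which is 32 / 5 = 6.4 for 5-bit fully connected layers. The full 35x and 49x results come from combining this with pruning and Huffman coding.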
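The Huffman stage described in the transcript can be sketched with Python's standard `heapq` module. This is a textbook Huffman construction applied to a stream of quantized cluster indices, not the paper's implementation. The skewed index stream is made up to mimic the "most values at the peaks" situation, and the function names are hypothetical.

```python
import heapq
from collections import Counter

def huffman_code(symbols):
    """Build a prefix-free Huffman code {symbol: bitstring} from a symbol sequence."""
    freq = Counter(symbols)
    if len(freq) == 1:
        return {next(iter(freq)): "0"}
    # Heap entries: (count, unique tiebreak, {symbol: code suffix built so far}).
    heap = [(count, i, {sym: ""}) for i, (sym, count) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        # Merge the two least frequent subtrees, prefixing 0 and 1 to their codes.
        c1, _, left = heapq.heappop(heap)
        c2, _, right = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in left.items()}
        merged.update({s: "1" + code for s, code in right.items()})
        heapq.heappush(heap, (c1 + c2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

def huffman_encode(symbols, code):
    return "".join(code[s] for s in symbols)

def huffman_decode(bits, code):
    """Decode bit by bit; this is unambiguous because no code word prefixes another."""
    inverse = {v: k for k, v in code.items()}
    out, buf = [], ""
    for bit in bits:
        buf += bit
        if buf in inverse:
            out.append(inverse[buf])
            buf = ""
    return out

# Hypothetical skewed stream of 2-bit cluster indices: index 1 dominates.
indices = [1] * 8 + [0] * 4 + [2] * 2 + [3] * 2
code = huffman_code(indices)
bits = huffman_encode(indices, code)
fixed_bits = 2 * len(indices)   # 32 bits with fixed-length 2-bit indices
```

The frequent index gets a 1-bit code and the rare ones get 3-bit codes, so the stream shrinks from 32 bits to 28 bits without any loss.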
Original Description
This video will explain Deep Compression in Deep Neural Networks! This technique reduces the size of VGG-16 from 552 MB to 11.3 MB through a pipeline of Pruning, Quantization, and Huffman Encoding!
Thanks for watching, please subscribe!
Paper Link: https://arxiv.org/abs/1510.00149
Uploaded by Connor Shorten