Deep Compression

Connor Shorten · Advanced ·📄 Research Papers Explained ·7y ago

Key Takeaways

The Deep Compression technique reduces the file size of deep neural networks through a pipeline of Pruning, Quantization, and Huffman Encoding, as demonstrated on the AlexNet and VGG-16 models.

Full Transcript

[Music] This video will explain Deep Compression. Deep Compression is a technique to dramatically reduce the file size of deep neural networks. The results of this paper reduce AlexNet by 35 times, from 240 megabytes to 6.9. They also reduced VGG-16 by a factor of 49, from 552 megabytes to 11.3. This is really interesting for mobile developers especially, because if you have an app which is over a hundred megabytes, the user will get this extra notification and maybe won't download your app. In addition, having these large file sizes in your apps results in higher energy consumption, and a large app stands out when people are looking to clear some space on their phones.

So the way that Deep Compression works is a three-stage compression pipeline of pruning, quantization, and Huffman coding. Pruning reduces the number of connections, or weights, by 9 to 13 times in this paper. The way that they do this is that they go through the weights in the neural network and remove the weights whose magnitude is below a certain threshold. So for example, if the neural network weight is zero, or maybe 0.1 or negative 0.1, it would be completely masked out, and they would no longer use it in the matrix multiplications that compute the neural network output. After they prune the weights, they store what is left in compressed sparse row format in order to save space, and instead of absolute positions it stores the index difference between the weights that haven't been masked out.

After pruning, they quantize the weights, and this is the most interesting idea to me. Quantization is this idea of weight sharing, and it can really speed up training and reduce the storage requirement. What they do is use a k-means algorithm to cluster the weights into similar groups. So for example, here all the weights that are labeled green map to one, the weights labeled orange map to zero, and so on. They can now represent all the weights in the network with, say, three bits: if you have three bits you can represent eight weights, two to the three. They set every weight in the network to be one of these quantized weights from the codebook. Then they train the network, and they still take the partial derivative with respect to each weight, even though many weights share the same value. They then aggregate the partial derivatives of all the weights that point to the same codebook entry, and use that sum to effectively train the codebook. So the initial centroids are derived after the full training, and then they're fine-tuned after the weights are quantized.

In quantizing AlexNet, they use eight bits for the convolutional layers and five bits for the fully connected layers. Using five bits in the quantization means that you have 32 different values for the weights that can be used in the neural network, and amazingly this doesn't have any loss of accuracy when they do it in this experiment. So again, quantization reduces the number of bits per weight in the fully connected layers from 32, meaning that there are two to the 32 different values the weights can take on, to five, meaning 32 values they can take on. This equation right here shows the compression rate when using quantization. This plot shows how the codebook and the weights tend to be distributed, and it's pretty interesting to see that they're in this bimodal structure where there are basically two Gaussians next to each other for the distribution of the weights in the deep neural network.

Once they have quantized the weights, they turn to Huffman encoding. Huffman encoding is a technique used in data compression to take advantage of a biased distribution of values. So for example, in this bimodal distribution most of the values lie at the peaks, so what you might do is use something short like 0 or 01 to encode a really frequently occurring value, and then use longer bit strings to encode the rare values. This is a classical technique in data compression that really reduces file size in things like JPEG or PNG.

These are the results across different networks that they compress using the three-stage pipeline of pruning, quantization, and Huffman coding. They're all pretty amazing compression rates, especially when you look at the number of parameters and the size of the file. These are the results on the AlexNet model: the first column shows the percentage of the weights that are kept after pruning, then the weight bits after pruning and quantization, and then it shows how pruning, quantization, and Huffman coding interact together to really achieve a massive compression rate.

Thanks for watching this video on Deep Compression. The paper link is provided in the description. Please subscribe to Henry AI Labs for more deep learning videos. [Music]
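
The weight-sharing step described in the transcript can be sketched in a few lines of NumPy. This is a minimal illustration and not the paper's code: the function names `kmeans_1d`, `codebook_gradient`, and `finetune_step`, the evenly spaced centroid initialization, the iteration count, and the learning rate are all assumptions made for the example.

```python
import numpy as np

def kmeans_1d(values, n_clusters, n_iters=20):
    """Cluster scalar weights; returns (codebook, index of each weight's centroid)."""
    # Assumption: centroids start evenly spaced between the min and max weight.
    centroids = np.linspace(values.min(), values.max(), n_clusters)
    for _ in range(n_iters):
        # Assign every weight to its nearest centroid.
        assignments = np.argmin(np.abs(values[:, None] - centroids[None, :]), axis=1)
        # Move each centroid to the mean of the weights assigned to it.
        for k in range(n_clusters):
            members = values[assignments == k]
            if members.size > 0:
                centroids[k] = members.mean()
    assignments = np.argmin(np.abs(values[:, None] - centroids[None, :]), axis=1)
    return centroids, assignments

def codebook_gradient(weight_grads, assignments, n_clusters):
    """Sum the per-weight gradients of all weights that share a codebook entry."""
    return np.bincount(assignments, weights=weight_grads, minlength=n_clusters)

def finetune_step(centroids, weight_grads, assignments, lr=0.01):
    """One plain gradient-descent update of the shared values (hypothetical lr)."""
    grad = codebook_gradient(weight_grads, assignments, centroids.size)
    return centroids - lr * grad
```

With a codebook and its assignments in hand, the quantized layer is simply `codebook[assignments]`, so only the small codebook and one short index per weight need to be stored. The per-weight gradients passed to `codebook_gradient` would come from ordinary backpropagation, which is outside the scope of this sketch.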

Original Description

This video will explain Deep Compression in Deep Neural Networks! This technique reduces the size of VGG-16 from 552 MB to 11.3 MB through a pipeline of Pruning, Quantization, and Huffman Encoding! Thanks for watching, Please Subscribe! Paper Link: https://arxiv.org/abs/1510.00149

This video explains the Deep Compression technique, which reduces the file size of deep neural networks through a pipeline of Pruning, Quantization, and Huffman Encoding. The technique is demonstrated on AlexNet (35 times smaller, from 240 MB to 6.9 MB) and VGG-16 (49 times smaller, from 552 MB to 11.3 MB), with no loss of accuracy reported.
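
The first stage of the pipeline, pruning, can be illustrated with a small NumPy sketch. It masks out weights below a magnitude threshold and then stores the survivors together with the gap to the previous survivor, in the spirit of the index-difference storage mentioned in the transcript. The function names and the threshold of 0.2 are hypothetical, and the sketch ignores details such as capping the gap at a fixed number of bits.

```python
import numpy as np

def prune_by_magnitude(weights, threshold):
    """Zero out every weight whose absolute value is below the threshold."""
    mask = np.abs(weights) >= threshold
    return weights * mask, mask

def to_relative_index(pruned):
    """Keep the surviving values plus the distance from the previous survivor.

    The first gap is the absolute position of the first surviving weight.
    """
    positions = np.flatnonzero(pruned)
    values = pruned[positions]
    gaps = np.diff(positions, prepend=0)
    return values, gaps

def from_relative_index(values, gaps, length):
    """Rebuild the dense array from the values and their index differences."""
    dense = np.zeros(length, dtype=values.dtype)
    dense[np.cumsum(gaps)] = values
    return dense

# Example with a hypothetical threshold of 0.2.
weights = np.array([0.0, 0.1, -0.7, -0.1, 0.0, 0.9, 0.05, 0.3])
pruned, mask = prune_by_magnitude(weights, threshold=0.2)
values, gaps = to_relative_index(pruned)
restored = from_relative_index(values, gaps, weights.size)
```

Storing small gaps instead of full positions is what lets the indices fit in a few bits each, which matters once most of the weights have been removed.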

Key Takeaways

Prune neural network weights to reduce connections
Quantize weights using k-means algorithm
Apply Huffman encoding to compressed weights
Evaluate compression rates and accuracy
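
For the last step in the list, the compression rate from weight sharing alone can be written out explicitly. The form below follows the linked paper; the symbols are not named in the transcript, so treat them as notation introduced here: n is the number of weights, b is the number of bits per original weight, and k is the number of clusters in the codebook.

```latex
\[
r = \frac{n\,b}{n \log_2 k + k\,b}
\]
```

The numerator is the original storage of n weights at b bits each. The denominator is the quantized storage: one index of log2(k) bits per weight, plus a codebook of k values at b bits each. With b = 32 and k = 32, as in the fully connected layers discussed in the transcript, the codebook term becomes negligible for large n and the rate approaches 32 / 5 = 6.4. This figure covers quantization only, before the gains from pruning and Huffman coding are counted.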

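💡 The Deep Compression technique can achieve large compression rates without loss of accuracy by exploiting redundancy in neural network weights: many weights are close to zero and can be pruned, the rest cluster around a small set of shared values, and those values occur with very unequal frequencies.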
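
Huffman coding, the last stage of the pipeline, is the part that exploits unequal frequencies: codebook indices that occur often get short bit strings and rare ones get long bit strings. The sketch below is a generic textbook Huffman construction using only the Python standard library, not code from the paper, and the function name `huffman_code` and the sample indices are hypothetical.

```python
import heapq
from collections import Counter

def huffman_code(symbols):
    """Build a prefix-free code (symbol -> bit string) from symbol frequencies."""
    counts = Counter(symbols)
    if len(counts) == 1:
        # A single distinct symbol still needs one bit per occurrence.
        return {next(iter(counts)): "0"}
    # Heap entries are (count, tie_breaker, {symbol: code_so_far}).
    # The unique tie_breaker keeps Python from ever comparing the dicts.
    heap = [(count, i, {sym: ""}) for i, (sym, count) in enumerate(counts.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        # Merge the two least frequent subtrees, prefixing their codes with 0 and 1.
        c1, _, codes1 = heapq.heappop(heap)
        c2, _, codes2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in codes1.items()}
        merged.update({s: "1" + c for s, c in codes2.items()})
        heapq.heappush(heap, (c1 + c2, next_id, merged))
        next_id += 1
    return heap[0][2]

# Hypothetical quantized weight indices with a skewed distribution.
indices = [0] * 8 + [1] * 4 + [2] * 2 + [3] * 2
code = huffman_code(indices)
total_bits = sum(len(code[i]) for i in indices)
```

A fixed-length encoding of these four indices would need 2 bits each, so the more skewed the distribution, the more a variable-length code saves.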
