Depthwise Separable Convolution - A FASTER CONVOLUTION!

CodeEmporium · Advanced ·📄 Research Papers Explained ·8y ago

Key Takeaways

The video discusses depthwise separable convolution, a faster form of convolution that needs less computation and fewer parameters than standard convolution, and its applications in modern research, including neural network architectures such as MultiModel networks, Xception, and MobileNets.

Full Transcript

Convolution is a measure of overlap between two functions as one slides over the other. Mathematically, it's a sum of products. The standard convolution operation is slow to perform. However, we can speed this up with an alternative method, and that is the topic of this video: depthwise separable convolution.

Let's first very quickly go over the basics of convolution on an input volume. Consider an input volume F of shape D_F × D_F × M, where D_F is the width and height of the input volume and M is the number of input channels. If a color image were the input, then M would be equal to 3, for the R, G, and B channels. We apply convolution with a kernel K of shape D_K × D_K × M. This will give us an output of shape D_G × D_G × 1. If we apply N such kernels on the input, then we get an output volume G of shape D_G × D_G × N. The convolution operation takes the sum of products of the input and the kernel to return a scalar. This operation is continued by sliding the kernel over the input. I've explained this concept in detail in my video on convolutional neural networks; check that out for a clear understanding.

I'm more concerned now with the cost of this convolution operation, so let's take a look at that. We can measure the computation required for convolution by taking a look at the number of multiplications required. Why is that? It's because multiplication is an expensive operation relative to addition. So let's determine the number of multiplications for one convolution operation. The number of multiplications is the number of elements in that kernel, so that would be D_K × D_K × M multiplications. But we slide this kernel over the input. We perform D_G convolutions along the width and D_G convolutions along the height, and hence D_G × D_G convolutions overall. So the number of multiplications in the convolution of one kernel over the entire input F is D_G² × D_K² × M. Now this is for just one kernel. If we have N such kernels, the absolute total number of multiplications becomes N × D_G² × D_K² × M.

Let's now take a look at depthwise separable convolutions. In standard convolution, the application of filters across all input channels and the combination of these values are done in a single step. Depthwise separable convolution, on the other hand, breaks this down into two parts. The first is depthwise convolution, which performs the filtering stage, and the second is pointwise convolution, which performs the combining stage.

Let's get into some details here. Depthwise convolution applies convolution to a single input channel at a time. This is different from standard convolution, which applies convolution to all channels at once. Let us take the same input volume F to understand this process. F has a shape D_F × D_F × M, where D_F is the width and height of the input volume and M is the number of input channels, like I mentioned before. For depthwise convolution we use filters, or kernels, K of shape D_K × D_K × 1. Here D_K is the width and height of the square kernel, and it has a depth of 1 because this convolution is only applied to a single channel, unlike standard convolution, which is applied throughout the entire depth. Since we apply one kernel to a single input channel, we require M such D_K × D_K × 1 kernels over the entire input volume F. For each of these M convolutions we end up with an output of shape D_G × D_G × 1. Stacking these outputs together, we have an output volume G of shape D_G × D_G × M. This is the end of the first phase, that is, the end of depthwise convolution.

This is succeeded by pointwise convolution. Pointwise convolution involves performing a linear combination of each of these layers. Here the input is the volume of shape D_G × D_G × M. The filter K_PC has a shape 1 × 1 × M. This is basically a 1 × 1 convolution operation over all M layers. The output will thus have the same width and height as the input, D_G × D_G, for each filter. Assuming that we want to use some N such filters, the output volume becomes D_G × D_G × N.

So that's great, we've got this down. Now let's take a look at the complexity of this convolution. We can split this into two parts, as we have two phases. First we compute the number of multiplications in depthwise convolution. Here the kernels have a shape D_K × D_K × 1, so the number of multiplications in one convolution operation is D_K × D_K, or D_K². When applied over the entire input channel, this convolution is performed D_G × D_G times, so the number of multiplications for the kernel over the input channel becomes D_G² × D_K². Such multiplications are applied over all M input channels, and for each channel we have a different kernel. Hence the total number of multiplications in the first phase, that is, depthwise convolution, is M × D_G² × D_K².

Next we compute the number of multiplications in the second phase, that is, pointwise convolution. Here the kernels have a shape 1 × 1 × M, where M is the depth of the input volume, and hence the number of multiplications for one instance of convolution is M. This is applied to the entire output of the first phase, which has a width and height of D_G, so the total number of multiplications for this kernel is D_G × D_G × M. For some N kernels, we'll have N × D_G × D_G × M such multiplications. Thus the total number of multiplications is the sum of the multiplications in the depthwise convolution stage and in the pointwise convolution stage: M × D_G² × D_K² + N × D_G² × M. We can take M × D_G² common, which gives M × D_G² × (D_K² + N).

Now we compare standard convolution with depthwise separable convolution. We get the ratio as the sum of the reciprocal of the depth of the output volume, that is N, and the reciprocal of the squared dimension of the kernel, D_K: 1/N + 1/D_K². To put into perspective how effective depthwise separable convolution is, let us take an example. Consider an output feature volume of depth N = 1024 and a kernel of size 3, that is, D_K is equal to 3. Plugging these values into the relation, we get 0.112. In other words, standard convolution needs about 9 times as many multiplications as depthwise separable convolution. This is a lot of computing power.

We can also quickly compare the number of parameters in both convolutions. In standard convolution, each kernel has D_K × D_K × M learnable parameters. Since there are N such kernels, there are N × M × D_K² parameters. For depthwise separable convolutions, we'll split this once again into two parts. In the depthwise convolution phase we use M kernels of shape D_K × D_K. In pointwise convolution we use N kernels of shape 1 × 1 × M. So the total is M × D_K² + M × N, or, taking M common, M × (D_K² + N). Taking the ratio, we get the same ratio as we did for the computational power required.

So we understood exactly what depthwise separable convolution is, and also its computation cost with respect to traditional standard convolution. But where exactly has this been used? Well, there are some very interesting papers here. The first is on MultiModel neural networks. These are networks designed to solve multiple problems using a single network. A MultiModel network has four parts. The first is modality nets, to convert different input types to a universal internal representation. Then we have an encoder to process inputs, a mixer to encode inputs with previous outputs, and a decoder to generate outputs. A fundamental component of each of these parts is depthwise separable convolution. It works effectively in such large networks.

Next up we have Xception, a convolutional neural network architecture based entirely on depthwise separable convolution layers. It has shown state-of-the-art performance on large datasets like Google's JFT image dataset, a repository of 350 million images with 17,000 class labels. To put this into perspective, the popular ImageNet took 3 days to train. However, training on even a subset of this JFT dataset took a month, and it didn't even converge. In fact, it would have taken approximately three months to converge had they let it run to its full length. So that's useful: this paper is pushing convolutional neural networks to use depthwise separable convolution as the de facto choice.

Third, we have MobileNets, a neural network architecture that strives to minimize the latency of smaller-scale networks so that computer vision applications run well on mobile devices. MobileNets use depthwise separable convolutions in their 28-layer architecture. This paper compares the performance of MobileNets built with full convolution layers versus depthwise separable convolution layers. It turns out the accuracy on ImageNet only drops by about 1% while using significantly fewer parameters: from 29.3 million, the number of parameters is down to just 4.2 million. We can see that the Mult-Adds, the number of multiplications and additions, which is a direct measure of computation, has also significantly decreased for the depthwise separable convolution MobileNet.

So here are some things to remember from this video. First, depthwise separable convolution decreases the computation and number of parameters when compared to standard convolution. Second, depthwise separable convolution is a combination of depthwise convolution followed by a pointwise convolution. Depthwise convolution is the filtering step, and pointwise convolution can be thought of as the combination step. Finally, they have been successfully implemented in neural network architectures like MultiModel networks, Xception, and MobileNets.

And that's all I have for you now. Thank you all for stopping by today. If you liked the video, hit that like button. If you want to stick around, hit that subscribe button. If you really want to stick around, hit that bell icon next to the subscribe button so as to be notified of my uploads immediately. Links to important papers are down below, so check them out. Have a good day, and I'll see you in the next one. Bye!
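The two phases described in the transcript can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation used in any of the papers mentioned. The function names, stride 1, no padding (so D_G = D_F - D_K + 1), and the small test sizes are all assumptions made for the example. As is usual in deep learning, "convolution" here means sliding the kernel without flipping it.

```python
import numpy as np

def depthwise_conv(F, K_dw):
    """Filtering stage. F: (D_F, D_F, M). K_dw: (D_K, D_K, M), i.e. M kernels
    of shape D_K x D_K x 1, one per input channel. Returns (D_G, D_G, M).
    Assumes stride 1 and no padding."""
    D_F, _, M = F.shape
    D_K = K_dw.shape[0]
    D_G = D_F - D_K + 1
    G = np.zeros((D_G, D_G, M))
    for i in range(D_G):
        for j in range(D_G):
            patch = F[i:i + D_K, j:j + D_K, :]               # (D_K, D_K, M)
            # Sum of products per channel; channels are NOT mixed here.
            G[i, j, :] = np.sum(patch * K_dw, axis=(0, 1))
    return G

def pointwise_conv(G, K_pc):
    """Combining stage. G: (D_G, D_G, M). K_pc: (M, N), i.e. N kernels of
    shape 1 x 1 x M. Returns (D_G, D_G, N)."""
    return G @ K_pc  # linear combination of the M channels at every position

def standard_conv(F, K):
    """Reference standard convolution. K: (D_K, D_K, M, N). Returns (D_G, D_G, N)."""
    D_F = F.shape[0]
    D_K, _, _, N = K.shape
    D_G = D_F - D_K + 1
    out = np.zeros((D_G, D_G, N))
    for i in range(D_G):
        for j in range(D_G):
            patch = F[i:i + D_K, j:j + D_K, :]
            out[i, j, :] = np.tensordot(patch, K, axes=([0, 1, 2], [0, 1, 2]))
    return out

# Hypothetical small sizes, chosen only for illustration.
rng = np.random.default_rng(0)
D_F, D_K, M, N = 8, 3, 4, 5
F = rng.standard_normal((D_F, D_F, M))
K_dw = rng.standard_normal((D_K, D_K, M))
K_pc = rng.standard_normal((M, N))

G = depthwise_conv(F, K_dw)   # shape (6, 6, 4)
out = pointwise_conv(G, K_pc) # shape (6, 6, 5)
print(G.shape, out.shape)

# A depthwise separable layer equals a standard convolution whose kernel is
# constrained to factor as K[a, b, m, n] = K_dw[a, b, m] * K_pc[m, n].
K_equiv = K_dw[:, :, :, None] * K_pc[None, None, :, :]
print(np.allclose(out, standard_conv(F, K_equiv)))
```

The final check shows where the savings come from: the separable layer computes the same kind of output as a standard convolution, but only for kernels that factor into a per-channel spatial filter and a 1 × 1 channel mixer, so it has fewer free parameters and needs fewer multiplications.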

Original Description

In this video, I talk about depthwise Separable Convolution - A faster method of convolution with less computation power & parameters. We mathematically prove how it is faster, and discuss applications where it is used in modern research. If you liked that video, hit that like button. If you wanna stick around, hit that subscribe button. If you really wanna stick around, hit that bell icon next to the subscribe button to be notified of my uploads immediately.

Convolution Neural Networks: https://www.youtube.com/watch?v=m8pOnJxOcqY

REFERENCES
Xception (main paper): https://arxiv.org/pdf/1610.02357.pdf
Mobile Nets (Efficient CNN for mobile vision applications): https://arxiv.org/pdf/1704.04861.pdf
One model Learns all: https://arxiv.org/pdf/1706.05137v1.pdf

Music at: https://www.bensound.com/royalty-free-music/track/tenderness

This video teaches depthwise separable convolution, a faster form of convolution that needs less computation and fewer parameters, and its applications in modern research, including neural network architectures like MobileNets and MultiModel networks. The video provides a mathematical derivation of the efficiency of depthwise separable convolution and discusses its use in various neural network architectures. By watching this video, viewers can gain a deeper understanding of Convolutional Neu

Key Takeaways
  1. Understand the standard convolution operation
  2. Learn about Depthwise Separable Convolution
  3. Break down the convolution operation into two parts: depthwise convolution and pointwise convolution
  4. Apply Depthwise Separable Convolution to neural network architectures
  5. Evaluate the performance of Depthwise Separable Convolution
  6. Implement Depthwise Separable Convolution in various neural network architectures
  7. Analyze research papers on Depthwise Separable Convolution
  8. Reproduce research results on Depthwise Separable Convolution
💡 Depthwise Separable Convolution decreases computation and number of parameters compared to standard convolution, making it a more efficient method for convolutional neural networks.
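
The size of that decrease can be checked with a short calculation. The sketch below simply encodes the multiplication and parameter counts derived in the transcript. The function names are made up for this example, and the values of D_G and M are hypothetical (they cancel out of the ratio). N = 1024 and D_K = 3 are the values used in the video.

```python
def standard_mults(D_G, D_K, M, N):
    # N kernels, each applied at D_G x D_G positions, D_K * D_K * M multiplications each
    return N * D_G**2 * D_K**2 * M

def separable_mults(D_G, D_K, M, N):
    depthwise = M * D_G**2 * D_K**2   # one D_K x D_K kernel per input channel
    pointwise = N * D_G**2 * M        # N kernels of shape 1 x 1 x M
    return depthwise + pointwise

def standard_params(D_K, M, N):
    return N * M * D_K**2

def separable_params(D_K, M, N):
    return M * D_K**2 + M * N

# D_G and M are hypothetical; N and D_K follow the example in the video.
D_G, D_K, M, N = 14, 3, 512, 1024

mult_ratio = separable_mults(D_G, D_K, M, N) / standard_mults(D_G, D_K, M, N)
param_ratio = separable_params(D_K, M, N) / standard_params(D_K, M, N)

print(round(mult_ratio, 3))      # 0.112, i.e. 1/N + 1/D_K**2
print(round(param_ratio, 3))     # 0.112, the same ratio
print(round(1 / mult_ratio, 1))  # 8.9, so roughly 9 times fewer multiplications
```

Because 1/N is tiny for a deep output volume, the ratio is dominated by 1/D_K², which is why a 3 × 3 kernel gives close to a 9-fold reduction.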
