Depthwise Separable Convolution - A FASTER CONVOLUTION!
Key Takeaways
The video explains depthwise separable convolution, a faster alternative to standard convolution that needs fewer multiplications and fewer parameters, and covers its use in modern research architectures such as multi-model networks, Xception, and MobileNets.
Full Transcript
Convolution is a measure of overlap between two functions as one slides over the other. Mathematically, it's a sum of products. The standard convolution operation is slow to perform. However, we can speed this up with an alternative method that is the topic of this video: depthwise separable convolution.
Let's first very quickly go over the basics of convolution on an input volume. Consider an input volume F of shape D_F × D_F × M, where D_F is the width and height of the input volume and M is the number of input channels. If a color image was the input, then M would be equal to 3, for the R, G, and B channels. We apply convolution with a kernel K of shape D_K × D_K × M. This will give us an output of shape D_G × D_G × 1. If we apply N such kernels on the input, then we get an output volume G of shape D_G × D_G × N. The convolution operation takes the sum of products of the input and the kernel to return a scalar. This operation is continued by sliding the kernel over the input. I've explained this concept in detail in my video on convolutional neural networks. Check that out for a clear understanding.
I'm more concerned now with the cost of this convolution operation, so let's take a look at that. We can measure the computation required for convolution by taking a look at the number of multiplications required. So why is that? It's because multiplication is an expensive operation relative to addition. So let's determine the number of multiplications for one convolution operation. The number of multiplications is the number of elements in that kernel, so that would be D_K × D_K × M multiplications. But we slide this kernel over the input. We perform D_G convolutions along the width and D_G convolutions along the height, and hence D_G × D_G convolutions overall. So the number of multiplications in the convolution of one kernel over the entire input F is D_G² × D_K² × M. Now this is for just one kernel, but if we have N such kernels, the absolute total number of multiplications becomes N × D_G² × D_K² × M.
Let's now take a look at depthwise separable convolutions. In standard convolution, the application of filters across all input channels and the combination of these values are done in a single step. Depthwise separable convolution, on the other hand, breaks this down into two parts. The first is depthwise convolution, which performs the filtering stage, and then pointwise convolution, which performs the combining stage. Let's get into some details here.
Depthwise convolution applies convolution to a single input channel at a time. This is different from the standard convolution, which applies convolution to all channels. Let us take the same input volume F to understand this process. F has a shape D_F × D_F × M, where D_F is the width and height of the input volume and M is the number of input channels, like I mentioned before. For depthwise convolution, we use filters or kernels K of shape D_K × D_K × 1. Here D_K is the width and height of the square kernel, and it has a depth of 1 because this convolution is only applied to a single channel, unlike standard convolution, which is applied throughout the entire depth. And since we apply one kernel to a single input channel, we require M such D_K × D_K × 1 kernels over the entire input volume F. For each of these M convolutions, we end up with an output of shape D_G × D_G × 1. Now, stacking these outputs together, we have an output volume G of shape D_G × D_G × M. This is the end of the first phase, that is, the end of depthwise convolution.
Now this is succeeded by pointwise convolution. Pointwise convolution involves performing a linear combination of each of these layers. Here the input is the volume of shape D_G × D_G × M. The filter K_PC has a shape 1 × 1 × M. This is basically a 1 × 1 convolution operation over all M layers. The output will thus have the same width and height as the input, D_G × D_G, for each filter. Assuming that we want to use some N such filters, the output volume becomes D_G × D_G × N.
So that's great, we got this down. Now let's take a look at the complexity of this convolution. We can split this into two parts, as we have two phases. First, we compute the number of multiplications in depthwise convolution. Here the kernels have a shape D_K × D_K × 1, so the number of multiplications in one convolution operation is D_K × D_K, or D_K². When applied over the entire input channel, this convolution is performed D_G × D_G times, so the number of multiplications for the kernel over the input channel becomes D_G² × D_K². Now such multiplications are applied over all M input channels. For each channel we have a different kernel, and hence the total number of multiplications in the first phase, that is, depthwise convolution, is M × D_G² × D_K².
Next, we compute the number of multiplications in the second phase, that is, pointwise convolution. Here the kernels have a shape 1 × 1 × M, where M is the depth of the input volume, and hence the number of multiplications for one instance of convolution is M. This is applied to the entire output of the first phase, which has a width and height of D_G, so the total number of multiplications for this kernel is D_G × D_G × M. So for some N kernels, we'll have N × D_G × D_G × M such multiplications.
Thus the total number of multiplications is the sum of the multiplications in the depthwise convolution stage plus the number of multiplications in the pointwise convolution stage. We can take M × D_G² common, which gives M × D_G² × (D_K² + N). Now, comparing depthwise separable convolution with standard convolution, we get the ratio as the sum of the reciprocal of the depth of the output volume, that is N, and the reciprocal of the squared dimension of the kernel, D_K: 1/N + 1/D_K².
To put into perspective how effective depthwise separable convolution is, let us take an example. Consider an output feature volume depth N of 1024 and a kernel of size 3, that is, D_K is equal to 3. Plugging these values into the relation, we get 0.112. In other words, standard convolution needs about 9 times as many multiplications as depthwise separable convolution. This is a lot of computing power.
We can also quickly compare the number of parameters in both convolutions. In standard convolution, each kernel has D_K × D_K × M learnable parameters. Since there are N such kernels, there are N × M × D_K² parameters. In depthwise separable convolution, we'll split this once again into two parts. In the depthwise convolution phase, we use M kernels of shape D_K × D_K. In pointwise convolution, we use N kernels of shape 1 × 1 × M. So the total is M × D_K² + M × N, or we can just take M common: M × (D_K² + N). Taking the ratio, we get the same ratio as we did for the computational power required.
So we understood exactly what depthwise separable convolution is, and also its computational cost with respect to the traditional standard convolution. But where exactly has this been used? Well, there are some very interesting papers here.
The first is on multi-model neural networks. These are networks designed to solve multiple problems using a single network. A multi-model network has four parts. The first is modality nets, to convert different input types to a universal internal representation. Then we have an encoder to process inputs, a mixer to mix encoded inputs with previous outputs, and a decoder to generate outputs. A fundamental component of each of these parts is depthwise separable convolution. It works effectively in such large networks.
Next up, we have Xception, a convolutional neural network architecture based entirely on depthwise separable convolution layers. It has shown state-of-the-art performance on large datasets like Google's JFT image dataset. It's a repository of 350 million images with 17,000 class labels. To put this into perspective, training on the popular ImageNet took 3 days. However, training on even a subset of this JFT dataset took a month, and it didn't even converge. In fact, it would have taken approximately three months to converge had they let it run to its full length. So that's useful. This paper is pushing convolutional neural networks to use depthwise separable convolution as the de facto standard.
Third, we have MobileNets, a neural network architecture that strives to minimize the latency of smaller-scale networks so that computer vision applications run well on mobile devices. MobileNets use depthwise separable convolutions in their 28-layer architecture. This paper compares the performance of MobileNets with full convolution layers versus depthwise separable convolution layers. It turns out the accuracy on ImageNet only drops by about 1% while using significantly fewer parameters: from 29.3 million, the number of parameters is down to just 4.2 million. We can see that the Mult-Adds, the number of multiplications and additions, which is a direct measure of computation, has also significantly decreased for depthwise separable convolution MobileNets.
So here are some things to remember from this video. First, depthwise separable convolution decreases the computation and the number of parameters when compared to standard convolution. Second, depthwise separable convolution is a combination of depthwise convolution followed by a pointwise convolution. Depthwise convolution is the filtering step, and pointwise convolution can be thought of as the combination step. Finally, they have been successfully implemented in neural network architectures like multi-model networks, Xception, and MobileNets.
And that's all I have for you now. Thank you all for stopping by today. If you liked the video, hit that like button. If you want to stick around, hit that subscribe button. If you really want to stick around, hit that bell icon next to the subscribe button so as to be notified of my uploads immediately. Links to important papers are down below, so check them out. Have a good day, and I'll see you in the next one. Bye!
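The multiplication and parameter counts walked through in the transcript can be checked numerically. The following is a minimal sketch, not code from the video: the function names are hypothetical, and the values D_G = 56 and M = 64 are arbitrary choices for illustration (the transcript's example only fixes N = 1024 and D_K = 3).

```python
def standard_conv_cost(d_g, d_k, m, n):
    """Return (multiplications, parameters) for a standard convolution.

    d_g: output width/height, d_k: kernel width/height,
    m: input channels, n: number of kernels (output channels).
    """
    mults = n * d_g**2 * d_k**2 * m
    params = n * d_k**2 * m
    return mults, params


def depthwise_separable_cost(d_g, d_k, m, n):
    """Return (multiplications, parameters) for depthwise + pointwise convolution."""
    depthwise_mults = m * d_g**2 * d_k**2   # one D_K x D_K kernel per input channel
    pointwise_mults = n * d_g**2 * m        # N kernels of shape 1 x 1 x M
    params = m * d_k**2 + m * n
    return depthwise_mults + pointwise_mults, params


if __name__ == "__main__":
    # N and D_K follow the transcript's example; D_G and M are assumed values.
    d_g, d_k, m, n = 56, 3, 64, 1024
    std_mults, std_params = standard_conv_cost(d_g, d_k, m, n)
    sep_mults, sep_params = depthwise_separable_cost(d_g, d_k, m, n)
    print(f"multiplication ratio: {sep_mults / std_mults:.3f}")
    print(f"parameter ratio:      {sep_params / std_params:.3f}")
    print(f"1/N + 1/D_K^2:        {1 / n + 1 / d_k**2:.3f}")
```

Because D_G and M cancel in the ratio, changing the assumed values leaves both ratios at 1/N + 1/D_K², which is why the transcript's example only needs N and D_K.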
Original Description
In this video, I talk about depthwise Separable Convolution - A faster method of convolution with less computation power & parameters. We mathematically prove how it is faster, and discuss applications where it is used in modern research.
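The proof mentioned here can be written compactly as follows. This is a reconstruction from the spoken walkthrough rather than a copy of the video's slides, using the transcript's symbols: D_G is the output width and height, D_K the kernel width and height, M the number of input channels, and N the number of output channels.

```latex
\begin{align*}
\text{Standard convolution:}\quad
  & N \cdot D_G^2 \cdot D_K^2 \cdot M \\[4pt]
\text{Depthwise separable:}\quad
  & \underbrace{M \cdot D_G^2 \cdot D_K^2}_{\text{depthwise}}
    + \underbrace{N \cdot D_G^2 \cdot M}_{\text{pointwise}}
    = M \cdot D_G^2 \,\bigl(D_K^2 + N\bigr) \\[4pt]
\text{Ratio:}\quad
  & \frac{M \cdot D_G^2 \,\bigl(D_K^2 + N\bigr)}{N \cdot D_G^2 \cdot D_K^2 \cdot M}
    = \frac{1}{N} + \frac{1}{D_K^2} \\[4pt]
\text{Example } (N = 1024,\ D_K = 3):\quad
  & \frac{1}{1024} + \frac{1}{9} \approx 0.112
\end{align*}
```

The parameter counts, N · M · D_K² for standard convolution and M · (D_K² + N) for depthwise separable convolution, give the same ratio.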
If you liked that video, hit that like button. If you wanna stick around, hit that subscribe button. If you really wanna stick around, hit that bell icon next to the subscribe button to be notified of my uploads immediately.
Convolution Neural Networks: https://www.youtube.com/watch?v=m8pOnJxOcqY
REFERENCES
Xception (main paper): https://arxiv.org/pdf/1610.02357.pdf
MobileNets (Efficient CNNs for mobile vision applications): https://arxiv.org/pdf/1704.04861.pdf
One Model To Learn Them All: https://arxiv.org/pdf/1706.05137v1.pdf
Music: https://www.bensound.com/royalty-free-music/track/tenderness
Playlist
Uploads from CodeEmporium · CodeEmporium · 8 of 60