Depthwise Separable Convolution - A FASTER CONVOLUTION!

CodeEmporium · Advanced · 📄 Research Papers Explained · 8y ago

Skills: Research Methods 90% · Reading ML Papers 80% · Paper Reproduction 70%

Key Takeaways

The video discusses depthwise separable convolution, a faster form of convolution that needs less computation and fewer parameters, and its applications in modern research, including neural network architectures like MobileNets, Xception and MultiModel networks.

Full Transcript

Convolution is a measure of overlap between two functions as one slides over the other. Mathematically, it's a sum of products. The standard convolution operation is slow to perform; however, we can speed this up with an alternative method that is the topic of this video: depthwise separable convolution.

Let's first very quickly go over the basics of convolution on an input volume. Consider an input volume F of shape D_F × D_F × M, where D_F is the width and height of the input volume and M is the number of input channels. If a color image was the input, then M would be equal to 3, for the R, G and B channels. We apply convolution with a kernel K of shape D_K × D_K × M. This will give us an output of shape D_G × D_G × 1. If we apply N such kernels on the input, then we get an output volume G of shape D_G × D_G × N. The convolution operation takes the sum of products of the input and the kernel to return a scalar. This operation is continued by sliding the kernel over the input. I've explained this concept in detail in my video on Convolution Neural Networks; check that out for a clear understanding.

I'm more concerned now with the cost of this convolution operation, so let's take a look at that. We can measure the computation required for convolution by looking at the number of multiplications required. Why is that? It's because multiplication is an expensive operation relative to addition. So let's determine the number of multiplications for one convolution operation. The number of multiplications is the number of elements in the kernel, so that would be D_K × D_K × M multiplications. But we slide this kernel over the input: we perform D_G convolutions along the width and D_G convolutions along the height, and hence D_G × D_G convolutions overall. So the number of multiplications in the convolution of one kernel over the entire input F is D_G² × D_K² × M. Now, this is for just one kernel, but if we have N such kernels, that makes the
absolute total number of multiplications become N × D_G² × D_K² × M.

Let's now take a look at depthwise separable convolutions. In standard convolution, the application of filters across all input channels and the combination of these values are done in a single step. Depthwise separable convolution, on the other hand, breaks this down into two parts. The first is depthwise convolution, which performs the filtering stage, and the second is pointwise convolution, which performs the combining stage.

Let's get into some details here. Depthwise convolution applies convolution to a single input channel at a time. This is different from standard convolution, which applies convolution to all channels at once. Let us take the same input volume F to understand this process. F has a shape D_F × D_F × M, where D_F is the width and height of the input volume and M is the number of input channels, like I mentioned before. For depthwise convolution we use filters, or kernels, K of shape D_K × D_K × 1. Here D_K is the width and height of the square kernel, and it has a depth of 1 because this convolution is applied to only one channel, unlike standard convolution, which is applied throughout the entire depth. Since we apply one kernel to a single input channel, we require M such D_K × D_K × 1 kernels over the entire input volume F. For each of these M convolutions we end up with an output of shape D_G × D_G × 1. Stacking these outputs together, we have an output volume G of shape D_G × D_G × M. This is the end of the first phase, that is, the end of depthwise convolution.

This is succeeded by pointwise convolution. Pointwise convolution involves performing a linear combination of each of these layers. Here the input is the volume of shape D_G × D_G × M. The filter K_PC has a shape 1 × 1 × M. This is basically a 1 × 1 convolution operation over all M layers. The output will thus have the same width
and height as the input, D_G × D_G, for each filter. Assuming that we want to use N such filters, the output volume becomes D_G × D_G × N. So that's great, we've got this down.

Now let's take a look at the complexity of this convolution. We can split this into two parts, as we have two phases. First, we compute the number of multiplications in depthwise convolution. Here the kernels have a shape D_K × D_K × 1, so the number of multiplications in one convolution operation is D_K × D_K, or D_K². When applied over the entire input channel, this convolution is performed D_G × D_G times, so the number of multiplications for the kernel over the input channel becomes D_G² × D_K². Such multiplications are applied over all M input channels; for each channel we have a different kernel, and hence the total number of multiplications in the first phase, depthwise convolution, is M × D_G² × D_K².

Next, we compute the number of multiplications in the second phase, pointwise convolution. Here the kernels have a shape 1 × 1 × M, where M is the depth of the input volume, and hence the number of multiplications for one instance of convolution is M. This is applied to the entire output of the first phase, which has a width and height of D_G, so the total number of multiplications for this kernel is D_G × D_G × M. For N kernels we will have N × D_G × D_G × M such multiplications.

Thus the total number of multiplications is the sum of the multiplications in the depthwise convolution stage and the pointwise convolution stage: M × D_G² × D_K² + N × D_G² × M. We can take M × D_G² common, giving M × D_G² × (D_K² + N). Now, comparing standard convolution with depthwise separable convolution, we get the ratio as the sum of the reciprocal of the depth of the output volume, N, and the reciprocal of the squared dimension of the kernel, D_K: that is, 1/N + 1/D_K². To put into perspective how effective depthwise separable convolution is, let us take
an example. Consider an output feature volume of depth N = 1024 and a kernel of size 3, that is, D_K equal to 3. Plugging these values into the relation, we get 0.112. In other words, standard convolution has about 9 times as many multiplications as depthwise separable convolution. That is a lot of computing power.

We can also quickly compare the number of parameters in both convolutions. In standard convolution, each kernel has D_K × D_K × M learnable parameters. Since there are N such kernels, there are N × M × D_K² parameters. For depthwise separable convolution, we'll split this once again into two parts. In the depthwise convolution phase we use M kernels of shape D_K × D_K. In pointwise convolution we use N kernels of shape 1 × 1 × M. So the total is M × D_K² + M × N, or, taking M common, M × (D_K² + N). Taking the ratio, we get the same ratio as we did for the computational power required.

So we've understood exactly what depthwise separable convolution is, and also its computational cost with respect to traditional standard convolution. But where exactly has this been used? Well, there are some very interesting papers here.

The first is on MultiModel neural networks. These are networks designed to solve multiple problems using a single network. A MultiModel network has four parts. The first is modality nets, to convert different input types to a universal internal representation. Then we have an encoder to process inputs, a mixer to encode inputs with previous outputs, and a decoder to generate outputs. A fundamental component of each of these parts is depthwise separable convolution. It works effectively in such large networks.

Next up we have Xception, a convolutional neural network architecture based entirely on depthwise separable convolution layers. It has shown state-of-the-art performance on large datasets like Google's JFT image dataset, a repository of 350 million images with 17,000 class labels. To put this into
perspective, the popular ImageNet took 3 days to train. However, training on even a subset of this JFT dataset took a month, and it didn't even converge. In fact, it would have taken approximately three months to converge had they let it run to its full length. So that's useful. This paper is pushing convolutional neural networks to use depthwise separable convolution as the de facto standard.

Third, we have MobileNets, a neural network architecture that strives to minimize the latency of smaller-scale networks so that computer vision applications run well on mobile devices. MobileNets use depthwise separable convolutions in a 28-layer architecture. This paper compares the performance of MobileNets with standard (full) convolution layers versus depthwise separable convolution layers. It turns out the accuracy on ImageNet drops by only about 1% while using significantly fewer parameters: from 29.3 million, the number of parameters is down to just 4.2 million. We can see that Mult-Adds, the number of multiplications and additions, which is a direct measure of computation, has also significantly decreased for the depthwise separable MobileNet.

So here are some things to remember from this video. First, depthwise separable convolution decreases the computation and number of parameters when compared to standard convolution. Second, depthwise separable convolution is a combination of a depthwise convolution followed by a pointwise convolution. Depthwise convolution is the filtering step, and pointwise convolution can be thought of as the combination step. Finally, they have been successfully implemented in neural network architectures like MultiModel networks, Xception and MobileNets.

And that's all I have for you now. Thank you all for stopping by today. If you liked the video, hit that like button. If you want to stick around, hit that subscribe button. If you really want to stick around, hit that bell icon next to the subscribe button so as to be notified of my uploads immediately. Links
to important papers are down below, so check them out. Have a good day, and I'll see you in the next one. Bye!
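For reference, the multiplication counts derived in the transcript can be written compactly as below. This is only a restatement in the transcript's own symbols (D_G for the output width and height, D_K for the kernel size, M input channels, N output channels); the layout is an editorial sketch, not something shown in the source.

```latex
\begin{align*}
\text{Standard convolution:} \quad & N \cdot D_G^2 \cdot D_K^2 \cdot M \\
\text{Depthwise stage:} \quad & M \cdot D_G^2 \cdot D_K^2 \\
\text{Pointwise stage:} \quad & N \cdot D_G^2 \cdot M \\
\text{Ratio (separable / standard):} \quad &
  \frac{M D_G^2 D_K^2 + N D_G^2 M}{N D_G^2 D_K^2 M}
  = \frac{M D_G^2 \,(D_K^2 + N)}{N D_G^2 D_K^2 M}
  = \frac{1}{N} + \frac{1}{D_K^2}
\end{align*}
```

With N = 1024 and D_K = 3, the ratio is 1/1024 + 1/9 ≈ 0.112, which matches the figure quoted in the transcript.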

Original Description

In this video, I talk about depthwise Separable Convolution - A faster method of convolution with less computation power & parameters. We mathematically prove how it is faster, and discuss applications where it is used in modern research. If you liked that video, hit that like button. If you wanna stick around, hit that subscribe button. If you really wanna stick around, hit that bell icon next to the subscribe button to be notified of my uploads immediately.

Convolution Neural Networks: https://www.youtube.com/watch?v=m8pOnJxOcqY

REFERENCES
Xception (main paper): https://arxiv.org/pdf/1610.02357.pdf
Mobile Nets (Efficient CNN for mobile vision applications): https://arxiv.org/pdf/1704.04861.pdf
One model Learns all: https://arxiv.org/pdf/1706.05137v1.pdf

Music at: https://www.bensound.com/royalty-free-music/track/tenderness

This video teaches Depthwise Separable Convolution, a faster method of convolution with less computation power and parameters, and its applications in modern research, including neural network architectures like MobileNets and multi-model networks. The video provides a mathematical proof of the efficiency of Depthwise Separable Convolution and discusses its implementation in various neural network architectures. By watching this video, viewers can gain a deeper understanding of Convolutional Neu

Key Takeaways

Understand the standard convolution operation
Learn about Depthwise Separable Convolution
Break down the convolution operation into two parts: depthwise convolution and pointwise convolution
Apply Depthwise Separable Convolution to neural network architectures
Evaluate the performance of Depthwise Separable Convolution
Implement Depthwise Separable Convolution in various neural network architectures
Analyze research papers on Depthwise Separable Convolution
Reproduce research results on Depthwise Separable Convolution
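The two stages named in the takeaways above can be sketched in plain Python. This is a minimal illustration, not the implementation used in any of the referenced papers: it assumes stride 1, no padding, no bias terms, and nested lists instead of a tensor library, and the function names (`depthwise_conv`, `pointwise_conv`, `depthwise_separable_conv`) are hypothetical. As in most deep learning frameworks, the kernel is not flipped.

```python
def depthwise_conv(x, kernels):
    """Filtering stage: each input channel gets its own D_K x D_K kernel.

    x:       M channels, each a D_F x D_F list of lists.
    kernels: M kernels, each a D_K x D_K list of lists.
    Returns M channels, each D_G x D_G with D_G = D_F - D_K + 1
    (stride 1 and no padding are assumptions of this sketch).
    """
    out = []
    for channel, kernel in zip(x, kernels):
        d_f, d_k = len(channel), len(kernel)
        d_g = d_f - d_k + 1
        out.append([
            [
                sum(channel[i + a][j + b] * kernel[a][b]
                    for a in range(d_k) for b in range(d_k))
                for j in range(d_g)
            ]
            for i in range(d_g)
        ])
    return out


def pointwise_conv(x, weights):
    """Combining stage: N kernels of shape 1 x 1 x M mix the M channels.

    x:       M channels, each D_G x D_G.
    weights: N rows of M scalars, one row per output channel.
    Returns N channels, each D_G x D_G.
    """
    d_g = len(x[0])
    return [
        [
            [sum(w * x[c][i][j] for c, w in enumerate(row)) for j in range(d_g)]
            for i in range(d_g)
        ]
        for row in weights
    ]


def depthwise_separable_conv(x, kernels, weights):
    """Depthwise (filtering) stage followed by pointwise (combining) stage."""
    return pointwise_conv(depthwise_conv(x, kernels), weights)


# Toy input: M = 2 channels of size 3 x 3, D_K = 2, N = 3 output channels.
x = [
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    [[1, 0, 1], [0, 1, 0], [1, 0, 1]],
]
kernels = [
    [[1, 0], [0, 1]],  # applied to channel 0 only
    [[1, 1], [1, 1]],  # applied to channel 1 only
]
weights = [[1, 0], [0, 1], [1, 1]]  # N = 3 pointwise kernels of shape 1 x 1 x M

y = depthwise_separable_conv(x, kernels, weights)
print(len(y), len(y[0]), len(y[0][0]))  # 3 2 2  (N, D_G, D_G)
print(y[2])  # [[8, 10], [14, 16]]
```

The first two pointwise kernels simply pass through one depthwise output each, so `y[0]` and `y[1]` show the filtering stage in isolation, while `y[2]` shows the channels being combined.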

💡 Depthwise Separable Convolution decreases computation and number of parameters compared to standard convolution, making it a more efficient method for convolutional neural networks.
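As a rough numerical check of this claim, the multiplication and parameter counts from the transcript can be computed directly. The sketch below uses the transcript's formulas; N = 1024 and D_K = 3 come from its worked example, while D_G = 14 and M = 512 are illustrative assumptions (both cancel in the ratios), and bias terms are ignored. The function names are hypothetical.

```python
def standard_conv_cost(d_g: int, d_k: int, m: int, n: int) -> tuple[int, int]:
    """Return (multiplications, parameters) for a standard convolution."""
    mults = n * d_g**2 * d_k**2 * m   # N * D_G^2 * D_K^2 * M
    params = n * m * d_k**2           # N kernels of shape D_K x D_K x M
    return mults, params


def depthwise_separable_cost(d_g: int, d_k: int, m: int, n: int) -> tuple[int, int]:
    """Return (multiplications, parameters) for depthwise + pointwise stages."""
    mults = m * d_g**2 * d_k**2 + n * d_g**2 * m  # depthwise + pointwise
    params = m * d_k**2 + m * n                   # M * (D_K^2 + N)
    return mults, params


# N and D_K follow the transcript; D_G and M are assumed for illustration.
d_g, d_k, m, n = 14, 3, 512, 1024
std_mults, std_params = standard_conv_cost(d_g, d_k, m, n)
sep_mults, sep_params = depthwise_separable_cost(d_g, d_k, m, n)

print(round(sep_mults / std_mults, 3))    # 0.112
print(round(sep_params / std_params, 3))  # 0.112
print(std_params, sep_params)             # 4718592 528896
```

Both ratios equal 1/N + 1/D_K², so for a 3 × 3 kernel the savings approach a factor of 9 as N grows.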
