Batch normalization | What it is and how to implement it

AssemblyAI · Intermediate ·🧬 Deep Learning ·4y ago

Key Takeaways

The video discusses Batch Normalization, a technique used to deal with unstable gradients in neural networks, and how to implement it using Keras and Python, highlighting its benefits in reducing overfitting and improving training speed.

Full Transcript

Wouldn't it be amazing to have a way of dealing with the unstable gradients problem in our neural networks while also making the network train a little bit faster, and maybe even dealing with the overfitting problem at the same time? Well, if you want that, you're in the right place, because today we're talking about batch normalization. This video is part of the Deep Learning Explained series by AssemblyAI, a company building a state-of-the-art speech-to-text API. If you want to try it out, you can get a free API token using the link in the description.

We will first talk about how batch normalization works under the hood, then about some benefits of batch normalization and why we would choose to use it, and lastly I will show you how to implement it in Python code using Keras.

The first thing that I want to clarify is the definition of normalization. You might have heard in a lot of different places that you need to normalize your input before you feed it to your neural network, or that you need to standardize your input before you feed it to your neural network. These terms are sometimes used interchangeably and they're not strictly defined, so let's define them here, just so you know what I mean when I say normalization or standardization. Normalization is collapsing the input range to be between 0 and 1, whereas standardization is changing your values so that their mean equals 0 and their variance (or standard deviation) equals 1.
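A minimal NumPy sketch of the two definitions above. The sample values and the assumed 0 to 100 range are illustrative choices made for this sketch, and the variable names are hypothetical.

```python
import numpy as np

# Illustrative values; the known 0-100 range is an assumption of this sketch.
values = np.array([20.0, 70.0, 90.0])
range_min, range_max = 0.0, 100.0

# Normalization (min-max scaling): map the known range onto [0, 1].
normalized = (values - range_min) / (range_max - range_min)

# Standardization: subtract the mean and divide by the standard deviation,
# so the result has mean 0 and variance 1.
standardized = (values - values.mean()) / values.std()

print(normalized)
print(standardized.mean(), standardized.var())
```

Normalization depends on the range you choose, while standardization depends only on the statistics of the values themselves.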
As a little visual, let's say we had some values that go from 0 to 100, say 20, 70, and 90. After we normalize them they're going to be between 0 and 1, so they'll be 0.2, 0.7, and 0.9, still keeping the ratio that they had to each other. For standardization, if we had something similar, we're going to change these values so that when we put them in a distribution their mean is at zero and most of the values are between -1 and 1, and as you go further out in the distribution you see fewer and fewer of these values.

So why do we need any sort of normalization for our neural networks to begin with? Let's take this example. Let's say we have a neural network and we want to feed data into it. Our two features are the number of phones that were ever owned by someone and the amount of money they've withdrawn from the ATM today. As you can see, they have very different ranges: one of them goes from 2 to 24 and the other goes from 0 to 1,000. If you feed your network the unnormalized version of this data, you're going to make it very hard for your network to learn the optimal weight values that minimize the cost, or the error. In turn, you will also cause your network to have weights that are very different from each other: the weight that we multiply the number of phones with is going to be very different from the weight that we multiply the amount of money withdrawn with. So you might cause your network to be unstable and, in the end, have a network with the vanishing or exploding gradients problem.

What we normally do to overcome this problem is, first of all, of course, normalize our inputs, and we also try to use the correct weight initialization technique and
also the correct activation function that goes with this weight initialization technique. But even if you do everything correctly, the unstable gradients problem might come back later in the training. There is one solution that could save the day, and that's batch normalization.

What we do with batch normalization is, instead of only normalizing our inputs and then feeding the data into our network, we normalize the outputs of all the layers in our network. In this diagram you can see we have our network, and in between each pair of layers we have a batch normalization layer. What it does is basically normalize our data, do a small trick on top of it, and then feed the output from the previous layer to the next layer.

Let's see how that works in this small example. Let's say we have six data points that go from 3 to 24: 3, 5, 8, 9, 11, and 24. What batch normalization does first is standardize them, based on what we were talking about earlier (you can call it normalization too). It makes sure that their mean is zero and their variance is one, so it recalculates them and puts them in the correct place. But this is not the end of what batch normalization does. It also scales and offsets these values by some amount that is determined by the training process.

As you can see here, in the formula from the last step of batch normalization, we have the values that have already been standardized. We multiply them by some value, which is called the scale, and we add another value to them, which is called the offset. These two values are trainable parameters. They're not hyperparameters; we're not going to determine them before the training starts. These are going to be learned like any
other parameter in the network, like the weights and the biases. So what would it look like if I scale these values that I have right now? If I scale them by two, it's basically multiplying them by two. What else you can do is offset them. If you offset them by 0.5, it's going to look like this: it basically slides them a little bit along the axis that they're on. So this is what batch normalization does to find a good transformation, one that works for these data points, to help the network overcome the unstable gradients problem. And in turn it actually makes the network train a little bit faster.

Well, you might say, how does that work? There are so many extra calculations that we need to do in between the hidden layers, so how do we end up with a network that trains faster? And you're right. When you're training a network that has batch normalization, every epoch takes a bit longer than it would have without batch normalization. But in the end, batch normalization helps us achieve the same accuracy as without it in fewer epochs. So the amount of time that we add because of batch normalization is much less than the amount of time that we save because we added it. And, not surprisingly, when you can reach the same accuracy with fewer epochs, you can of course train a little bit more and maybe even achieve better performance.

On top of that, because this is a normalization layer, you do not have to separately normalize or standardize your inputs before feeding them to your neural network if you don't want to. You can just have a batch normalization layer before your first layer, and then effectively your inputs will be normalized, so you can keep everything in one neat package. That's another advantage of using
batch normalization. And lastly, it was seen that batch normalization actually reduces the need for regularization. If you remember, regularization is something we did to deal with overfitting, but with batch normalization you don't even have to do that anymore. Of course, you might need to try this out for your own network and see if that's actually the case, but it was shown to be one of the other benefits of batch normalization.

So that was all that I wanted to say about how batch normalization works and the benefits of batch normalization. Now let's see how we can implement it using Keras and Python. I will show you how it works using the MNIST dataset, the classic example of handwritten digits. Here I'm just importing the libraries that I need and the dataset from Keras, and this is what the data points look like. This is one example of the data. It is a 28 by 28 image, so that means there are 784 pixels. Each of these pixels has a value that goes from 0 to 255; the lower the value, the darker the pixel, and the higher the value, the lighter the pixel. So if you look at this example, this dark one here is probably around 200, whereas this one is probably 70, and the actual fully white ones are going to be 255.
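To make these shapes concrete without downloading MNIST, here is a small sketch that uses a random array as a stand-in for one image. The array `image` is synthetic and hypothetical; only the 28 by 28 shape and the 0 to 255 value range come from the description above.

```python
import numpy as np

# Synthetic stand-in for one image (assumption: real MNIST images share
# this shape and dtype, 28x28 unsigned 8-bit integers).
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(28, 28), dtype=np.uint8)

# Flattening the 28x28 grid gives one vector of 784 pixel values.
flat = image.reshape(-1)

print(image.shape, flat.shape)
print(int(flat.min()), int(flat.max()))
```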
So what we want to do before we feed this dataset to our network is to normalize it. One way of doing it is just dividing all the training values by 255, and then effectively you have a dataset where all the values go from zero to one. Later you feed that data to the network that you created here and train it as you wish.

Let's look at what our network looks like. We have one Flatten layer that takes a 28 by 28 matrix and flattens it into one long list of 784 values, then two hidden layers, one with 300 neurons and the other with 100 neurons, and an output layer with 10 neurons. So what if I wanted to have batch normalization in here? It's very simple, actually. All you have to do is add one layer, one of the predefined Keras layers, called BatchNormalization. You just need to put it in between the two layers where you want it to be. I can also put it here, after the second hidden layer, and now my network has batch normalization.

But, as I said, if you need to normalize your data and you're doing it manually, you can replace that with batch normalization. Before you feed the data into the hidden layers, you just need to have a batch normalization layer. By doing this, after I flatten my input I am putting it through batch normalization, so the values that are fed to the first dense layer are going to be normalized and I do not have to do this manual step anymore. So this is one advantage of using batch normalization: everything is in one place, and you don't have to worry about normalizing separately.

There is one other detail that you should pay attention to while you're implementing batch normalization, and that is deciding whether to put batch normalization before or after an activation function. The authors of the original batch normalization paper spoke favorably about this
technique of using batch normalization before the activation function, but this is something that you might want to try out and decide for yourself whether it works for your specific system and problem. I'll show you how to do that if you want to. When you have a dense layer, the activation function is already included in it; here we specify that it needs to be the ReLU activation function. But if you want, you can have your activation function as a separate layer. If I do this, whatever is output from the batch normalization layer will be fed through an activation function, and I no longer need an activation function in this dense layer. I can do that for the second hidden layer too. Then I am taking the output of a layer through batch normalization first and then through the activation. This is something that people argue can work and might be better for your network.

But there is one other detail that we should look into here, and that is the usage of bias. If you remember, what happens in a dense layer, or a hidden layer, is that we get some sort of input from the previous layer; let's call it x. We have our weights, so we multiply the input with the weights and then we add a bias to it. When we have an activation function built in, we put these values through the activation function, and that is the output of our dense layer. If we strip the activation function out, what is fed to the batch normalization layer is this value, the weighted input plus the bias. But what do we do with batch normalization? We normalize the values, then we scale them, and then we offset them, if you remember. And offsetting is basically the same as a bias: you just add one value. So in the end you do not really need your biases anymore. You can just train the offset to find its optimal value inside your neural network, rather than
having both a bias and an offset. That helps you have a lower number of parameters and also helps you train your network a tad faster. All you have to do is, inside your dense layer, say use_bias=False, because you don't want to use any bias here.

And that's it when it comes to implementing batch normalization. It's very simple: it's just one extra layer that you can add if you're using Keras to build your network. Just realize that you can use it as a normalization layer, without the separate manual normalization that you would otherwise need, and make sure that you decide whether you want to use it before or after the activation function of your layers.

Thanks for watching, and I hope you enjoyed this video. If you liked it, don't forget to give us a like and maybe even subscribe, because we're going to be here every single week. If you have any questions or comments, I would love to see them in the comment section. Also, if you'd like to integrate speech-to-text capabilities into your own projects, you can grab the free API token from AssemblyAI using the link in the description. For now, have a nice day, and I'll see you around.
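Putting the steps from the transcript together, a model along these lines could look like the sketch below. The layer sizes (300, 100, 10) and the ReLU activation come from the video; the softmax output, the use of `keras.Input`, and importing Keras through `tensorflow` are assumptions, since the notebook itself is not reproduced here.

```python
from tensorflow import keras

model = keras.Sequential([
    keras.Input(shape=(28, 28)),
    keras.layers.Flatten(),
    # Batch normalization right after flattening stands in for manual
    # input scaling such as dividing by 255.
    keras.layers.BatchNormalization(),
    # No built-in activation and no bias: the learned offset of the
    # following batch normalization layer takes over the role of the bias.
    keras.layers.Dense(300, use_bias=False),
    keras.layers.BatchNormalization(),
    keras.layers.Activation("relu"),
    keras.layers.Dense(100, use_bias=False),
    keras.layers.BatchNormalization(),
    keras.layers.Activation("relu"),
    # Assumption: softmax output over the 10 digit classes.
    keras.layers.Dense(10, activation="softmax"),
])

model.summary()
```

To place batch normalization after the activation instead, keep `activation="relu"` inside each `Dense` layer and drop the separate `Activation` layers. Which order works better is something to test on your own problem.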

Original Description

In this video, we will learn about Batch Normalization. Batch Normalization is a secret weapon that has the power to solve many problems at once. It is a great tool to deal with the unstable gradients problem, helps deal with overfitting and might even make your models train faster. Get your free speech-to-text API token 👇 https://www.assemblyai.com/?utm_source=youtube&utm_medium=referral&utm_campaign=yt_mis_4 We will first go into what batch normalization is and how it works. Later we will talk about why you might want to use it in your projects and some benefits of it. And lastly, we will learn how to apply Batch Normalization to your models using Python and Keras. Even though it is fairly simple to apply Batch Normalization using Keras, we will touch upon some details that might need extra care.

Batch Normalization is a powerful technique to stabilize gradients in neural networks, reducing overfitting and improving training speed. This video teaches how to implement Batch Normalization using Keras and Python, and how to use it to improve model performance.

Key Takeaways

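Standardize each layer's outputs over the batch so that their mean is 0 and their variance is 1
Scale and offset the standardized values with two trainable parameters
Optionally add a batch normalization layer before the first hidden layer instead of normalizing the inputs manually
Add batch normalization layers between hidden layers, for example after the second hidden layer
Try batch normalization both before and after the activation function and keep whichever works better for your problem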
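The first two takeaways can be sketched in NumPy using the six data points from the video. The scale of 2 and offset of 0.5 are the illustrative values mentioned in the transcript (in a real network they are learned during training), and the small `eps` constant and the function name `batch_norm_forward` are assumptions added here.

```python
import numpy as np

def batch_norm_forward(x, scale, offset, eps=1e-5):
    """Standardize x over the batch, then apply scale and offset."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    # eps guards against division by zero when the variance is tiny.
    x_hat = (x - mean) / np.sqrt(var + eps)
    return scale * x_hat + offset

batch = np.array([3.0, 5.0, 8.0, 9.0, 11.0, 24.0])
out = batch_norm_forward(batch, scale=2.0, offset=0.5)

# After the transform, the batch mean equals the offset and the
# standard deviation is approximately equal to the scale.
print(out.mean(), out.std())
```

This covers only the training-time computation on a single batch; it is a sketch of the idea rather than a replacement for a framework layer.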

💡 Batch Normalization can be used before or after the activation function in a dense layer, and can replace the need for bias in a dense layer, reducing the number of parameters and improving training speed.
