Weight Standardization (Paper Explained)

Yannic Kilcher · Advanced · Computer Vision · 6y ago

Key Takeaways

Weight Standardization is a normalization technique that standardizes the weights of a neural network's convolutional layers rather than its activations. Combined with GroupNorm and trained with one image per GPU, it matches or outperforms BatchNorm trained with large batch sizes.

Full Transcript

Hi there. Today we're looking at "Weight Standardization" by Siyuan Qiao, Huiyu Wang, Chenxi Liu, Wei Shen, and Alan Yuille of Johns Hopkins University. Weight standardization is a normalization technique for training neural networks, and it basically goes in conjunction with another technique called group normalization. If you haven't seen my video on group normalization and don't know what it is, I suggest you go watch that first, or read the group norm paper or some blog post, because weight standardization is usually used together with group norm in order to work well. That's what this paper also says, even though the technique itself is pretty much independent.

Here you can see their main results. They compare batch norm, group norm, and weight standardization used with group norm, and as you can see, they can outperform the other two models in ImageNet top-1 accuracy. The important part is that batch norm is trained with large batch sizes, while group norm and group norm plus weight standardization are trained with one image per GPU. So they have a multi-GPU setup, and this is just one image per GPU. The results over here are on Mask R-CNN, where the model is large and therefore you can only have very small batches per worker, and that means batch norm will work less well.

Again, we've discussed why batch norm is not a good thing when you have to go to small batch sizes. Basically, what people have discovered is that it is very beneficial in machine learning to normalize your data before working with it. What do we mean by that? If you have a bunch of data points, let's say like this, it is usually beneficial to first center the data, so basically calculate its mean and shift it, and then to standardize the axes, so basically you divide it by the standard deviation in each direction
and your data will look something like this. For many classical methods, that will improve the condition number of the problem you have to solve, and so on, and even for deep learning methods we just know that if you standardize your data like this, it works better.

So people have basically come up with these methods where they say: if it helps for the data at the beginning of a neural network, then if the data is kind of out of whack after a layer, which can happen, we should maybe do the same thing before we send it to the next layer. Center and standardize it again, and then send it through. And if after the next layer it's out of whack again, we should center and standardize it again before sending it through the next layer. So at each layer you have these transformations that center and standardize the data. For the longest time this was batch norm. Batch norm does this across the mini-batches of the data, since you can't pass the entire dataset.

Now group norm has come along and replaced batch norm, because batch norm is very dependent on the batch size while group norm isn't. The group norm paper has made it clear that in the large batch size regime, batch norm is still the king. Batch norm still works better; it's only when you go to very small batch sizes that group norm takes over, and that's what you can see here. Okay, it's a bit unfair, because batch norm is trained with a larger batch size. I was going to say that group norm would land in the same place even if it were trained with the large batch size, but that is not the case, because the batch size still influences the gradient stochasticity and so on. Still, batch norm is better than group norm, as you can see here. But over here, where you kind of have to go to small batch sizes, batch norm is all of a sudden worse than group norm. And the
weight standardization is a technique to actually make group norm better than batch norm in any of these settings, even in the large batch regime.

Okay, so we'll now explore weight standardization. In the group norm paper we've looked at the diagram on the left. In batch norm, this axis is the number of data points, so this is your batch; this is the channels of the individual images; and this is the height and width of the image, so the image itself for a single channel. A single channel of an image would be a column in this thing right here. Batch norm normalizes across the data points in a single channel. Layer norm, which is a precursor to group norm, normalizes only within a single data point, but across all of the channels, as you can see here. That frees it from its dependence on the batch size, since each data point is treated individually, but of course it sort of lumps all the channels together; it doesn't distinguish them. Instance norm, down here, tries to fix this by saying it was a good idea to normalize each feature individually, and takes it to the extreme: it normalizes a single image by each of these single features. But that loses too much information. Group norm comes and says maybe some of the features naturally depend on each other and naturally exhibit the same responses, so we should normalize them in groups. We still take a single image, but we take groups of channels together, in this case groups of three, and normalize across that.

Now, this is all in data space. This all normalizes the data, like we said up here when we drew this, before passing it through the next layer. So what actually happens in these layers? What happens in a convolutional neural network is that the images get convolved with kernels; that's what a neural network layer is. So say you have an image right here of our trusty cat. I haven't drawn
whiskers in a while; that nose is very high, the eyes must be up here. Sorry, cat. The layer inherently has these things called kernels. I'm just going to draw one of these kernels right here; it's a 3 by 3 kernel. What you do is slide the kernel across the image, like this, and at each position you convolve the kernel with the image: you multiply the kernel values with the pixels underneath and sum them up. That gives you a new value at each position in the image, and those values are your next layer's data.

In these normalization techniques we usually normalize the data points. Here you have multiple channels, maybe a red, a green, and a blue, and in the intermediate layers you have even more. But you also have multiple kernels, as you can see here, which then result in multiple output channels. The old normalization methods, batch norm, layer norm, and group norm, all work in the space of data, whereas weight standardization works in kernel space. Weight standardization means you want to normalize the weights of the neural network, not the data. That's why it can be used in conjunction with something like group norm, or actually batch norm or layer norm. It could be used with any of these, but these authors use it in conjunction with group norm.

So what does it do? A convolution kernel is characterized by four numbers. First, there's the height and width of the kernel, which in our case was 3 by 3. Then it is characterized by two more numbers, the in channels and the out channels. The in channels is the number of channels that come into the layer, and the out channels is the number of channels that you want to transform that into. Here you can see the in channels are listed here and the out channels are listed here, and
in the up-down direction, which is not labeled here, is the height and width. So this here would actually be a 2 by 2 kernel. Sorry, what I said at first was wrong: one of these columns is a 2 by 2 filter, and the column behind it and the column next to it are all 2 by 2 filters. For each input-output channel combination you have a 2 by 2 filter, so you have an entire matrix of 2 by 2 filters, if you can imagine that, across the out and across the in direction.

Weight standardization says it might be a good idea to look at the weights for a given output channel. We take one output channel and look at all the filters that transform the input into that one output channel, which is going to be this many times this many times this many numbers. Maybe we should normalize all of these so they don't get out of whack. One could imagine that during training, we initialize our filters somewhere here, so take this one number here, which we draw randomly, and then maybe as we train it actually gets very large. That's plausible, because after this neural network layer we have this procedure to recenter the data. So I could make a very large weight here and multiply the data by a very large weight, because it gets recentered anyway. But of course, if my weights get large, I'll basically increase the variance and the instability, and the gradients might be high, and so on. So these authors think it
might be a good idea to normalize these weights. Just as you normalize the data, you'd normalize the weights, and this actually turns out to be fairly easy to do. Usually you transform x, the input to a layer, into y using W, your actual parameter: you just do x times W and that gives you y. This is a convolution operation right here. Now you don't do this. You take W and first subtract the mean of W, computed for a single output channel, and then you divide by the standard deviation of W, and that entire thing you now multiply by x.

Since these are just deterministic operations, you can backpropagate through them. The forward path of the data now looks as follows: my data comes in; I take my layer weights and first center them, then scale them by their standard deviation; then I use that and x to obtain my layer output, and I send that to the next layer.

The backprop signal here is interesting. The backprop signal comes in from here, and you have to backprop through the x times W-hat operation. We know how to do that; it's just backprop through the convolution operation, back to the last layer. Usually when you backprop through the convolution you get two things: the derivative with respect to x and the derivative with respect to the weights W. You send the first one on, and you would update your weights with the second. But now, because W-hat is not your actual parameter of the network, you have to take that particular signal and basically propagate it back through the
standardization and the centering before you can apply the gradient. That's all doable, and modern frameworks will actually do it by themselves. It's just that this introduces two new operations to the forward and to the backprop path that you didn't have before. I can imagine you won't even notice that this is happening, it's so fast.

So the idea is pretty basic, especially since the entire discussion around normalization has already happened. I enjoy that this paper goes into the theory a bit more. They analyze what effect weight standardization has on the Lipschitz constant of the loss, for example, and they also investigate what contributes more, the centering of the weights or the scaling by the standard deviation. They run all these ablations. If we just do group norm, we have this trajectory here. If we run group norm plus equation 5, which is subtracting the mean, you can see the gap between the blue and the orange is quite a bit. If we only divide by the standard deviation, the curves are pretty close together, but there is a difference. If you do both, then again there is a difference to only doing the centering. So they say that even though subtracting the mean probably gives you most of the benefit, since it is so easy you should just do both. Honestly, I think that here in the validation error it makes basically no difference at all.

They do quite a number of these ablations, which I'm not going to go into too much. For the Lipschitz constant of the loss and the Lipschitz constant of the gradients, they basically show that the loss and the gradients are more well behaved when you use this weight standardization technique together with group norm. They also do quite a few experiments where they show that their method outperforms batch norm, especially in
the small batch size regime, and that is something that I absolutely believe.

We actually don't even need to go further down. If you want to read the paper, I invite you to do so; it's a very good paper and I enjoyed reading it. Ultimately, they suggest this new method, and I have seen this one replicated across the community a number of times, so it seems to be a real thing. I would expect that either it fizzles out and the community decides that it's about the same as batch norm and therefore not worth it, or, and that's what I believe, since we are also going in the direction of larger models, which means smaller batches per worker, and since batch norm is generally a pain, this is just going to be rather standard in the future. So I will actually incorporate this, if I can, into my next projects.

That was it from me. If you liked this, consider subscribing and leaving a like on the video. Thank you for listening. If you have any comments, I will very probably read them. Bye bye.
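The forward pass described in the transcript (center each output channel's weights, scale by their standard deviation, then convolve) can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' code: the function names `standardize_weights` and `conv2d`, the tensor shapes, the placement of `eps`, and the use of the population standard deviation are all assumptions made for the example.

```python
import numpy as np

def standardize_weights(w, eps=1e-5):
    """Standardize conv weights of shape (c_out, c_in, k_h, k_w) per output channel.

    eps and its placement (added to the std) are assumptions for numerical safety.
    """
    # Statistics over everything that feeds one output channel: c_in * k_h * k_w numbers
    mean = w.mean(axis=(1, 2, 3), keepdims=True)
    std = w.std(axis=(1, 2, 3), keepdims=True)  # population std (ddof=0), an assumption
    return (w - mean) / (std + eps)

def conv2d(x, w):
    """Naive 'valid' sliding-window convolution (cross-correlation, as in deep learning).

    x: (c_in, height, width), w: (c_out, c_in, k_h, k_w)
    """
    c_out, _, k_h, k_w = w.shape
    _, height, width = x.shape
    out = np.zeros((c_out, height - k_h + 1, width - k_w + 1))
    for i in range(out.shape[1]):
        for j in range(out.shape[2]):
            patch = x[:, i:i + k_h, j:j + k_w]
            # Multiply kernel values with the pixels underneath and sum them up
            out[:, i, j] = (w * patch).sum(axis=(1, 2, 3))
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 8, 8))       # one image with 3 input channels
w = rng.normal(size=(4, 3, 3, 3))    # 4 output channels, 3x3 kernels

y = conv2d(x, standardize_weights(w))
# Rescaling and shifting the raw weights barely changes the layer output,
# because the standardization removes both effects (up to eps).
y_scaled = conv2d(x, standardize_weights(10.0 * w + 2.0))
```

The last two lines illustrate the point about weights "getting out of whack": once the weights are standardized, making the raw parameter ten times larger no longer makes the layer's output larger.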

Original Description

It's common for neural networks to include data normalization such as BatchNorm or GroupNorm. This paper extends the normalization to also include the weights of the network. This surprisingly simple change leads to a boost in performance and, combined with GroupNorm, new state-of-the-art results.

https://arxiv.org/abs/1903.10520

Abstract: In this paper, we propose Weight Standardization (WS) to accelerate deep network training. WS is targeted at the micro-batch training setting where each GPU typically has only 1-2 images for training. The micro-batch training setting is hard because small batch sizes are not enough for training networks with Batch Normalization (BN), while other normalization methods that do not rely on batch knowledge still have difficulty matching the performances of BN in large-batch training. Our WS ends this problem because when used with Group Normalization and trained with 1 image/GPU, WS is able to match or outperform the performances of BN trained with large batch sizes with only 2 more lines of code. In micro-batch training, WS significantly outperforms other normalization methods. WS achieves these superior results by standardizing the weights in the convolutional layers, which we show is able to smooth the loss landscape by reducing the Lipschitz constants of the loss and the gradients. The effectiveness of WS is verified on many tasks, including image classification, object detection, instance segmentation, video recognition, semantic segmentation, and point cloud recognition. The code is available here: this https URL.

Authors: Siyuan Qiao, Huiyu Wang, Chenxi Liu, Wei Shen, Alan Yuille

Links:
YouTube: https://www.youtube.com/c/yannickilcher
Twitter: https://twitter.com/ykilcher
BitChute: https://www.bitchute.com/channel/yannic-kilcher
Minds: https://www.minds.com/ykilcher

Weight Standardization is a technique that standardizes the weights of a neural network's convolutional layers, which the paper reports leads to improved performance and state-of-the-art results when combined with GroupNorm. The presenter expects it to become fairly standard in the future, since larger models mean smaller batches per worker, where Batch Normalization struggles. By understanding Weight Standardization, developers can improve the training of their networks in computer vision tasks.

Key Takeaways

Normalize the weights of a neural network layer by subtracting the mean and dividing by the standard deviation for each output channel
Use the normalized weights to transform the input data and obtain the layer output
Back propagate through the normalization operation to update the weights
Centering weights
Scaling weights
Applying group norm
Back propagating through weight standardization
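
The backpropagation step listed above can be made concrete with a small NumPy sketch. It assumes each output channel's weights are flattened into one row, uses no `eps`, and takes the population standard deviation; the helper names `ws_forward` and `ws_backward` and the toy loss are hypothetical. Frameworks with automatic differentiation would normally derive this gradient on their own, so the sketch is only meant to show what "backpropagating through the standardization" computes, checked here against finite differences.

```python
import numpy as np

def ws_forward(w):
    """Center and scale each row (one row = all weights of one output channel)."""
    mean = w.mean(axis=1, keepdims=True)
    std = w.std(axis=1, keepdims=True)
    return (w - mean) / std, std

def ws_backward(grad_w_hat, w_hat, std):
    """Map the gradient w.r.t. the standardized weights back to the raw weights."""
    g_mean = grad_w_hat.mean(axis=1, keepdims=True)           # from the centering
    proj = (grad_w_hat * w_hat).mean(axis=1, keepdims=True)   # from the scaling
    return (grad_w_hat - g_mean - w_hat * proj) / std

rng = np.random.default_rng(0)
w = rng.normal(size=(3, 10))   # 3 output channels, 10 weights each
c = rng.normal(size=(3, 10))   # coefficients of a toy loss L = sum(c * w_hat)

w_hat, std = ws_forward(w)
grad_w = ws_backward(c, w_hat, std)   # dL/dw_hat equals c for this toy loss

def loss(weights):
    return float((c * ws_forward(weights)[0]).sum())

# Central finite differences as an independent check of the analytic gradient
numeric = np.zeros_like(w)
h = 1e-6
for idx in np.ndindex(*w.shape):
    w_plus, w_minus = w.copy(), w.copy()
    w_plus[idx] += h
    w_minus[idx] -= h
    numeric[idx] = (loss(w_plus) - loss(w_minus)) / (2 * h)

max_err = np.abs(grad_w - numeric).max()
```

In this sketch the resulting gradient sums to zero within each output channel and is orthogonal to the standardized weights, which reflects that shifting or rescaling the raw weights does not change the layer output.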

💡 Weight Standardization normalizes the weights of a neural network, reducing variance and instability. It can in principle be combined with GroupNorm, BatchNorm, or LayerNorm; the paper reports its state-of-the-art results in combination with GroupNorm.
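
Since the data-side partner of Weight Standardization in the paper is GroupNorm, a minimal NumPy sketch of group normalization may help show why it does not depend on the batch size. The function name `group_norm`, the `eps` value, the tensor shapes, and the omission of the learnable scale and shift are assumptions made for brevity.

```python
import numpy as np

def group_norm(x, num_groups, eps=1e-5):
    """Normalize each sample over groups of channels. x: (n, c, height, width)."""
    n, c, height, width = x.shape
    assert c % num_groups == 0, "channels must divide evenly into groups"
    g = x.reshape(n, num_groups, c // num_groups, height, width)
    # Statistics are computed per sample and per group, never across the batch
    mean = g.mean(axis=(2, 3, 4), keepdims=True)
    var = g.var(axis=(2, 3, 4), keepdims=True)
    g = (g - mean) / np.sqrt(var + eps)
    return g.reshape(n, c, height, width)

rng = np.random.default_rng(0)
batch = rng.normal(size=(4, 6, 5, 5))   # 4 images, 6 channels -> 2 groups of 3 channels

out_batch = group_norm(batch, num_groups=2)
out_single = group_norm(batch[:1], num_groups=2)   # the first image on its own

# The first image is normalized identically whether or not the others are present
same = np.allclose(out_batch[:1], out_single)
```

Because no statistic is shared across images, the one-image-per-GPU setting discussed in the video poses no problem for this kind of normalization.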
