Softmax Regression (C2W3L08)

DeepLearningAI · Beginner · 🧬 Deep Learning · 8y ago

Skills: Supervised Learning 90% · ML Maths Basics 80%

Key Takeaways

The video discusses Softmax Regression, a generalization of logistic regression for multi-class classification problems, and its application in neural networks, including the use of the softmax activation function to generate output probabilities.

Full Transcript

So far, the classification examples we've talked about have used binary classification, where you had two possible labels, 0 or 1: is it a cat, or is it not a cat? What if you have multiple possible classes? There's a generalization of logistic regression called softmax regression that lets you make predictions where you're trying to recognize one of C, or one of multiple, classes, rather than just recognize two classes. Let's take a look.

Let's say that instead of just recognizing cats, you want to recognize cats, dogs, and baby chicks. So I'm going to call cats class 1, dogs class 2, and baby chicks class 3. And if it's none of the above, then there's an "other" or "none of the above" class, which I'm going to call class 0. So here's an example of the images and the classes they belong to. That's a picture of a baby chick, so the class is 3. The cat is class 1, a dog is class 2, and I guess that's a koala, so that's none of the above, so that's class 0, then class 3, and so on.

So the notation we're going to use is this: I'm going to use capital C to denote the number of classes you're trying to categorize your inputs into. In this case you have four possible classes, including the "other" or "none of the above" class. So when you have four classes, the numbers indexing your classes would be 0 through capital C minus 1; in other words, 0, 1, 2, or 3. In this case we're going to build a neural network where the output layer has four, or in this case the variable capital C, output units. So n^[L], the number of units in the output layer, which is layer L, is going to be equal to 4, or more generally is going to be equal to C.

What we want is for the units in the output layer to tell us the probability of each of these four classes. So the first node here is supposed to output, or we want it to output, the probability that it is the "other" class given the input x. This one will output the probability there's a cat given x, this will output the probability that it is a dog given x, and that will output the probability of a baby chick (I'm just going to abbreviate baby chick to "bc") given the input x. So here the output y hat is going to be a 4 by 1 dimensional vector, because it now has to output four numbers, giving you these four probabilities. And because probabilities should sum to 1, the four numbers in the output y hat should sum to 1.

The standard model for getting a neural network to do this uses what's called a softmax layer as the output layer in order to generate these outputs. Let me write down the math, and then come back and give some intuition about what the softmax layer is doing. So in the final layer of your network, you are going to compute, as usual, the linear part of the layer. So z^[L], that's the z variable for the final layer (remember, this is layer capital L), and as usual you compute that as W^[L] times the activation of the previous layer, plus the biases for that final layer: z^[L] = W^[L] a^[L-1] + b^[L].

Now, having computed z, you need to apply what's called the softmax activation function. The activation function is a bit unusual for the softmax layer, but this is what it does. First, we're going to compute a temporary variable, which we'll call t, which is e to the z^[L], and this is applied element-wise. So z^[L] here, in our example, is going to be 4 by 1; it's a four-dimensional vector. So t itself, e to the z^[L], is an element-wise exponentiation, and t will also be a 4 by 1 dimensional vector. Then the output a^[L] is going to be basically the vector t, but normalized to sum to 1. So a^[L] is going to be e to the z^[L] divided by the sum, from j equals 1 through 4 because there are four classes, of t subscript j. Another way of saying this is that a^[L] is also a 4 by 1 vector, and the i-th element of this four-dimensional vector, let's write that as a^[L] subscript i, is going to be equal to t_i over the sum of the t_j: a^[L]_i = t_i / Σ_j t_j.

In case this math isn't clear, let's go through a specific example that will make it clearer. Let's say that you compute z^[L], and z^[L] is a four-dimensional vector; let's say it is (5, 2, -1, 3). What we're going to do is use this element-wise exponentiation to compute this vector t. So t is going to be (e^5, e^2, e^-1, e^3). And if you plug that into a calculator, these are the values you get: e^5 is 148.4, e^2 is about 7.4, e^-1 is 0.4, and e^3 is 20.1. The way we go from the vector t to the vector a^[L] is to just normalize these entries to sum to 1. So if you sum up the elements of t, if you just add up those four numbers, you get 176.3. So finally, a^[L] is just going to be this vector t, as a vector, divided by 176.3.

So for example, this first node here will output e^5 divided by 176.3, and that turns out to be 0.842. So we're saying that for this image, if this is the value of z you get, the chance of it being class 0 is 84.2%. The next node outputs e^2 over 176.3; that turns out to be 0.042, so a 4.2% chance. The next one is e^-1 over that, which is 0.002. And the final one is e^3 over that, which is 0.114, so there's an 11.4% chance that this is class number 3, which I guess is the baby chick class. So those are the chances of it being class 0, class 1, class 2, and class 3. The output of the neural network, a^[L], which is also y hat, is a 4 by 1 vector where the elements are going to be these four numbers that we just computed. So this algorithm takes the vector z^[L] and maps it to four probabilities that sum to 1.

If we summarize what we just did to map from z^[L] to a^[L], this whole computation of using the exponentiation to get the temporary variable t and then normalizing, we can summarize this into a softmax activation function and say a^[L] equals the activation function g applied to the vector z^[L]. The unusual thing about this particular activation function is that g takes as input a 4 by 1 vector and outputs a 4 by 1 vector. Previously, our activation functions used to take in a single real-valued input; for example, the sigmoid and the ReLU activation functions input a real number and output a real number. The unusual thing about the softmax activation function is that, because it needs to normalize across the different possible outputs, it needs to take in a vector as input and output a vector.

To show some of the things that a softmax classifier can represent, I'm going to show you some examples where you have inputs x1 and x2, and these feed directly to a softmax layer that has three or four or more output nodes that then output y hat. So I'm going to show you a neural network with no hidden layer, and all it does is compute z^[1] = W^[1] x + b^[1], and then the output a^[1], or y hat, is just the softmax activation function applied to z^[1]. So this neural network with no hidden layers should give you a sense of the types of things a softmax function can represent.

Here's one example with just raw inputs x1 and x2. A softmax layer with C = 3 output classes can represent this type of decision boundary. Notice that this is several linear decision boundaries, but this allows it to separate out the data into three classes. In this diagram, what we did was we actually took the training set that's shown in this figure and trained the softmax classifier with the three output labels on the data. The color on this plot shows the result of thresholding the output of the softmax classifier and coloring in the input based on which one of the three outputs had the highest probability. So you can maybe see that this is like a generalization of logistic regression, with sort of linear decision boundaries, but with more than two classes: instead of the class being just 0 or 1, the class can be 0, 1, or 2.

Here's another example of the decision boundary that a softmax classifier represents when trained on a data set with three classes. And here's another one. One intuition is that the decision boundary between any two classes will be linear. That's why you see, for example, that the decision boundary between the yellow and the gray classes is a linear boundary, the boundary between the purple and the red is another linear decision boundary, and the one between the purple and the yellow is another linear decision boundary. But it is able to use these different linear functions in order to separate the space into three classes.

Let's look at some examples with more classes. This is an example with C = 4, so there's the green class as well, and softmax can continue to represent these types of linear decision boundaries between multiple classes. Here's one more example with C = 5 classes, and here's one last example with C = 6. So this shows the type of things the softmax classifier can do when there is no hidden layer. Of course, if you have a much deeper neural network, with x and then some hidden units, and then more hidden units, and so on, then you can learn even more complex nonlinear decision boundaries to separate out multiple different classes.

So I hope this gives you a sense of what a softmax layer, or the softmax activation function, in a neural network can do. In the next video, let's take a look at how you can train a neural network that uses a softmax layer.
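The worked example from the transcript (z^[L] = (5, 2, -1, 3)) can be reproduced with a few lines of code. This is a minimal sketch in plain Python; the function name `softmax` and the variable names `z_L` and `a_L` are chosen here for illustration and are not from the video.

```python
import math

def softmax(z):
    """Exponentiate each entry of z, then normalize so the entries sum to 1."""
    t = [math.exp(z_i) for z_i in z]      # temporary vector t = e^z, element-wise
    total = sum(t)                        # normalizing constant: sum of all t_j
    return [t_i / total for t_i in t]     # a_i = t_i / sum_j t_j

# The example vector from the transcript.
z_L = [5.0, 2.0, -1.0, 3.0]
a_L = softmax(z_L)

for class_index, probability in enumerate(a_L):
    print(f"class {class_index}: {probability:.3f}")
print(f"sum of probabilities: {sum(a_L):.3f}")
```

Rounded to three decimals, this prints 0.842, 0.042, 0.002, and 0.114 for classes 0 through 3, matching the values quoted in the transcript, and the probabilities sum to 1.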

Original Description

Take the Deep Learning Specialization: http://bit.ly/2xdG0Et Check out all our courses: https://www.deeplearning.ai Subscribe to The Batch, our weekly newsletter: https://www.deeplearning.ai/thebatch Follow us: Twitter: https://twitter.com/deeplearningai_ Facebook: https://www.facebook.com/deeplearningHQ/ Linkedin: https://www.linkedin.com/company/deeplearningai



This video teaches the basics of Softmax Regression, including its application in multi-class classification problems and its use in neural networks. It covers the softmax activation function and how it generates output probabilities. By watching this video, viewers can learn how a softmax output layer lets a neural network perform multi-class classification; training the softmax classifier is covered in the next video.

Key Takeaways

Build a new network with an output layer that has C output units
Compute the linear part of the final layer as z^[L] = W^[L] a^[L-1] + b^[L]
Use the softmax activation function to compute a vector of probabilities that sum to 1
Train a softmax classifier with multiple classes
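
Written out as equations, the computation in the final layer looks roughly as follows. The bracketed layer superscripts are an assumed rendering of the symbols spoken in the video, and the entries are indexed 1 through C to match the transcript's "j equals one through four".

```latex
z^{[L]} = W^{[L]} a^{[L-1]} + b^{[L]}, \qquad
t = e^{z^{[L]}} \quad (\text{element-wise}), \qquad
a^{[L]}_i = \frac{t_i}{\sum_{j=1}^{C} t_j}, \quad i = 1, \dots, C
```

Because every t_i is positive and the denominator is the sum of all of them, each a^{[L]}_i lies strictly between 0 and 1 and the C entries sum to 1.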

💡 A softmax layer with no hidden layer separates the input space into multiple classes, with a linear decision boundary between any two classes; at the output of a deeper neural network with hidden layers, it can learn more complex nonlinear decision boundaries
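
The point about linear boundaries can be illustrated with a small sketch of a softmax classifier that has no hidden layer. The weights `W` and biases `b` below are hypothetical values picked for illustration (they are not trained and not from the video), as are the helper names `predict` and `pairwise_score`. The sketch shows that class i gets a higher probability than class j exactly when z_i - z_j, a linear function of x, is positive.

```python
import math

def softmax(z):
    """Element-wise exponentiation followed by normalization to sum to 1."""
    t = [math.exp(v) for v in z]
    total = sum(t)
    return [v / total for v in t]

# Hypothetical parameters: C = 3 classes, 2 input features (x1, x2).
W = [[ 1.0,  0.0],
     [-1.0,  1.0],
     [ 0.0, -1.0]]
b = [0.0, 0.5, -0.5]

def predict(x):
    """Compute z = W x + b, apply softmax, and return (most likely class, y_hat)."""
    z = [sum(w * x_k for w, x_k in zip(row, x)) + b_i for row, b_i in zip(W, b)]
    y_hat = softmax(z)
    label = max(range(len(y_hat)), key=lambda i: y_hat[i])
    return label, y_hat

def pairwise_score(i, j, x):
    """z_i - z_j written as a single linear function of x."""
    dw = [w_i - w_j for w_i, w_j in zip(W[i], W[j])]
    return sum(d * x_k for d, x_k in zip(dw, x)) + (b[i] - b[j])

for x in ([2.0, 1.0], [-1.0, 2.0], [0.0, -3.0]):
    label, y_hat = predict(x)
    # Class i beats class j exactly when the linear score is positive.
    for i in range(3):
        for j in range(3):
            if i != j:
                assert (y_hat[i] > y_hat[j]) == (pairwise_score(i, j, x) > 0)
    print(x, "-> class", label)
```

With these assumed parameters the three points fall in classes 0, 1, and 2 respectively. Since the exponential is increasing and the normalizing sum is shared by all outputs, comparing two output probabilities reduces to comparing the two linear scores, which is why each pairwise boundary is a straight line.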
