Softmax Regression (C2W3L08)
Key Takeaways
The video discusses Softmax Regression, a generalization of logistic regression for multi-class classification problems, and its application in neural networks, including the use of the softmax activation function to generate output probabilities.
Full Transcript
So far, the classification examples we've talked about have used binary classification, where you had two possible labels, 0 or 1: is it a cat, or is it not a cat? What if you have multiple possible classes? There's a generalization of logistic regression called softmax regression that lets you make predictions where you're trying to recognize one of C classes, or one of multiple classes, rather than just recognize two classes. Let's take a look.

Let's say that instead of just recognizing cats, you want to recognize cats, dogs, and baby chicks. So I'm going to call cats class 1, dogs class 2, and baby chicks class 3, and if it's none of the above, then there's an other, or none-of-the-above, class, which I'm going to call class 0. So here's an example of the images and the classes they belong to. That's a picture of a baby chick, so the class is 3. Cats is class 1, a dog is class 2. That, I guess, is a koala, so that's none of the above, so that's class 0. Class 3, and so on.

So the notation we're going to use is, I'm going to use capital C to denote the number of classes you're trying to categorize your inputs into, and in this case you have four possible classes, including the other, or none-of-the-above, class. So when you have four classes, the numbers indexing your classes would be 0 through capital C minus 1, so in other words that would be 0, 1, 2, or 3.

In this case we're going to build a neural network where the output layer has 4, or in this case the variable capital C, output units. So n^[L], the number of units in the output layer, which is layer L, is going to be equal to 4, or more generally is going to be equal to C. And what we want is for the units in the output layer to tell us what is the probability of each of these four classes. So the first node here is supposed to output, or we want it to output, the probability that it is the other class given the input x, P(other | x). This will output the probability there's a cat given x, P(cat | x). This will output the probability that it is a dog given x, P(dog | x). That will output the probability, I'm
just going to abbreviate baby chick to BC, so P(BC | x), the probability of a baby chick, abbreviated BC, given the input x. So here the output label ŷ is going to be a 4 by 1 dimensional vector, because it now has to output four numbers, giving you these four probabilities. And because probabilities should sum to one, the four numbers in the output ŷ should sum to one.

The standard model for getting a neural network to do this uses what's called a softmax layer in the output layer in order to generate these outputs. Let me write down the math, and then come back and give some intuition about what the softmax layer is doing.

So in the final layer of the neural network, you are going to compute, as usual, the linear part of the layer. So z^[L], that's the z variable for the final layer; remember, this is layer capital L. As usual you compute that as W^[L] times the activation of the previous layer, plus the biases for that final layer: z^[L] = W^[L] a^[L-1] + b^[L].

Now, having computed z, you need to apply what's called the softmax activation function. This activation function is a bit unusual for the softmax layer, but this is what it does. First, we're going to compute a temporary variable, which we'll call t, which is e to the z^[L]: t = e^(z^[L]). This is applied element-wise. So z^[L] here, in our example, is going to be 4 by 1, a four-dimensional vector, so t itself, e to the z^[L], being an element-wise exponentiation, will also be a 4 by 1 dimensional vector. Then the output a^[L] is going to be basically the vector t, but normalized to sum to 1. So a^[L] is going to be e to the z^[L] divided by the sum from j equals 1 through 4, because there are four classes, of t_j: a^[L] = e^(z^[L]) / (t_1 + t_2 + t_3 + t_4). Another way of saying this is that a^[L] is also a 4 by 1 vector, and the i-th element of this four-dimensional vector, let's write that a^[L]_i, is going to be equal to t_i over the sum of the t_j.

In case this math isn't clear, let's go through a specific example that will make
this clearer. Let's say that you've computed z^[L], and z^[L] is a four-dimensional vector; let's say it's [5, 2, -1, 3]. What we're going to do is use this element-wise exponentiation to compute this vector t. So t is going to be [e^5, e^2, e^-1, e^3], and if you plug that into a calculator, these are the values you get: e^5 is 148.4, e^2 is about 7.4, e^-1 is 0.4, and e^3 is 20.1. And so the way we go from the vector t to the vector a^[L] is to just normalize these entries to sum to 1. So if you sum up the elements of t, if you just add up those four numbers, you get 176.3. So finally, a^[L] is just going to be this vector t, as a vector, divided by 176.3.

So for example, this first node here will output e^5 divided by 176.3, and that turns out to be 0.842. So we're saying that for this image, if this is the value of z you get, the chance of it being class 0 is 84.2 percent. And then the next node outputs e^2 over 176.3; that turns out to be 0.042, so it's a 4.2 percent chance. The next one is e^-1 over that, which is 0.002, and the final one is e^3 over that, which is 0.114, so there's an 11.4 percent chance that this is class number 3, which I guess is the baby chick class. So that's the chance of it being class 0, class 1, class 2, or class 3.

So the output of the neural network, a^[L], this is also ŷ, is a 4 by 1 vector where the elements are going to be these four numbers that we just computed. So this algorithm takes the vector z^[L] and outputs four probabilities that sum to one. And if we summarize what we just did to map from z^[L] to a^[L], this whole computation of computing the exponentiation to get this temporary variable t and then normalizing, we can summarize this into a softmax activation function and say a^[L] equals the activation function g applied to the vector z^[L]: a^[L] = g^[L](z^[L]). The
unusual thing about this particular activation function is that this activation function g takes as input a 4 by 1 vector and it outputs a 4 by 1 vector. So previously, our activation functions used to take in a single real-valued input. For example, the sigmoid and the ReLU activation functions input a real number and output a real number. The unusual thing about the softmax activation function is that, because it needs to normalize across the different possible outputs, it needs to take in a vector of inputs and output a vector.

So, to show one of the things that a softmax classifier can represent, I'm going to show you some examples where you have inputs x1 and x2, and these feed directly to a softmax layer that has three or four or more output nodes that then outputs ŷ. So I'm going to show you a neural network with no hidden layer, and all it does is compute z^[1] = W^[1] x + b^[1], and then the output a^[1], or ŷ, is just the softmax activation function applied to z^[1]. So this neural network with no hidden layers should give you a sense of the types of things a softmax function can represent.

So here's one example with just raw inputs x1 and x2. A softmax layer with C equals 3 output classes can represent this type of decision boundary. Notice this is kind of several linear decision boundaries, but this allows it to separate out the data into three classes. In this diagram, what we did was we actually took the training set that's kind of shown in this figure and trained the softmax classifier with three output labels on the data, and then the color on this plot shows thresholding the outputs of the softmax classifier, coloring in the input based on which one of the three outputs had the highest probability. So you can maybe kind of see that this is like a generalization of logistic regression, with sort of linear decision boundaries, but with more than two classes: instead of the class being just 0 or 1, the class can be 0, 1, or 2.

Here's another example of the decision boundary that a
softmax classifier represents when trained on a data set with three classes. And here's another one. One intuition is that the decision boundary between any two classes will be linear. That's why you see, for example, that the decision boundary between the yellow and the red classes is a linear boundary, the one between the purple and red is another linear decision boundary, and the one between the purple and yellow is another linear decision boundary. But it's able to use these different linear functions in order to separate the space into three classes.

Let's look at some examples with more classes. So this is an example with C equals 4, so there's the green class, and softmax can continue to represent these types of linear decision boundaries between multiple classes. Here's one more example with C equals 5 classes, and here's one last example with C equals 6. So this shows the type of things that the softmax classifier can do when there is no hidden layer. Of course, if you have a much deeper neural network, with x and then some hidden units and then more hidden units and so on, then you can learn even more complex nonlinear decision boundaries to separate out multiple different classes.

So I hope this gives you a sense of what a softmax layer, or the softmax activation function, in a neural network can do. In the next video, let's take a look at how you can train a neural network that uses a softmax layer.
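
The worked example in the transcript (exponentiate each entry of z, then divide by the sum) can be checked with a few lines of Python. This is only a minimal sketch, not code from the lecture: the function name `softmax` and the variable names are illustrative, and the subtraction of the maximum score is a common numerical-stability step that the video does not mention. It does not change the result, because the same factor cancels in the numerator and the denominator.

```python
import math


def softmax(z):
    """Map a list of real-valued scores to probabilities that sum to 1."""
    # Assumption (not in the lecture): shift by max(z) so that exp() cannot
    # overflow for large scores. The shift cancels out after normalization.
    shift = max(z)
    t = [math.exp(v - shift) for v in z]  # element-wise exponentiation
    total = sum(t)                        # normalizing constant
    return [v / total for v in t]


# Scores from the transcript's example: z = [5, 2, -1, 3]
z = [5.0, 2.0, -1.0, 3.0]
a = softmax(z)

# The predicted class is the index with the highest probability.
predicted_class = max(range(len(a)), key=lambda i: a[i])

for i, p in enumerate(a):
    print(f"class {i}: {p:.3f}")
print("predicted class:", predicted_class)
```

Run as written, this should reproduce the probabilities quoted in the example to three decimal places (about 0.842, 0.042, 0.002, and 0.114), with class 0 as the most likely class.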
Original Description
Take the Deep Learning Specialization: http://bit.ly/2xdG0Et
Check out all our courses: https://www.deeplearning.ai
Subscribe to The Batch, our weekly newsletter: https://www.deeplearning.ai/thebatch
Follow us:
Twitter: https://twitter.com/deeplearningai_
Facebook: https://www.facebook.com/deeplearningHQ/
Linkedin: https://www.linkedin.com/company/deeplearningai