Softmax Regression (C2W3L08)

DeepLearningAI · Beginner ·🧬 Deep Learning ·8y ago

Key Takeaways

The video discusses Softmax Regression, a generalization of logistic regression for multi-class classification problems, and its application in neural networks, including the use of the softmax activation function to generate output probabilities.

Full Transcript

So far, the classification examples we've talked about have used binary classification, where you had two possible labels, 0 or 1: is it a cat, or is it not a cat? What if you have multiple possible classes? There's a generalization of logistic regression called softmax regression that lets you make predictions where you're trying to recognize one of C, or one of multiple, classes rather than just two classes. Let's take a look.

Let's say that instead of just recognizing cats, you want to recognize cats, dogs, and baby chicks. So I'm going to call cats class 1, dogs class 2, and baby chicks class 3, and if it's none of the above, then there's an "other" or "none of the above" class, which I'm going to call class 0. So here's an example of the images and the classes they belong to. That's a picture of a baby chick, so the class is 3. Cats are class 1, a dog is class 2. I guess that's a koala, so that's none of the above, so that's class 0. Class 3, and so on.

So the notation we're going to use is: I'm going to use capital C to denote the number of classes you're trying to categorize your inputs into. In this case you have four possible classes, including the other or none-of-the-above class. So when you have four classes, the numbers indexing your classes would be 0 through capital C minus 1; in other words, 0, 1, 2, or 3.

In this case we're going to build a neural network where the output layer has four, or in this case the variable capital C, output units. So n[L], the number of units in the output layer, which is layer L, is going to be equal to 4, or more generally is going to be equal to C. And what we want is for the units in the output layer to tell us the probability of each of these four classes. So the first node here is supposed to output, or we want it to output, the probability that it is the other class given the input x. This will output the probability that there's a cat given x. This will output the probability that there's a dog given x. That will output the probability, I'm just going to abbreviate baby chick to BC, so the probability of a baby chick, abbreviated BC, given the input x. So here the output label y-hat is going to be a 4-by-1 dimensional vector, because it now has to output four numbers, giving you these four probabilities. And because probabilities should sum to 1, the four numbers in the output y-hat should sum to 1.

The standard model for getting a neural network to do this uses what's called a softmax layer in the output layer in order to generate these outputs. Let me write down the math, and then come back and give some intuition about what the softmax layer is doing. So in the final layer of the neural network you are going to compute, as usual, the linear part of the layer. So z[L], that's the z variable for the final layer; remember, this is layer capital L. As usual you compute that as W[L] times the activation of the previous layer, a[L-1], plus the biases b[L] for that final layer.

Now, having computed z, you need to apply what's called the softmax activation function. This activation function is a bit unusual for the softmax layer, but this is what it does. First, we're going to compute a temporary variable, which we call t, which is e to the z[L]. This is applied element-wise. So z[L] here, in our example, is going to be 4-by-1; it's a four-dimensional vector. So t itself, e to the z[L], is an element-wise exponentiation; t will also be a 4-by-1 dimensional vector. Then the output a[L] is going to be basically the vector t, but normalized to sum to 1. So a[L] is going to be e to the z[L] divided by the sum from j equals 1 through 4, because there are four classes, of t_j. Another way of saying this is that a[L] is also a 4-by-1 vector, and the i-th element of this four-dimensional vector, let's write that a[L]_i, is going to be equal to t_i over the sum of the t_j.

In case this math isn't clear, let's go through a specific example that will make this clearer. Let's say that you compute z[L], and z[L] is a four-dimensional vector; let's say it's 5, 2, negative 1, 3. What we're going to do is use this element-wise exponentiation to compute this vector t. So t is going to be e to the 5, e to the 2, e to the negative 1, e to the 3. And if you plug that into a calculator, these are the values you get: e to the 5 is 148.4, e squared is about 7.4, e to the negative 1 is 0.4, and e cubed is 20.1. And so the way we go from the vector t to the vector a[L] is to just normalize these entries to sum to 1. So if you sum up the elements of t, if you just add up those four numbers, you get 176.3. So finally, a[L] is just going to be this vector t, as a vector, divided by 176.3.

So for example, this first node here will output e to the 5 divided by 176.3, and that turns out to be 0.842. So we're saying that for this image, if this is the value of z you get, the chance of it being class 0 is 84.2 percent. And then the next node outputs e squared over 176.3; that turns out to be 0.042, so this is a 4.2 percent chance. The next one is e to the negative 1 over that, which is 0.002, and the final one is e cubed over that, which is 0.114. So there's an 11.4 percent chance that this is class number 3, which I guess is the baby chick class. So those are the chances of it being class 0, class 1, class 2, class 3.

So the output of the neural network a[L], this is also y-hat, is a 4-by-1 vector where the elements of this 4-by-1 vector are going to be these four numbers that we just computed. So this algorithm takes the vector z[L] and maps it to four probabilities that sum to 1. And if we summarize what we just did to map from z[L] to a[L], this whole computation, of using the exponentiation to get this temporary variable t and then normalizing, we can summarize this into a softmax activation function and say a[L] equals the activation function g applied to the vector z[L]. The unusual thing about this particular activation function is that this activation function g takes as input a 4-by-1 vector and it outputs a 4-by-1 vector. So previously our activation functions used to take in a single real-valued input; for example, the sigmoid and the ReLU activation functions input a real number and output a real number. The unusual thing about the softmax activation function is that, because it needs to normalize across the different possible outputs, it needs to take in a vector of inputs and then output a vector.

So, one of the things that a softmax classifier can represent: I'm going to show you some examples where you have inputs x1, x2, and these feed directly to a softmax layer that has three or four or more output nodes that then output y-hat. So I'm going to show you a neural network with no hidden layer, and all it does is compute z[1] equals W[1] times the input x plus b, and then the output a[1], or y-hat, is just the softmax activation function applied to z[1]. So this neural network with no hidden layers should give you a sense of the types of things a softmax function can represent.

So here's one example with just raw inputs x1 and x2. A softmax layer with C equals 3 output classes can represent this type of decision boundary. Notice this is kind of several linear decision boundaries, but this allows it to separate out the data into three classes. And in this diagram, what we did was we actually took the training set that's kind of shown in this figure and trained the softmax classifier with three output labels on the data. And then the color on this plot shows thresholding the outputs of the softmax classifier and coloring in the input based on which one of the three outputs had the highest probability. So you can maybe kind of see that this is like a generalization of logistic regression with sort of linear decision boundaries, but with more than two classes: instead of the class being just 0 or 1, the class can be 0, 1, or 2.

Here's another example of the decision boundary that a softmax classifier represents when trained on a data set with three classes, and here's another one. But one intuition is that the decision boundary between any two classes will be linear. That's why you see, for example, that the decision boundary between the yellow and the gray classes is a linear boundary, between the purple and the gray is another linear decision boundary, and between the purple and the yellow is another linear decision boundary. But it's able to use these different linear functions in order to separate the space into three classes.

Let's look at some examples with more classes. So this is an example with C equals 4, so that's the green class, and softmax can continue to represent these types of linear decision boundaries between multiple classes. So here's one more example with C equals 5 classes, and here's one last example with C equals 6. So this shows the type of things the softmax classifier can do when there is no hidden layer. Of course, if you have a much deeper neural network, with x and then some hidden units, and then more hidden units, and so on, then you can learn even more complex nonlinear decision boundaries to separate out multiple different classes.

So I hope this gives you a sense of what a softmax layer, or the softmax activation function, in a neural network can do. In the next video, let's take a look at how you can train a neural network that uses a softmax layer.
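
The worked example from the lecture, where the final-layer values are 5, 2, -1, and 3, can be reproduced with a short sketch. This is a minimal illustration in plain Python rather than code from the course; the function name softmax and the variable names z_L and a_L are chosen here to mirror the lecture's notation.

```python
import math

def softmax(z):
    """Exponentiate each entry, then divide by the total so the result sums to 1."""
    t = [math.exp(z_i) for z_i in z]   # temporary vector t = e^z, element-wise
    total = sum(t)                     # normalizing constant: sum over j of t_j
    return [t_i / total for t_i in t]

z_L = [5.0, 2.0, -1.0, 3.0]            # example values used in the lecture
a_L = softmax(z_L)

print([round(p, 3) for p in a_L])      # [0.842, 0.042, 0.002, 0.114]
print(round(sum(a_L), 6))              # 1.0
```

Unlike sigmoid or ReLU, this function needs the whole vector at once, because each output depends on the sum over all entries. For large values of z, practical implementations usually subtract max(z) from every entry before exponentiating to avoid overflow; that step does not change the result and is not part of the lecture.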

Original Description

Take the Deep Learning Specialization: http://bit.ly/2xdG0Et Check out all our courses: https://www.deeplearning.ai Subscribe to The Batch, our weekly newsletter: https://www.deeplearning.ai/thebatch Follow us: Twitter: https://twitter.com/deeplearningai_ Facebook: https://www.facebook.com/deeplearningHQ/ Linkedin: https://www.linkedin.com/company/deeplearningai


This video teaches the basics of softmax regression: how it extends logistic regression to multi-class classification, how the softmax activation function turns the final layer's values into probabilities that sum to 1, and what kinds of decision boundaries a softmax classifier can represent. Training a softmax classifier is covered in the next video.

Key Takeaways
  1. Build a network whose output layer has C units, one per class (classes are indexed 0 to C - 1)
  2. Compute the linear part of the final layer as z[L] = W[L] a[L-1] + b[L]
  3. Apply the softmax activation function: exponentiate z[L] element-wise, then normalize so the outputs are probabilities that sum to 1
  4. With no hidden layer, a softmax classifier produces a linear decision boundary between any two classes
💡 Softmax takes a vector as input and outputs a vector, unlike sigmoid or ReLU, which act on single real numbers. Placed at the end of a network with hidden layers, it can separate multiple classes with more complex nonlinear decision boundaries.
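
The no-hidden-layer classifier described in the video can be sketched as follows. This is an illustrative example only: the weights W and biases b are hypothetical values picked by hand, not trained parameters from the lecture, and the helper names linear and predict are assumptions. Subtracting the maximum inside softmax is an added numerical-stability step that leaves the output unchanged.

```python
import math

# Hypothetical parameters for C = 3 classes and two input features (x1, x2).
W = [[ 1.0,  0.0],
     [-1.0,  1.0],
     [ 0.0, -1.0]]
b = [0.0, 0.5, -0.5]

def linear(W, b, x):
    """z = W x + b: one score per class."""
    return [sum(w * x_j for w, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]

def softmax(z):
    m = max(z)                              # stability shift, not from the lecture
    t = [math.exp(z_i - m) for z_i in z]
    total = sum(t)
    return [t_i / total for t_i in t]

def predict(x):
    """Return (most probable class index, probability vector y_hat)."""
    y_hat = softmax(linear(W, b, x))
    label = max(range(len(y_hat)), key=lambda i: y_hat[i])
    return label, y_hat

for x in ([2.0, 0.0], [-1.0, 2.0], [0.0, -3.0]):
    label, y_hat = predict(x)
    print(x, "->", label, [round(p, 3) for p in y_hat])
```

Because softmax preserves the ordering of its inputs, the predicted class is the one with the largest score z_i. The boundary between classes i and j is where z_i = z_j, which is a linear equation in x, so any two classes are separated by a straight line, as in the plots shown in the video.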
