Derivatives Of Activation Functions (C1W3L08)

DeepLearningAI · Beginner ·🧬 Deep Learning ·8y ago

Key Takeaways

The video discusses the derivatives of activation functions, including sigmoid, hyperbolic tangent, ReLU, and Leaky ReLU, which are essential for implementing back-propagation in neural networks. It provides formulas and examples to compute the derivatives of these functions.

Full Transcript

When you implement back-propagation for your neural network, you need to compute the slope, or the derivative, of the activation functions. So let's take a look at our choices of activation functions and how you can compute the slope of these functions.

Here's the familiar sigmoid activation function. For any given value of z, maybe this value of z, the function will have some slope or derivative: if you draw a little line there, it's the height over the width of this little triangle. So if g(z) is the sigmoid function, then the slope of the function is d/dz g(z), and we know from calculus that this is the slope of g at z. If you are familiar with calculus and know how to take derivatives, you can take the derivative of the sigmoid function and show that it is equal to this formula. Again, I'm not going to do the calculus steps, but if you're familiar with calculus, feel free to pause the video and try to prove this yourself. It is equal to just g(z) times (1 - g(z)).

Let's sanity check that this expression makes sense. First, if z is very large, say z is equal to 10, then g(z) will be close to 1, and so the formula we have on the left tells us that d/dz g(z) will be close to 1 times (1 - 1), which is therefore very close to 0. This is indeed correct, because when z is very large the slope is close to 0. Conversely, if z is equal to -10, so it's way out there, then g(z) is close to 0, so the formula on the left tells us d/dz g(z) will be close to 0 times (1 - 0), and so it is also very close to 0, which is correct. Finally, if z is equal to 0, then g(z) is equal to 1/2, as the sigmoid function shows right here, and so the derivative is equal to 1/2 times (1 - 1/2), which is equal to 1/4. That turns out to be the correct value of the derivative, or the slope of this function, when z is equal to 0.

Finally, just to introduce one more piece of notation: sometimes, instead of writing d/dz g(z), the shorthand for the derivative is g'(z). In calculus, the little dash on top is called prime, so g'(z) is a shorthand for the derivative of the function g with respect to the input variable z. And in a neural network we have a = g(z), so this formula also simplifies to a(1 - a). So sometimes in an implementation you might see something like g'(z) = a(1 - a), and that just refers to the observation that g', which means the derivative, is equal to this expression over here. The advantage of this formula is that if you've already computed the value for a, then by using this expression you can very quickly compute the value of the slope g' as well.

All right, so that was the sigmoid activation function. Let's now look at the tanh activation function. Similar to what we had previously, d/dz g(z) is the slope of g(z) at a particular point z. If you look at the formula for the hyperbolic tangent function, and if you know calculus, you can take derivatives and show that this simplifies to g'(z) = 1 - (tanh(z))^2, using the shorthand we had previously where we call this g'(z). If you want, you can sanity check that this formula makes sense. For example, if z is equal to 10, tanh(z) will be very close to 1 (tanh goes from -1 to +1), and then g'(z) according to this formula will be about 1 - 1 squared, so it's close to 0. So when z is very large, the slope is close to 0. Conversely, if z is very small, say z is equal to -10, then tanh(z) will be close to -1, and so g'(z) will be close to 1 - (-1) squared, so it's close to 1 - 1, which is also close to 0. And finally, if z is equal to 0, then tanh(z) is equal to 0, and the slope is equal to 1, which is indeed the slope at the point z = 0.

So just to summarize: if a is equal to g(z), so if a is equal to tanh(z), then the derivative g'(z) is equal to 1 - a squared. Once again, if you've already computed the value of a, you can use this formula to very quickly compute the derivative as well.

Finally, here's how you compute the derivatives for the ReLU and Leaky ReLU activation functions. For the ReLU, g(z) is equal to max(0, z), so the derivative turns out to be 0 if z is less than 0 and 1 if z is greater than 0. It is technically undefined if z is equal to exactly 0. But if you're implementing this in software, it might not be a hundred percent mathematically correct, but it works just fine: when z is exactly 0, you can set the derivative to be equal to 1, or set it to be 0, and it kind of doesn't matter. If you're an expert in optimization, technically g' then becomes what's called a sub-gradient of the activation function g(z), which is why gradient descent still works. But you can think of it this way: the chance of z being exactly 0.000... is so small that it almost doesn't matter what you set the derivative to be when z is equal to 0. So in practice, this is what people implement for the derivative.

And finally, if you are training a neural network with the Leaky ReLU activation function, then g(z) is going to be max(0.01z, z), and so g'(z) is equal to 0.01 if z is less than 0 and 1 if z is greater than 0. Once again, the gradient is technically not defined when z is exactly equal to 0, but if you implement a piece of code that sets the derivative g' to either 0.01 or 1, either way it doesn't really matter; when z is exactly 0, your code will work.

So, armed with these formulas, you should be able to compute the slopes, or the derivatives, of your activation functions. Now that we have these building blocks, you're ready to see how to implement gradient descent for your neural network. Let's go on to the next video to see that.
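The sanity checks described in the transcript for sigmoid and tanh can be reproduced in a few lines. The following is a minimal sketch, not code from the lecture: the function names (`sigmoid_prime_from_a`, `tanh_prime_from_a`, `numerical_slope`) are hypothetical, and scalar inputs with the standard `math` module are assumed for simplicity, where a real network would more likely use a vectorized library.

```python
import math


def sigmoid(z):
    """Sigmoid activation g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_prime_from_a(a):
    """Slope of the sigmoid, given the already-computed activation a = g(z)."""
    return a * (1.0 - a)


def tanh_prime_from_a(a):
    """Slope of tanh, given the already-computed activation a = tanh(z)."""
    return 1.0 - a * a


def numerical_slope(g, z, h=1e-5):
    """Central-difference estimate of g'(z), used only as a cross-check."""
    return (g(z + h) - g(z - h)) / (2.0 * h)


# The three points used in the sanity checks: very negative, zero, very positive.
check_points = [-10.0, 0.0, 10.0]
sigmoid_slopes = [sigmoid_prime_from_a(sigmoid(z)) for z in check_points]
tanh_slopes = [tanh_prime_from_a(math.tanh(z)) for z in check_points]

for z, ds, dt in zip(check_points, sigmoid_slopes, tanh_slopes):
    print(f"z = {z:6.1f}   sigmoid slope = {ds:.6f}   tanh slope = {dt:.6f}")
```

Both derivative helpers take the activation `a` rather than `z`, which mirrors the point made in the video: once `a` is available from the forward pass, the slope costs only a multiplication and a subtraction.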

Original Description

Take the Deep Learning Specialization: http://bit.ly/2wksNJw Check out all our courses: https://www.deeplearning.ai Subscribe to The Batch, our weekly newsletter: https://www.deeplearning.ai/thebatch Follow us: Twitter: https://twitter.com/deeplearningai_ Facebook: https://www.facebook.com/deeplearningHQ/ Linkedin: https://www.linkedin.com/company/deeplearningai

The video teaches how to compute the derivatives of common activation functions used in neural networks, which is crucial for implementing back-propagation. It provides formulas and examples for sigmoid, hyperbolic tangent, ReLU, and Leaky ReLU activation functions.

Key Takeaways
  1. Compute the derivative of the sigmoid activation function using the formula g'(z) = g(z)(1 - g(z)), or a(1 - a) when a = g(z) has already been computed
  2. Compute the derivative of the hyperbolic tangent activation function using the formula g'(z) = 1 - (tanh(z))^2, or 1 - a^2 when a = tanh(z) has already been computed
  3. Compute the derivative of the ReLU activation function using the formula g'(z) = 0 if z < 0 and 1 if z > 0; it is technically undefined at z = 0, where setting it to either 0 or 1 works in practice
  4. Compute the derivative of the Leaky ReLU activation function using the formula g'(z) = 0.01 if z < 0 and 1 if z > 0; at z = 0, setting it to either 0.01 or 1 works in practice
💡 The derivatives of activation functions are essential for implementing back-propagation in neural networks, and the formulas provided in the video can be used to compute these derivatives.
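The ReLU and Leaky ReLU formulas can be sketched in the same way. This is an illustrative sketch under stated assumptions rather than the lecture's own code: the function names and the `slope` parameter are hypothetical, scalar inputs are assumed, and the value returned at exactly z = 0 (0 for ReLU, 0.01 for Leaky ReLU) is one arbitrary choice among the options the video says work equally well.

```python
def relu(z):
    """ReLU activation g(z) = max(0, z)."""
    return max(0.0, z)


def relu_prime(z):
    """Slope of ReLU: 1 for z > 0, 0 for z < 0.

    At exactly z == 0 the derivative is undefined; this sketch returns 0,
    which is an assumed convention (returning 1 would also be acceptable).
    """
    return 1.0 if z > 0 else 0.0


def leaky_relu(z, slope=0.01):
    """Leaky ReLU activation g(z) = max(slope * z, z), assuming 0 < slope < 1."""
    return max(slope * z, z)


def leaky_relu_prime(z, slope=0.01):
    """Slope of Leaky ReLU: 1 for z > 0, `slope` for z < 0.

    At exactly z == 0 this sketch returns `slope` (an assumed convention).
    """
    return 1.0 if z > 0 else slope


for z in (-2.0, 0.0, 3.0):
    print(f"z = {z:5.1f}   relu' = {relu_prime(z)}   leaky relu' = {leaky_relu_prime(z)}")
```

Picking the other convention at z = 0 only requires changing `z > 0` to `z >= 0` in the two derivative functions.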
