Derivatives Of Activation Functions (C1W3L08)
Key Takeaways
The video discusses the derivatives of activation functions, including sigmoid, hyperbolic tangent, ReLU, and Leaky ReLU, which are essential for implementing back-propagation in neural networks. It provides formulas and examples to compute the derivatives of these functions.
Full Transcript
When you implement back-propagation for your neural network, you need to compute the slope, or the derivative, of the activation functions. So let's take a look at our choices of activation functions and how you can compute the slope of these functions.

Here is the familiar sigmoid activation function. For any given value of z, maybe this value of z, this function will have some slope or some derivative corresponding to, if you draw a little line there, the height over width of this little triangle here. So if g(z) is the sigmoid function, then the slope of the function is d/dz g(z), and we know from calculus that this is the slope of g(x) at z. If you are familiar with calculus and know how to take derivatives, then if you take the derivative of the sigmoid function, it is possible to show that it is equal to this formula. Again, I'm not going to do the calculus steps, but if you're familiar with calculus, feel free to pause the video and try to prove this yourself. And so this is equal to just g(z) times (1 - g(z)).

Let's just sanity check that this expression makes sense. First, if z is very large, say z is equal to 10, then g(z) will be close to 1, and so the formula we have on the left tells us that d/dz g(z) will be close to g(z), which is about 1, times (1 - 1), which is therefore very close to 0. This is indeed correct, because when z is very large the slope is close to 0. Conversely, if z is equal to -10, so way out there, then g(z) is close to 0, so the formula on the left tells us d/dz g(z) will be close to g(z), which is 0, times (1 - 0), and so it is also very close to 0. Finally, if z is equal to 0, then g(z) is equal to 1/2, as the sigmoid function shows right here, and so the derivative is equal to 1/2 times (1 - 1/2), which is equal to 1/4. That actually turns out to be the correct value of the derivative, or the slope, of this function when z is equal to 0.

Finally, just to introduce one more piece of notation: sometimes, instead of writing d/dz g(z), the shorthand for the derivative is g'(z). In calculus, the little dash on top is called prime, so g'(z) is a shorthand for the derivative of the function g with respect to the input variable z. And then in a neural network we have a = g(z), so this formula also simplifies to a(1 - a). So sometimes in an implementation you might see something like g'(z) = a(1 - a), and that just refers to the observation that g', which means the derivative, is equal to this over here. The advantage of this formula is that if you've already computed the value for a, then by using this expression you can very quickly compute the value of the slope g' as well.

All right, so that was the sigmoid activation function. Let's now look at the tanh activation function. Similar to what we had previously, the definition of d/dz g(z) is the slope of g(z) at a particular point z. If you look at the formula for the hyperbolic tangent function, and if you know calculus, you can take derivatives and show that this simplifies to 1 - (tanh(z))^2, and using the same shorthand we had previously, we call this g'(z) again.

If you want, you can sanity check that this formula makes sense. For example, if z is equal to 10, tanh(z) will be very close to 1 (this function goes from +1 to -1), and then g'(z), according to this formula, will be about 1 - 1^2, which is close to 0. So when z is very large, the slope is close to 0. Conversely, if z is very small, say z is equal to -10, then tanh(z) will be close to -1, and so g'(z) will be close to 1 - (-1)^2, so it's close to 1 - 1, which is also close to 0. And finally, if z is equal to 0, then tanh(z) is equal to 0, and then the slope is actually equal to 1, which is the slope at the point where z is equal to 0.

So just to summarize, if a is equal to g(z), that is, if a is equal to tanh(z), then the derivative g'(z) is equal to 1 - a^2. So once again, if you've already computed the value of a, you can use this formula to very quickly compute the derivative as well.

Finally, here's how you compute the derivatives for the ReLU and Leaky ReLU activation functions. For the ReLU, g(z) is equal to max(0, z), so the derivative turns out to be 0 if z is less than 0 and 1 if z is greater than 0, and it is technically undefined if z is equal to exactly 0. But if you're implementing this in software, it might not be a hundred percent mathematically correct, but it will work just fine if, when z is exactly 0, you set the derivative to be equal to 1, or you set it to be 0. It kind of doesn't matter. If you're an expert in optimization, technically g' then becomes what's called a sub-gradient of the activation function g(z), which is why gradient descent still works. But you can think of it as that the chance of z being exactly 0.000... is so small that it almost doesn't matter what you set the derivative to be equal to when z is equal to 0. So in practice, this is what people implement for the derivative.

And finally, if you are training your neural network with the Leaky ReLU activation function, then g(z) is going to be max of, say, 0.01z and z, and so g'(z) is equal to 0.01 if z is less than 0 and 1 if z is greater than 0. Once again, the gradient is technically not defined when z is exactly equal to 0, but if you implement a piece of code that sets the derivative g' to either 0.01 or to 1 when z is exactly 0, either way it doesn't really matter; your code will work.

So, armed with these formulas, you should be able to compute the slopes, or the derivatives, of your activation functions. Now that we have these building blocks, you're ready to see how to implement gradient descent for your neural network. Let's go on to the next video to see that.
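The formulas from the transcript can be written down as a short Python sketch. This is only an illustration, not code from the course: the function names, the plain-float (non-vectorized) style, the choice to return 1 at exactly z = 0 for ReLU and Leaky ReLU, and the numerical_slope helper used for checking are all assumptions made here.

```python
import math


def sigmoid(z: float) -> float:
    """Sigmoid activation g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_prime_from_a(a: float) -> float:
    """Sigmoid slope written in terms of a = g(z): g'(z) = a * (1 - a)."""
    return a * (1.0 - a)


def tanh_prime_from_a(a: float) -> float:
    """tanh slope written in terms of a = tanh(z): g'(z) = 1 - a^2."""
    return 1.0 - a * a


def relu(z: float) -> float:
    """ReLU activation g(z) = max(0, z)."""
    return max(0.0, z)


def relu_prime(z: float) -> float:
    """0 for z < 0 and 1 for z > 0; z == 0 is assigned 1 here (an arbitrary choice)."""
    return 1.0 if z >= 0 else 0.0


def leaky_relu(z: float, slope: float = 0.01) -> float:
    """Leaky ReLU activation g(z) = max(slope * z, z), with slope = 0.01 as in the video."""
    return max(slope * z, z)


def leaky_relu_prime(z: float, slope: float = 0.01) -> float:
    """slope for z < 0 and 1 for z > 0; z == 0 is assigned 1 here (an arbitrary choice)."""
    return 1.0 if z >= 0 else slope


def numerical_slope(g, z: float, eps: float = 1e-6) -> float:
    """Central-difference estimate of the slope of g at z, used only as a sanity check."""
    return (g(z + eps) - g(z - eps)) / (2.0 * eps)


if __name__ == "__main__":
    # Repeat the sanity checks from the video at z = -10, 0 and 10.
    for z in (-10.0, 0.0, 10.0):
        a_sigmoid = sigmoid(z)
        a_tanh = math.tanh(z)
        print(z, sigmoid_prime_from_a(a_sigmoid), tanh_prime_from_a(a_tanh))
```

Passing the already-computed activation a, rather than z, to the sigmoid and tanh derivative helpers mirrors the point made in the video: once a is known from the forward pass, the slope costs only one more multiplication.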
Original Description
Take the Deep Learning Specialization: http://bit.ly/2wksNJw
Check out all our courses: https://www.deeplearning.ai
Subscribe to The Batch, our weekly newsletter: https://www.deeplearning.ai/thebatch
Follow us:
Twitter: https://twitter.com/deeplearningai_
Facebook: https://www.facebook.com/deeplearningHQ/
Linkedin: https://www.linkedin.com/company/deeplearningai