Why Non-linear Activation Functions (C1W3L07)
Key Takeaways
The video discusses the importance of non-linear activation functions in neural networks, explaining why linear activation functions are not sufficient for computing interesting functions. It also touches on the use of linear activation functions in the output layer for regression problems.
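The takeaway above can be made concrete for a network of any depth. The following is a sketch of the derivation, assuming the course's usual bracketed-superscript layer notation ($W^{[l]}$, $b^{[l]}$, $a^{[l]}$, with $a^{[0]} = x$) and a hypothetical layer count $L$; the transcript itself only works through the two-layer case.

```latex
% Every layer uses the identity activation g(z) = z:
\begin{align*}
a^{[l]} &= z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]}, \qquad l = 1, \dots, L, \qquad a^{[0]} = x \\[4pt]
% Two layers (the case worked in the video):
a^{[2]} &= W^{[2]}\bigl(W^{[1]} x + b^{[1]}\bigr) + b^{[2]}
         = \underbrace{W^{[2]} W^{[1]}}_{W'}\, x + \underbrace{W^{[2]} b^{[1]} + b^{[2]}}_{b'} \\[4pt]
% Unrolling the recursion for L layers gives the same form:
a^{[L]} &= W' x + b', \qquad
W' = W^{[L]} W^{[L-1]} \cdots W^{[1]}, \qquad
b' = \sum_{l=1}^{L} \bigl(W^{[L]} \cdots W^{[l+1]}\bigr)\, b^{[l]}
\end{align*}
% For l = L the product W^{[L]} ... W^{[l+1]} is empty and is read as the identity matrix.
```

However many layers are stacked, the output is still a single affine map of $x$, which is why the hidden layers add no expressive power without a non-linearity.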
Full Transcript
Why does your neural network need a nonlinear activation function? It turns out that for your neural network to compute interesting functions, you do need to pick a nonlinear activation function. Let's see why. So here are the forward prop equations for the neural network. Why don't we just get rid of this, get rid of the function g, and set a[1] equal to z[1]? Or alternatively, you could say that g(z) is equal to z. Sometimes this is called the linear activation function. Maybe a better name for it would be the identity activation function, because it just outputs whatever was input. For the purpose of this, what if a[2] was just equal to z[2]? It turns out that if you do this, then this model is just computing y, or y hat, as a linear function of your input features x.

To take the first two equations: if you have that a[1] = z[1] = W[1] x + b[1], and if then a[2] = z[2] = W[2] a[1] + b[2], then if you take the definition of a[1] and plug it in there, you find that a[2] = W[2] (W[1] x + b[1]) + b[2], where the part in parentheses is a[1]. And so this simplifies to (W[2] W[1]) x + (W[2] b[1] + b[2]). Let's call these W' and b', so this is just equal to W' x + b'. If you were to use linear activation functions, or we could also call them identity activation functions, then the neural network is just outputting a linear function of the input.

And we'll talk about deep networks later, neural networks with many, many layers, many, many hidden layers. It turns out that if you use a linear activation function, or alternatively if you don't have an activation function, then no matter how many layers your neural network has, all it's doing is just computing a linear function, so you might as well not have any hidden layers. One of the cases I'll briefly mention: it turns out that if you have a linear activation function here (in the hidden layer) and a sigmoid function here (at the output), then this model is no more expressive than standard logistic regression without any hidden layer. So I won't bother to prove that, but you could try to do so if you want. But the take-home is that a linear hidden layer is more or less useless, because the composition of two linear functions is itself a linear function. So unless you throw a nonlinearity in there, you're not computing more interesting functions even as you go deeper in the network.

There is just one place where you might use a linear activation function, g(z) = z, and that's if you are doing machine learning on a regression problem, so if y is a real number. For example, if you're trying to predict housing prices, y is not 0 or 1 but a real number, anywhere from zero dollars up to however expensive houses get. I guess houses can potentially be millions of dollars, so however much houses cost in your data set. If y takes on these real values, then it might be OK to have a linear activation function here, so that your output y hat is also a real number, going anywhere from minus infinity to plus infinity. But then the hidden units should not use linear activation functions. They could use ReLU or tanh or leaky ReLU or maybe something else. So the one place you might use a linear activation function is usually in the output layer. Other than that, using a linear activation function in a hidden layer, except for some very special circumstances relating to compression that we won't talk about, is extremely rare. Oh, and of course, if you're actually predicting housing prices, as you saw in the week 1 video, because housing prices are all non-negative, perhaps even then you can use a ReLU activation function so that your outputs y hat are all greater than or equal to 0.

So I hope that gives you a sense of why having a nonlinear activation function is a critical part of neural networks. Next we're going to start to talk about gradient descent, and to set up the discussion of gradient descent, in the next video I want to show you how to compute the slope, or the derivative, of individual activation functions. So let's go on to the next video.
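The collapse argument in the transcript can be checked numerically. The sketch below is illustrative only: the weights, the helper names (matvec, matmul, forward) and the test inputs are hand-picked assumptions, not values from the video. It builds a tiny two-layer network, shows that with the identity activation it matches a single layer with W' = W[2] W[1] and b' = W[2] b[1] + b[2], and then shows that swapping in ReLU breaks the symmetry f(x) + f(-x) = 2 f(0) that holds for any function of the form W' x + b'.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrix A (m x k) by matrix B (k x n)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def add(u, v):
    """Element-wise sum of two vectors."""
    return [a + b for a, b in zip(u, v)]

def identity(v):
    """The 'linear' activation g(z) = z."""
    return list(v)

def relu(v):
    """ReLU activation g(z) = max(0, z), applied element-wise."""
    return [max(0.0, a) for a in v]

def forward(x, W1, b1, W2, b2, g):
    """Two-layer forward pass: hidden activation g, linear output layer."""
    a1 = g(add(matvec(W1, x), b1))
    return add(matvec(W2, a1), b2)

# Hand-picked illustrative parameters: 1 input, 2 hidden units, 1 output.
W1 = [[1.0], [-1.0]]
b1 = [0.5, 0.0]
W2 = [[2.0, 1.0]]
b2 = [0.5]

# Collapsed single-layer parameters from the derivation.
W_prime = matmul(W2, W1)           # W[2] W[1]
b_prime = add(matvec(W2, b1), b2)  # W[2] b[1] + b[2]

# Identity hidden layer: the two-layer net equals the collapsed layer.
x = [2.0]
two_layer = forward(x, W1, b1, W2, b2, identity)
one_layer = add(matvec(W_prime, x), b_prime)

# ReLU hidden layer: f(x) + f(-x) no longer equals 2 f(0),
# so the network is not of the form W' x + b'.
f_pos = forward([2.0], W1, b1, W2, b2, relu)[0]
f_neg = forward([-2.0], W1, b1, W2, b2, relu)[0]
f_zero = forward([0.0], W1, b1, W2, b2, relu)[0]

print("identity two-layer:", two_layer, "collapsed:", one_layer)
print("ReLU: f(2) + f(-2) =", f_pos + f_neg, "vs 2 f(0) =", 2 * f_zero)
```

The output layer is left linear here, matching the regression setting the transcript describes; only the hidden activation is switched between identity and ReLU.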
Original Description
Take the Deep Learning Specialization: http://bit.ly/2IcuTOr
Check out all our courses: https://www.deeplearning.ai
Subscribe to The Batch, our weekly newsletter: https://www.deeplearning.ai/thebatch
Follow us:
Twitter: https://twitter.com/deeplearningai_
Facebook: https://www.facebook.com/deeplearningHQ/
Linkedin: https://www.linkedin.com/company/deeplearningai