Why Non-linear Activation Functions (C1W3L07)

DeepLearningAI · Beginner · 🧬 Deep Learning · 8y ago

Skills: ML Maths Basics 80%, Supervised Learning 70%

Key Takeaways

The video discusses the importance of non-linear activation functions in neural networks, explaining why linear activation functions are not sufficient for computing interesting functions. It also touches on the use of linear activation functions in the output layer for regression problems.

Full Transcript

Why does your neural network need a nonlinear activation function? It turns out that for your neural network to compute interesting functions, you do need to pick a nonlinear activation function. Let's see why.

Here are the forward prop equations for the neural network. Why don't we just get rid of the function g and set a[1] equal to z[1]? Or alternatively, you could say that g(z) is equal to z. Sometimes this is called the linear activation function. Maybe a better name for it would be the identity activation function, because it just outputs whatever was input. For the purpose of this, what if a[2] was just equal to z[2]? It turns out that if you do this, then this model is just computing y, or y hat, as a linear function of your input features x.

Take the first two equations. If you have that a[1] = z[1] = W[1]x + b[1], and then a[2] = z[2] = W[2]a[1] + b[2], then if you take the definition of a[1] and plug it in there, you find that a[2] = W[2](W[1]x + b[1]) + b[2], where the term in parentheses is a[1]. This simplifies to (W[2]W[1])x + (W[2]b[1] + b[2]). Let's call these W' and b', so this is just equal to W'x + b'.

If you were to use linear activation functions, or we could also call them identity activation functions, then the neural network is just outputting a linear function of the input. We'll talk about deep networks later, neural networks with many, many layers, many hidden layers. It turns out that if you use a linear activation function, or alternatively if you don't have an activation function, then no matter how many layers your neural network has, all it's doing is just computing a linear activation function, so you might as well not have any hidden layers.

One case to briefly mention: it turns out that if you have a linear activation function here (in the hidden layer) and a sigmoid function here (in the output layer), then this model is no more expressive than standard logistic regression without any hidden layer. I won't bother to prove that, but you could try to do so if you want. The take-home is that a linear hidden layer is more or less useless, because the composition of two linear functions is itself a linear function. So unless you throw a nonlinearity in there, you're not computing more interesting functions even as you go deeper in the network.

There is just one place where you might use a linear activation function, g(z) = z, and that's if you are doing machine learning on a regression problem, so if y is a real number. For example, if you're trying to predict housing prices, y is not 0 or 1 but a real number, anywhere from zero dollars up to however expensive houses get. I guess houses can be potentially millions of dollars, so however much houses cost in your data set. If y takes on these real values, then it might be okay to have a linear activation function here (in the output layer), so that your output y hat is also a real number going anywhere from minus infinity to plus infinity. But then the hidden units should not use linear activation functions. They could use ReLU or tanh or leaky ReLU or maybe something else. So the one place you might use a linear activation function is usually in the output layer. Other than that, using a linear activation function in a hidden layer, except for some very special circumstances relating to compression that we won't talk about, is extremely rare.

Oh, and of course, if you're actually predicting housing prices, as you saw in the week 1 video, because housing prices are all non-negative, perhaps even then you can use a ReLU activation function so that your outputs y hat are all greater than or equal to 0.

So I hope that gives you a sense of why having a nonlinear activation function is a critical part of neural networks. Next, we're going to start to talk about gradient descent. To set up the discussion for gradient descent, in the next video I want to show you how to compute the slope, or the derivative, of individual activation functions. So let's go on to the next video.
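
The substitution described in the transcript can be written out symbolically. The block below is a sketch of that derivation in LaTeX, assuming the bracketed-superscript layer notation used in the course; W' and b' are the names the speaker gives to the collapsed parameters.

```latex
\begin{align*}
a^{[1]} &= z^{[1]} = W^{[1]} x + b^{[1]} \\
a^{[2]} &= z^{[2]} = W^{[2]} a^{[1]} + b^{[2]} \\
        &= W^{[2]} \left( W^{[1]} x + b^{[1]} \right) + b^{[2]} \\
        &= \underbrace{W^{[2]} W^{[1]}}_{W'} \, x
           + \underbrace{W^{[2]} b^{[1]} + b^{[2]}}_{b'} \\
        &= W' x + b'
\end{align*}
```

Applying the same substitution layer by layer suggests why adding more identity-activation layers still leaves a single linear map of x.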

Original Description

Take the Deep Learning Specialization: http://bit.ly/2IcuTOr
Check out all our courses: https://www.deeplearning.ai
Subscribe to The Batch, our weekly newsletter: https://www.deeplearning.ai/thebatch
Follow us:
Twitter: https://twitter.com/deeplearningai_
Facebook: https://www.facebook.com/deeplearningHQ/
LinkedIn: https://www.linkedin.com/company/deeplearningai

Non-linear activation functions are crucial for neural networks to compute interesting functions. A network that uses only linear (identity) activations collapses to a single linear function of its input, so linear activations are rarely used in hidden layers. The video also sets up the next lesson's discussion of gradient descent, which starts with computing the derivatives (slopes) of individual activation functions.
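
One way to check this numerically is the small sketch below. It uses only the Python standard library. The weights, biases, and helper names (`matvec`, `matmul`, `forward`, `midpoint_gap`) are made up for illustration and do not come from the video. It compares a two-layer network with identity activations against the single layer W'x + b', then uses a midpoint test to show that the same network with tanh in the hidden layer is no longer a linear function of x.

```python
import math

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrix A by matrix B (both lists of rows)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def add(u, v):
    """Element-wise sum of two vectors."""
    return [a + b for a, b in zip(u, v)]

def forward(x, W1, b1, W2, b2, g):
    """Two-layer forward pass: hidden activation g, identity output."""
    a1 = [g(z) for z in add(matvec(W1, x), b1)]
    return add(matvec(W2, a1), b2)

# Hypothetical parameters: 2 inputs, 3 hidden units, 1 output.
W1 = [[1.0, -2.0], [0.5, 0.5], [-1.0, 3.0]]
b1 = [0.1, -0.2, 0.3]
W2 = [[2.0, -1.0, 0.5]]
b2 = [0.4]

# Collapsed parameters: W' = W2 W1 and b' = W2 b1 + b2.
W_prime = matmul(W2, W1)
b_prime = add(matvec(W2, b1), b2)

def identity(z):
    return z

def linear_net(x):
    return forward(x, W1, b1, W2, b2, identity)

def tanh_net(x):
    return forward(x, W1, b1, W2, b2, math.tanh)

def collapsed(x):
    return add(matvec(W_prime, x), b_prime)

def midpoint_gap(f, u, v):
    """|f(midpoint) - average of f(u), f(v)|; zero for any linear-plus-bias f."""
    mid = [(a + b) / 2 for a, b in zip(u, v)]
    return abs(f(mid)[0] - (f(u)[0] + f(v)[0]) / 2)

u, v = [0.7, -1.3], [-0.2, 0.4]
print("identity net:", linear_net(u), "collapsed:", collapsed(u))
print("midpoint gap, identity net:", midpoint_gap(linear_net, u, v))
print("midpoint gap, tanh net:    ", midpoint_gap(tanh_net, u, v))
```

With identity activations the two outputs agree up to floating-point rounding and the midpoint gap is essentially zero. With tanh the gap is clearly non-zero, so no single W' and b' can reproduce that network.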

Key Takeaways

Understand the limitation of linear activation functions
Recognize the importance of non-linear activation functions
Choose appropriate activation functions for hidden and output layers
Compute derivatives of activation functions
Implement neural networks for regression problems
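
To make the choice of output activation concrete, here is a minimal sketch, assuming a tiny network with a tanh hidden layer and made-up weights (none of the numbers or function names come from the video). With an identity output the prediction can be any real number. Swapping in ReLU at the output keeps predictions non-negative, which is the option the transcript suggests for quantities such as housing prices.

```python
import math

def identity(z):
    return z

def relu(z):
    return max(0.0, z)

def predict(x, W1, b1, w2, b2, hidden_g, output_g):
    """One hidden layer with activation hidden_g, scalar output with output_g."""
    a1 = [hidden_g(sum(w * xi for w, xi in zip(row, x)) + b)
          for row, b in zip(W1, b1)]
    z2 = sum(w * a for w, a in zip(w2, a1)) + b2
    return output_g(z2)

# Hypothetical parameters: 2 inputs, 2 hidden units, 1 output.
W1 = [[0.5, -1.0], [1.5, 0.25]]
b1 = [0.0, -0.5]
w2 = [-2.0, 1.0]
b2 = 0.1

x_a = [1.0, 0.5]    # pre-activation at the output is positive
x_b = [2.0, -1.0]   # pre-activation at the output is negative

for x in (x_a, x_b):
    y_identity = predict(x, W1, b1, w2, b2, math.tanh, identity)
    y_relu = predict(x, W1, b1, w2, b2, math.tanh, relu)
    print(x, "identity output:", y_identity, "ReLU output:", y_relu)
```

For `x_a` the two outputs coincide because the pre-activation is positive. For `x_b` the identity output is negative while the ReLU output is clipped to 0.0. In both variants the hidden layer stays non-linear.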

💡 Non-linear activation functions are necessary for neural networks to compute interesting functions, and linear activation functions are rarely used in hidden layers.
