The Problem of Local Optima (C2W3L10)

DeepLearningAI · Beginner ·📐 ML Fundamentals ·8y ago

Key Takeaways

The video discusses the problem of local optima in deep learning. It explains that in high-dimensional parameter spaces training is unlikely to get stuck in bad local optima, because most points of zero gradient are saddle points rather than local optima. It also introduces plateaus, regions where the gradient stays close to zero, which can slow down learning.

Full Transcript

In the early days of deep learning, people used to worry a lot about the optimization algorithm getting stuck in bad local optima. But as the theory of deep learning has advanced, our understanding of local optima is also changing. Let me show you how we now think about local optima and problems in the optimization problem in deep learning.

So this was a picture people used to have in mind when they worried about local optima. Maybe you're trying to optimize some set of parameters, and we call them W1 and W2, and the height of the surface is the cost function. In this picture, it looks like there are a lot of local optima in all those places, and it'd be easy for gradient descent, or one of the other algorithms, to get stuck in a local optimum rather than find its way to a global optimum. It turns out that if you are plotting a figure like this in two dimensions, then it's easy to create plots like this with a lot of different local optima, and these very low-dimensional plots used to guide their intuition. But this intuition isn't actually correct. It turns out if you train a neural network, most points of zero gradient are not local optima like points like this. Instead, most points of zero gradient in the cost function are actually saddle points. So that's a point with a zero gradient. Again, this is maybe W1, W2, and the height is the value of the cost function J.

But informally, for a function in a very high-dimensional space, if the gradient is zero, then in each direction it can either be a convex-like function or a concave-like function. And if you are in, say, a 20,000-dimensional space, then for it to be a local optimum, all 20,000 directions need to look like this, and so the chance of that happening is maybe very small, you know, maybe 2 to the minus 20,000. Instead, you're much more likely to get some directions where the curve bends up like so, as well as some directions where the function is bending down, rather than have them all bend upwards. So that's why in very high-dimensional spaces you're actually much more likely to run into a saddle point, like that shown on the right, than a local optimum.

Oh, and as for why the surface is called a saddle point: if you can picture, maybe this is a sort of saddle you put on a horse, right? So maybe if this is a horse, I guess there's the head of a horse, there's the eye of a horse. Well, not a great drawing of a horse, but you get the idea. Then you, the rider, will sit here in the saddle. So that's why this point here, where the derivative is zero, is called a saddle point. It's really the point on the saddle where you would sit, and that happens to have derivative zero.

And so one of the lessons we learned in the history of deep learning is that a lot of our intuitions about low-dimensional spaces, like what you can plot on the left, really don't transfer to the very high-dimensional spaces that our learning algorithms are operating over. Because if you have 20,000 parameters, then J is a function over a 20,000-dimensional vector, and you're much more likely to see saddle points than local optima.

If local optima aren't a problem, then what is a problem? It turns out that plateaus can really slow down learning, and a plateau is a region where the derivative is close to zero for a long time. So if you are here, then gradient descent will move down the surface, and because the gradient is zero or near zero and the surface is quite flat, it can actually take a very long time to slowly find your way to maybe this point on the plateau. And then, because of a random perturbation to the left or right, maybe then finally (I'm going to switch pen colors for clarity) your algorithm can find its way off the plateau. But it has to take this very long slope before it finds its way here and can get off this plateau.

So the takeaways from this video are: first, you're actually pretty unlikely to get stuck in bad local optima so long as you're training a reasonably large neural network, with a lot of parameters, and the cost function J is defined over a relatively high-dimensional space. But second, plateaus are a problem, and they can actually make learning pretty slow. And this is where algorithms like momentum or RMSprop or Adam can really help your learning algorithm as well. These are scenarios where more sophisticated optimization algorithms, such as Adam, can actually speed up the rate at which you move down the plateau and then get off the plateau.

So because your networks are solving optimization problems over such high-dimensional spaces, to be honest, I don't think anyone has great intuitions about what these spaces really look like, and our understanding of them is still evolving. But I hope this gives you some better intuition about the challenges that the optimization algorithms may face.

So, congratulations on coming to the end of this week's content. Please take a look at this week's quiz as well as the programming exercise. I hope you enjoy practicing some of these ideas with this week's programming exercise, and I look forward to seeing you at the start of next week's videos.
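
The saddle-point picture from the transcript can be checked numerically on a toy example. The sketch below is not from the video: the cost J(w1, w2) = w1^2 - w2^2, the helper names, and the finite-difference step are all assumptions chosen for illustration. It measures the curvature along each parameter at the origin (where the gradient is zero) and also evaluates the informal "2 to the minus 20,000" estimate, which treats each direction as an independent coin flip between bending up and bending down.

```python
import math

def cost(w1, w2):
    """Hypothetical toy cost with a zero-gradient point at the origin."""
    return w1 ** 2 - w2 ** 2

def second_derivative(f, x, h=1e-4):
    """Central finite-difference estimate of the curvature of a 1-D function f at x."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / (h * h)

# Curvature along each parameter direction at the critical point (0, 0).
curv_w1 = second_derivative(lambda t: cost(t, 0.0), 0.0)  # bends up (positive)
curv_w2 = second_derivative(lambda t: cost(0.0, t), 0.0)  # bends down (negative)

# One direction curving up and another curving down is enough to rule out
# both a local minimum and a local maximum, so the point is a saddle.
# (In general, proving a minimum needs the full Hessian, not just the axes.)
is_saddle = curv_w1 > 0.0 and curv_w2 < 0.0

# Informal estimate from the lecture: if each of n directions independently
# bends up with probability 1/2, all n bend up with probability 2**(-n).
# The independence assumption is a heuristic, not a proven property of networks.
n = 20000
log10_prob_all_up = -n * math.log10(2.0)  # about -6020.6

print(f"curvature along w1: {curv_w1:.3f}, along w2: {curv_w2:.3f}")
print(f"saddle point at origin: {is_saddle}")
print(f"log10 of 2^-{n}: {log10_prob_all_up:.1f}")
```

The probability is reported as a base-10 logarithm because 2^-20000 is far below the smallest positive floating-point number and would otherwise print as zero.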

Original Description

Take the Deep Learning Specialization: http://bit.ly/39xFIXq Check out all our courses: https://www.deeplearning.ai Subscribe to The Batch, our weekly newsletter: https://www.deeplearning.ai/thebatch Follow us: Twitter: https://twitter.com/deeplearningai_ Facebook: https://www.facebook.com/deeplearningHQ/ Linkedin: https://www.linkedin.com/company/deeplearningai

Playlist

Uploads from DeepLearningAI · DeepLearningAI · 14 of 60

1 Forward and Backward Propagation (C1W4L06)
2 deeplearning.ai's Heroes of Deep Learning: Yuanqing Lin
3 deeplearning.ai's Heroes of Deep Learning: Ruslan Salakhutdinov
4 deeplearning.ai's Heroes of Deep Learning: Yoshua Bengio
5 deeplearning.ai's Heroes of Deep Learning: Pieter Abbeel
6 deeplearning.ai's Heroes of Deep Learning: Ian Goodfellow
7 deeplearning.ai's Heroes of Deep Learning: Andrej Karpathy
8 Using an Appropriate Scale (C2W3L02)
9 Gradient Checking (C2W1L13)
10 Gradient Checking Implementation Notes (C2W1L14)
11 Learning Rate Decay (C2W2L09)
12 Understanding Mini-Batch Gradient Dexcent (C2W2L02)
13 Mini Batch Gradient Descent (C2W2L01)
14 The Problem of Local Optima (C2W3L10)
15 Exponentially Weighted Averages (C2W2L03)
16 Tuning Process (C2W3L01)
17 Understanding Exponentially Weighted Averages (C2W2L04)
18 Bias Correction of Exponentially Weighted Averages (C2W2L05)
19 Gradient Descent With Momentum (C2W2L06)
20 Normalizing Activations in a Network (C2W3L04)
21 Hyperparameter Tuning in Practice (C2W3L03)
22 Adam Optimization Algorithm (C2W2L08)
23 RMSProp (C2W2L07)
24 Fitting Batch Norm Into Neural Networks (C2W3L05)
25 Why Does Batch Norm Work? (C2W3L06)
26 Batch Norm At Test Time (C2W3L07)
27 Softmax Regression (C2W3L08)
28 Deep Learning Frameworks (C2W3L10)
29 Neural Network Overview (C1W3L01)
30 Training Softmax Classifier (C2W3L09)
31 Why Deep Representations? (C1W4L04)
32 Gradient Descent For Neural Networks (C1W3L09)
33 Neural Network Representations (C1W3L02)
34 TensorFlow (C2W3L11)
35 Activation Functions (C1W3L06)
36 Explanation For Vectorized Implementation (C1W3L05)
37 Getting Matrix Dimensions Right (C1W4L03)
38 Understanding Dropout (C2W1L07)
39 Building Blocks of a Deep Neural Network (C1W4L05)
40 Why Non-linear Activation Functions (C1W3L07)
41 Computing Neural Network Output (C1W3L03)
42 Backpropagation Intuition (C1W3L10)
43 Train/Dev/Test Sets (C2W1L01)
44 Deep L-Layer Neural Network (C1W4L01)
45 Random Initialization (C1W3L11)
46 Other Regularization Methods (C2W1L08)
47 Normalizing Inputs (C2W1L09)
48 Derivatives Of Activation Functions (C1W3L08)
49 Parameters vs Hyperparameters (C1W4L07)
50 Vectorizing Across Multiple Examples (C1W3L04)
51 What does this have to do with the brain? (C1W4L08)
52 Dropout Regularization (C2W1L06)
53 Vanishing/Exploding Gradients (C2W1L10)
54 Basic Recipe for Machine Learning (C2W1L03)
55 Bias/Variance (C2W1L02)
56 Forward Propagation in a Deep Network (C1W4L02)
57 Weight Initialization in a Deep Network (C2W1L11)
58 Numerical Approximations of Gradients (C2W1L12)
59 Regularization (C2W1L04)
60 Why Regularization Reduces Overfitting (C2W1L05)

The video explains that in high-dimensional spaces training is unlikely to get stuck in bad local optima and that saddle points are far more common, and it introduces plateaus, which can slow down learning. It discusses how gradient descent can spend a long time crossing a plateau and how more sophisticated algorithms such as momentum, RMSprop, or Adam can help. The video aims to provide intuition about the challenges that optimization algorithms may face in high-dimensional spaces.

Key Takeaways
  1. Understand the concept of local optima
  2. Recognize the difference between local optima and saddle points
  3. Identify plateaus in high-dimensional spaces
  4. Apply gradient descent and other optimization algorithms
  5. Use more sophisticated algorithms to escape plateaus
💡 In high-dimensional spaces, getting stuck in a bad local optimum is unlikely and saddle points are far more common, but plateaus can still slow down learning.
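
As a rough illustration of the last takeaway, the sketch below compares plain gradient descent with Adam on a one-dimensional plateau. Everything here is an assumption made for the example rather than something shown in the video: the toy cost J(w) = 1 - sigmoid(w), the starting point w = -8, the learning rate of 0.1, and the step count. The Adam update uses the commonly quoted defaults beta1 = 0.9, beta2 = 0.999, and eps = 1e-8.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost(w):
    """Hypothetical toy cost: nearly flat (a plateau) for very negative w."""
    return 1.0 - sigmoid(w)

def grad(w):
    """Derivative of the toy cost with respect to w."""
    s = sigmoid(w)
    return -s * (1.0 - s)

def run_gradient_descent(w, lr=0.1, steps=200):
    """Plain gradient descent: the step size is proportional to the gradient."""
    for _ in range(steps):
        w -= lr * grad(w)
    return w

def run_adam(w, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=200):
    """Adam with bias correction: steps are rescaled by a running gradient magnitude."""
    m = 0.0  # running average of gradients
    v = 0.0  # running average of squared gradients
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        m_hat = m / (1.0 - beta1 ** t)
        v_hat = v / (1.0 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

w_start = -8.0  # deep on the plateau, where the gradient is tiny
w_gd = run_gradient_descent(w_start)
w_adam = run_adam(w_start)

print(f"gradient descent: w = {w_gd:.3f}, cost = {cost(w_gd):.4f}")
print(f"Adam:             w = {w_adam:.3f}, cost = {cost(w_adam):.4f}")
```

On this contrived surface, plain gradient descent barely moves because each step is scaled by a near-zero gradient, while Adam normalizes by the recent gradient magnitude and so keeps taking steps of roughly the learning rate. Real cost surfaces are much less tidy, so this should be read as intuition only.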
