Adam Optimization Algorithm (C2W2L08)

DeepLearningAI · Beginner · 📐 ML Fundamentals · 8y ago

Key Takeaways

The Adam optimization algorithm is a popular and effective algorithm for training neural networks, combining the benefits of momentum and RMSProp. It is widely used in deep learning and has been shown to work well across a variety of architectures.

Full Transcript

During the history of deep learning, many researchers, including some very well-known researchers, sometimes proposed optimization algorithms and showed that they worked well on a few problems. But those optimization algorithms were subsequently shown not to really generalize that well to the wide range of neural networks you might want to train. So over time, I think the deep learning community actually developed some amount of skepticism about new optimization algorithms, and a lot of people felt that gradient descent with momentum really works well and that it was difficult to propose things that work much better. So RMSprop and the Adam optimization algorithm, which we'll talk about in this video, are among those rare algorithms that have really stood up and have been shown to work well across a wide range of deep learning architectures. So this is one of the algorithms that I wouldn't hesitate to recommend you try, because many people have tried it and seen it work well on many problems. The Adam optimization algorithm is basically taking momentum and RMSprop and putting them together. So let's see how that works.

To implement Adam, you would initialize V_dW = 0, S_dW = 0, and similarly V_db = 0, S_db = 0. Then on iteration t, you would compute the derivatives: compute dW, db using the current mini-batch. So usually you do this with mini-batch gradient descent. And then you do the momentum exponentially weighted average, so V_dW = β1 V_dW + (1 − β1) dW, but now I'm going to call this β1 to distinguish it from the hyperparameter β2 we'll use for the RMSprop portion of this. So this is exactly what we had when we were implementing momentum, except that I've now called the hyperparameter β1 instead of β. And similarly you have V_db = β1 V_db + (1 − β1) db. And then you do the RMSprop-like update as well, so now you have a different hyperparameter β2: S_dW = β2 S_dW + (1 − β2) dW². Again, the squaring there is element-wise squaring of your derivatives dW. And then S_db = β2 S_db + (1 − β2) db². So this is the momentum-like update with hyperparameter β1, and this is the RMSprop-like update with hyperparameter β2.

In the typical implementation of Adam, you do implement bias correction. So you're going to have V corrected, where "corrected" means after bias correction: V_dW^corrected = V_dW / (1 − β1^t) if you've done t iterations, and similarly V_db^corrected = V_db / (1 − β1^t). And then similarly you implement this bias correction on S as well, so that's S_dW^corrected = S_dW / (1 − β2^t) and S_db^corrected = S_db / (1 − β2^t).

Finally, you perform the update. So W gets updated as W minus α times, well, if you were just implementing momentum you'd use V_dW, or maybe V_dW^corrected, but now we add in the RMSprop portion of this, so we're also going to divide by the square root of S_dW^corrected plus ε: W := W − α V_dW^corrected / (√(S_dW^corrected) + ε). And similarly b gets updated by a similar formula: b := b − α V_db^corrected / (√(S_db^corrected) + ε). And so this algorithm combines the effect of gradient descent with momentum together with gradient descent with RMSprop, and this is a commonly used learning algorithm that's proven to be very effective for many different neural networks of a very wide variety of architectures.

So this algorithm has a number of hyperparameters. The learning rate hyperparameter α is still important and usually needs to be tuned, so you just have to try a range of values and see what works. A common choice, really the default choice, for β1 is 0.9. So this is the moving average, the weighted average, of dW; this is the momentum-like term. For the hyperparameter β2, the authors of the Adam paper, the inventors of the Adam algorithm, recommend 0.999. Again, this is computing the moving weighted average of dW² as well as db². And then ε: the choice of ε doesn't matter very much, but the authors of the Adam paper recommend 10^-8. But this is a parameter you really don't need to set, and it doesn't affect performance much at all. When implementing Adam, what people usually do is just use the default values of β1 and β2, as well as ε (I don't think anyone ever really tunes ε), and then try a range of values of α to see what works best. You can also tune β1 and β2, but it's not done that often among the practitioners I know.

So where does the term Adam come from? Adam stands for adaptive moment estimation. So β1 is computing the mean of the derivatives; this is called the first moment. And β2 is used to compute the exponentially weighted average of the squares, and that's called the second moment. So that gives rise to the name adaptive moment estimation, but everyone just calls it the Adam optimization algorithm. And by the way, one of my long-term friends and collaborators is called Adam Coates. As far as I know, this algorithm doesn't have anything to do with him, except for the fact that I think he uses it sometimes. But sometimes I get asked that question, so just in case you were wondering.

So that's it for the Adam optimization algorithm. With it, I think you'll be able to train your neural networks much more quickly. But before we wrap up for this week, let's keep talking about hyperparameter tuning, as well as gaining some more intuitions about what the optimization problem for neural networks looks like. In the next video, we'll talk about learning rate decay.
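
For reference, the update rule that the lecture describes verbally can be written out in one place. This is a reconstruction from the spoken description rather than a copy of the lecture slides; the "corrected" superscript and the iteration counter t follow the transcript, squares and square roots are element-wise, and the snippet assumes the amsmath package for the align* environment.

```latex
\begin{align*}
V_{dW} &= \beta_1 V_{dW} + (1-\beta_1)\, dW,
  & V_{db} &= \beta_1 V_{db} + (1-\beta_1)\, db \\
S_{dW} &= \beta_2 S_{dW} + (1-\beta_2)\, dW^{2},
  & S_{db} &= \beta_2 S_{db} + (1-\beta_2)\, db^{2} \\
V_{dW}^{\mathrm{corrected}} &= \frac{V_{dW}}{1-\beta_1^{t}},
  & V_{db}^{\mathrm{corrected}} &= \frac{V_{db}}{1-\beta_1^{t}} \\
S_{dW}^{\mathrm{corrected}} &= \frac{S_{dW}}{1-\beta_2^{t}},
  & S_{db}^{\mathrm{corrected}} &= \frac{S_{db}}{1-\beta_2^{t}} \\
W &:= W - \alpha\, \frac{V_{dW}^{\mathrm{corrected}}}{\sqrt{S_{dW}^{\mathrm{corrected}}} + \varepsilon},
  & b &:= b - \alpha\, \frac{V_{db}^{\mathrm{corrected}}}{\sqrt{S_{db}^{\mathrm{corrected}}} + \varepsilon}
\end{align*}
```

As a quick check of the bias correction: at t = 1 the averages start from zero, so V_dW = (1 − β1) dW, and dividing by 1 − β1 recovers dW exactly.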

Original Description

Take the Deep Learning Specialization: http://bit.ly/2vBG4xl Check out all our courses: https://www.deeplearning.ai Subscribe to The Batch, our weekly newsletter: https://www.deeplearning.ai/thebatch Follow us: Twitter: https://twitter.com/deeplearningai_ Facebook: https://www.facebook.com/deeplearningHQ/ Linkedin: https://www.linkedin.com/company/deeplearningai

The Adam optimization algorithm is a widely used and effective algorithm for training neural networks. It combines the benefits of momentum and RMSProp and has been shown to work well across a variety of architectures. In this video, we learn how to implement the Adam optimization algorithm and tune its hyperparameters.

Key Takeaways
  1. Initialize the moving averages V_dW, S_dW, V_db, and S_db to zero
  2. Compute the exponentially weighted average of the derivatives (momentum term, β1)
  3. Compute the exponentially weighted average of the squared derivatives (RMSprop term, β2)
  4. Apply bias correction to both averages
  5. Update the weights and biases using the bias-corrected averages
💡 The Adam optimization algorithm combines the benefits of momentum and RMSProp, making it a widely used and effective algorithm for training neural networks.
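
The five steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the course's reference implementation: the names adam_init, adam_step, and minimize_quadratic are hypothetical, parameters are stored as a flat list of floats, and the toy objective f(w) = sum of w_i squared is chosen only because its gradient (2w) is easy to write down. The defaults for beta1, beta2, and eps follow the values recommended in the Adam paper; the default alpha is a placeholder that would normally be tuned.

```python
import math


def adam_init(n):
    """Step 1: start both moving averages (v and s) at zero, one entry per parameter."""
    return [0.0] * n, [0.0] * n


def adam_step(w, dw, v, s, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update; t is the 1-based iteration count.

    Returns new lists (w, v, s) and leaves the inputs unmodified.
    """
    new_w, new_v, new_s = [], [], []
    for w_i, g_i, v_i, s_i in zip(w, dw, v, s):
        # Step 2: momentum-like average of the gradient.
        v_i = beta1 * v_i + (1.0 - beta1) * g_i
        # Step 3: RMSprop-like average of the squared gradient.
        s_i = beta2 * s_i + (1.0 - beta2) * g_i * g_i
        # Step 4: bias correction for both averages.
        v_hat = v_i / (1.0 - beta1 ** t)
        s_hat = s_i / (1.0 - beta2 ** t)
        # Step 5: parameter update.
        new_w.append(w_i - alpha * v_hat / (math.sqrt(s_hat) + eps))
        new_v.append(v_i)
        new_s.append(s_i)
    return new_w, new_v, new_s


def minimize_quadratic(w0, steps=2000, alpha=0.01):
    """Toy usage: minimize f(w) = sum(w_i ** 2), whose gradient is 2 * w."""
    w = list(w0)
    v, s = adam_init(len(w))
    for t in range(1, steps + 1):
        dw = [2.0 * w_i for w_i in w]
        w, v, s = adam_step(w, dw, v, s, t, alpha=alpha)
    return w
```

Calling minimize_quadratic([1.0, -2.0]) should move both parameters toward 0. Note that on the first iteration the bias-corrected ratio is close to plus or minus 1, so each parameter moves by roughly alpha regardless of the gradient's scale, which is one reason alpha is the hyperparameter most worth tuning.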
