Gradient Checking Implementation Notes (C2W1L14)

DeepLearningAI · Beginner ·📐 ML Fundamentals ·8y ago


Key Takeaways

The video gives practical tips for implementing gradient checking in neural networks: use backpropagation, not the numerical approximation, to compute derivatives during training; run gradient checking only while debugging; when the check fails, compare individual components of dθ_approx and dθ to locate the bug; include the regularization term in the cost and its gradient; turn off dropout while checking; and consider re-running the check after training for a while.

Full Transcript

In the last video you learned about gradient checking. In this video, I want to share with you some practical tips, or some notes, on how to actually go about implementing this for your neural network.

First, don't use grad check in training, only to debug. What I mean is that computing dθ_approx[i] for all the values of i is a very slow computation. So to implement gradient descent, you'd use backprop to compute dθ, and just use backprop to compute the derivative. It's only when you're debugging that you would compute this to make sure it's close to dθ. But once you've done that, you would turn off the grad check and not run it during every iteration of gradient descent, because that's just much too slow.

Second, if an algorithm fails grad check, look at the components, look at the individual components, to try to identify the bug. What I mean by that is, if dθ_approx is very far from dθ, what I would do is look at the different values of i to see which values of dθ_approx are really very different from the values of dθ. For example, if you find that the values of dθ_approx and dθ that are very far off all correspond to db[l] for some layer or for some layers, but the components for dW are quite close (remember, different components of θ correspond to different components of b and W), then maybe the bug is in how you're computing db, the derivative with respect to the parameters b. And similarly, vice versa: if you find that the values of dθ_approx that are very far from dθ all came from dW, or from dW in a certain layer, then that might help you home in on the location of the bug. This doesn't always let you identify the bug right away, but sometimes it gives you some guesses about where to track down the bug.

Next, when doing grad check, remember your regularization term, if you're using regularization.
If your cost function is J(θ) equals 1/m times the sum of your losses, plus this regularization term, the sum over l of the squared Frobenius norm of W[l], then this is the definition of J, and you should have that dθ is the gradient of J with respect to θ, including the regularization term. So just remember to include that term.

Next, grad check doesn't work with dropout, because in every iteration dropout is randomly eliminating different subsets of the hidden units. There isn't an easy-to-compute cost function J that dropout is doing gradient descent on. It turns out that dropout can be viewed as optimizing some cost function J, but that cost function is defined by summing over all the exponentially many subsets of nodes that could be eliminated in any iteration. So the cost function J is very difficult to compute, and you're just sampling the cost function every time you eliminate a different random subset of units when you use dropout. So it's difficult to use grad check to double-check your computation with dropout. What I usually do is implement grad check without dropout (if you want, you can set keep_prob in dropout to be equal to 1.0), and then turn on dropout and hope that my implementation of dropout was correct. There are some other things you could do, like fix the pattern of nodes dropped and verify that grad check for that pattern of killed-off units is correct, but in practice I don't usually do that. So my recommendation is: turn off dropout, use grad check to double-check that your algorithm is at least correct without dropout, and then turn on dropout.

Finally, this is a subtlety. It rarely happens, but it's not impossible that your implementation of gradient descent is correct when W and b are close to zero, so at random initialization, but that as you run gradient descent and W and b become bigger, things change: maybe your implementation of backprop is correct only when W and b are close to 0, but gets more inaccurate when W and b become large.
I don't do this very often, but one thing you could do is run grad check at random initialization, then train the network for a while so that W and b have some time to wander away from zero, from their small random initial values, and then run grad check again after you've trained for some number of iterations.

So that's it for gradient checking, and congratulations on coming to the end of this week's materials. In this week you learned about how to set up your train, dev, and test sets, how to analyze bias and variance, and what things to do if you have high bias versus high variance versus maybe high bias and high variance. You also saw how to apply different forms of regularization, like L2 regularization and dropout, on your neural network, some tricks for speeding up the training of your neural network, and then finally gradient checking. So I think you've seen a lot in this week, and you get to exercise all these ideas in this week's programming exercise. So best of luck with that, and I look forward to seeing you in the week 2 materials.
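The first and last tips in the transcript (check only while debugging, and check again once W and b have moved away from their initial values) could be sketched roughly as below. This is an illustrative sketch, not code from the course: it uses plain logistic regression instead of a deep network, stores examples as rows of X, and the names `grad_check`, `train`, and `check_at` are hypothetical. The check costs two cost evaluations per parameter, which is why it runs at only two chosen iterations rather than inside every step.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    # theta packs the weights followed by the bias: [w_1, ..., w_n, b]
    a = sigmoid(X @ theta[:-1] + theta[-1])
    return -np.mean(y * np.log(a) + (1 - y) * np.log(1 - a))

def backprop(theta, X, y):
    # Analytic gradient of the cost above, in the same packing as theta
    a = sigmoid(X @ theta[:-1] + theta[-1])
    dz = a - y
    return np.append(X.T @ dz / len(y), np.mean(dz))

def grad_check(theta, X, y, eps=1e-7):
    # Two-sided difference: two cost evaluations per parameter (slow)
    approx = np.zeros_like(theta)
    for i in range(theta.size):
        plus, minus = theta.copy(), theta.copy()
        plus[i] += eps
        minus[i] -= eps
        approx[i] = (cost(plus, X, y) - cost(minus, X, y)) / (2 * eps)
    grad = backprop(theta, X, y)
    # Relative difference between the numerical and analytic gradients
    return np.linalg.norm(approx - grad) / (np.linalg.norm(approx) + np.linalg.norm(grad))

def train(X, y, iterations=100, lr=0.1, check_at=(0, 50)):
    rng = np.random.default_rng(0)
    theta = rng.normal(scale=0.01, size=X.shape[1] + 1)  # small random init
    diffs = {}
    for it in range(iterations):
        if it in check_at:           # debugging only: never on every iteration
            diffs[it] = grad_check(theta, X, y)
        theta -= lr * backprop(theta, X, y)   # training itself uses backprop
    return theta, diffs

# Synthetic, noisy data purely for illustration
rng = np.random.default_rng(42)
X = rng.normal(size=(50, 3))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=50) > 0).astype(float)
theta, diffs = train(X, y)
print(diffs)  # relative difference at initialization and after 50 iterations
```

Once both checks look small, `check_at=()` would turn the check off for real training runs.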

Original Description

Take the Deep Learning Specialization: http://bit.ly/2VGFA3w Check out all our courses: https://www.deeplearning.ai Subscribe to The Batch, our weekly newsletter: https://www.deeplearning.ai/thebatch Follow us: Twitter: https://twitter.com/deeplearningai_ Facebook: https://www.facebook.com/deeplearningHQ/ Linkedin: https://www.linkedin.com/company/deeplearningai

This video provides practical tips for implementing gradient checking in neural networks, including using backpropagation and handling regularization and dropout layers. It also covers debugging techniques for identifying issues in the gradient approximation.

Key Takeaways

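Use backpropagation, not the numerical approximation, to compute derivatives during training
Run gradient checking only while debugging, then turn it off
When the check fails, compare individual components of dθ_approx and dθ to locate the bug
Include the regularization term in the cost function and in its gradient
Turn off dropout while checking by setting keep_prob to 1.0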
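As an illustration of the component-wise debugging and regularization takeaways above, here is a rough sketch for a small two-layer network. Several things are assumptions for the example rather than content from the video: the tanh/sigmoid architecture, the λ/(2m) scaling of the L2 term (and the matching λ/m term in dW), the helper names `unpack`, `pack`, and `per_block_diff`, and the `buggy_b1` switch that plants a deliberate bug in db1. Examples are stored as columns of X.

```python
import numpy as np

def unpack(theta, shapes):
    # Split the flat vector theta back into named parameter arrays
    params, start = {}, 0
    for name, shape in shapes.items():
        size = int(np.prod(shape))
        params[name] = theta[start:start + size].reshape(shape)
        start += size
    return params

def pack(arrays, shapes):
    # Flatten named arrays into one vector, in the order given by shapes
    return np.concatenate([arrays[name].ravel() for name in shapes])

def forward(p, X):
    A1 = np.tanh(p["W1"] @ X + p["b1"])
    A2 = 1.0 / (1.0 + np.exp(-(p["W2"] @ A1 + p["b2"])))
    return A1, A2

def cost(theta, shapes, X, Y, lambd):
    p = unpack(theta, shapes)
    _, A2 = forward(p, X)
    m = X.shape[1]
    data = -np.sum(Y * np.log(A2) + (1 - Y) * np.log(1 - A2)) / m
    # The L2 term is part of J, so the gradient must include it too
    reg = lambd / (2 * m) * (np.sum(p["W1"] ** 2) + np.sum(p["W2"] ** 2))
    return data + reg

def backprop(theta, shapes, X, Y, lambd, buggy_b1=False):
    p = unpack(theta, shapes)
    A1, A2 = forward(p, X)
    m = X.shape[1]
    dZ2 = A2 - Y
    dZ1 = (p["W2"].T @ dZ2) * (1 - A1 ** 2)
    grads = {
        "W1": dZ1 @ X.T / m + lambd / m * p["W1"],
        "b1": np.sum(dZ1, axis=1, keepdims=True) / m,
        "W2": dZ2 @ A1.T / m + lambd / m * p["W2"],
        "b2": np.sum(dZ2, axis=1, keepdims=True) / m,
    }
    if buggy_b1:
        grads["b1"] = grads["b1"] * m  # deliberate bug: as if the 1/m were forgotten
    return pack(grads, shapes)

def numerical_grad(theta, shapes, X, Y, lambd, eps=1e-7):
    approx = np.zeros_like(theta)
    for i in range(theta.size):
        plus, minus = theta.copy(), theta.copy()
        plus[i] += eps
        minus[i] -= eps
        approx[i] = (cost(plus, shapes, X, Y, lambd)
                     - cost(minus, shapes, X, Y, lambd)) / (2 * eps)
    return approx

def per_block_diff(approx, grad, shapes):
    # Relative difference computed separately for each parameter block
    report, start = {}, 0
    for name, shape in shapes.items():
        size = int(np.prod(shape))
        a, g = approx[start:start + size], grad[start:start + size]
        report[name] = np.linalg.norm(a - g) / (np.linalg.norm(a) + np.linalg.norm(g))
        start += size
    return report

rng = np.random.default_rng(1)
shapes = {"W1": (4, 3), "b1": (4, 1), "W2": (1, 4), "b2": (1, 1)}
theta = rng.normal(scale=0.5, size=sum(int(np.prod(s)) for s in shapes.values()))
X = rng.normal(size=(3, 20))
Y = (rng.random((1, 20)) > 0.5).astype(float)

approx = numerical_grad(theta, shapes, X, Y, lambd=0.7)
good = per_block_diff(approx, backprop(theta, shapes, X, Y, 0.7), shapes)
bad = per_block_diff(approx, backprop(theta, shapes, X, Y, 0.7, buggy_b1=True), shapes)
print(good)
print(bad)  # only the "b1" entry should stand out
```

Reporting one number per block, rather than a single overall difference, is what points the search at db1 instead of at the whole backward pass.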

💡 Gradient checking can help identify bugs in a backpropagation implementation, but it is too slow to run on every training iteration and does not work while dropout is enabled.
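A small sketch of why dropout gets in the way: with a fresh random mask on every forward pass, two evaluations of the cost at the same parameters generally disagree, so a finite difference of the cost mostly measures mask noise. The tiny network, the inverted-dropout scaling, and the function name `cost_with_dropout` are assumptions made for this example; only the `keep_prob = 1.0` setting comes from the video.

```python
import numpy as np

def cost_with_dropout(W1, W2, X, Y, keep_prob, rng):
    # Forward pass of a tiny two-layer net with inverted dropout on the hidden layer
    A1 = np.tanh(W1 @ X)
    mask = rng.random(A1.shape) < keep_prob  # fresh random mask on every call
    A1 = A1 * mask / keep_prob
    A2 = 1.0 / (1.0 + np.exp(-(W2 @ A1)))
    return -np.mean(Y * np.log(A2) + (1 - Y) * np.log(1 - A2))

rng = np.random.default_rng(0)
W1 = rng.normal(size=(5, 3))
W2 = rng.normal(size=(1, 5))
X = rng.normal(size=(3, 10))
Y = (rng.random((1, 10)) > 0.5).astype(float)

# Same parameters, same data, evaluated twice each
noisy = [cost_with_dropout(W1, W2, X, Y, 0.5, rng) for _ in range(2)]
fixed = [cost_with_dropout(W1, W2, X, Y, 1.0, rng) for _ in range(2)]
print(noisy)  # two different values: the mask changed between calls
print(fixed)  # two identical values: keep_prob = 1.0 keeps every unit
```

With `keep_prob` set to 1.0 the mask keeps every unit, the cost becomes a deterministic function of the parameters, and a numerical gradient of it is meaningful again.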
