Loss Functions - EXPLAINED!

CodeEmporium · Beginner · 📄 Research Papers Explained · 6y ago

Skills: Supervised Learning 90% · ML Maths Basics 80%

Key Takeaways

The video explains various loss functions in machine learning, including squared loss, absolute loss, pseudo-Huber loss, cross-entropy loss, and hinge loss, with a discussion on their pros and cons, and introduces the concept of adaptive loss functions.
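
As a quick reference for the losses named above, here is a minimal NumPy sketch of their per-example forms. The function names, the `delta` default, the `eps` clipping, and the {-1, +1} label convention for the hinge loss are illustrative assumptions, not something shown in the video.

```python
import numpy as np

def squared_loss(y_true, y_pred):
    # L2 loss: grows quadratically, so large errors dominate.
    return (y_true - y_pred) ** 2

def absolute_loss(y_true, y_pred):
    # L1 loss: grows linearly, so outliers count like any other point.
    return np.abs(y_true - y_pred)

def pseudo_huber_loss(y_true, y_pred, delta=1.0):
    # Smooth compromise: roughly quadratic for small errors,
    # roughly linear for large ones. `delta` sets the crossover scale.
    r = (y_true - y_pred) / delta
    return delta ** 2 * (np.sqrt(1.0 + r ** 2) - 1.0)

def hinge_loss(y_true, score):
    # Assumes labels in {-1, +1} and a raw (unsquashed) classifier score.
    # Correct predictions inside the margin (y * score < 1) still cost something.
    return np.maximum(0.0, 1.0 - y_true * score)

def cross_entropy_loss(y_true_onehot, probs, eps=1e-12):
    # Negative log-probability assigned to the true class (natural log).
    return -np.sum(y_true_onehot * np.log(np.clip(probs, eps, 1.0)), axis=-1)

# Compare how the three regression losses react to a small and a large error.
for err in (0.1, 10.0):
    print(err, squared_loss(0.0, err), absolute_loss(0.0, err),
          pseudo_huber_loss(0.0, err))
```

The loop makes the pros and cons visible: the squared loss explodes for the large error, while the absolute and pseudo-Huber losses stay on the same order as the error itself.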

Full Transcript

What is the best loss function? The age-old question in machine learning. Will we solve this problem today? Nope. But we will talk about some loss functions, their pros and cons, and even discuss a recent paper on adaptive loss functions, which would mean that you don't need to keep trying out different losses to find the one that best suits your needs. Fun stuff.

OK, first let's focus on regression. I've got this data set here. It looks like a line can fit this data. I'll train a linear regression model on it, and I choose to minimize the squared loss, the L2 loss. My curve looks something like this. OK, looks pretty good, and it's also pretty simple. But if I introduce some outliers in this data, my model responds by freaking the hell out and trying to fit those data points better. This happens because the square term scales up the errors from these outliers, so the model really wants to get these obscure points right.

I'll just change the loss function to the absolute difference. My model now treats the outliers like any other data point, so it won't go out of its way for outliers if it means compromising the rest of the model. This might lead to poor predictions from time to time, but if you really don't care about the extreme cases, this will do. Support vector regression uses this, by the way.

The advantage of the squared error is the ease with which we can compute the gradient during gradient descent. This gradient is not as simple in the absolute error case because of the points of discontinuity in its derivative. The mean absolute error isn't optimized through plain gradient descent; it's optimized by computing subgradients instead. It adds a bit more complexity, and I'll add some reading material in the description down below.

We've got two losses: one that loves outliers and another that ignores them. If you think that one doesn't work, you'll just use the other, and that might be fine in most cases. But consider this: our data is about 70% in one direction and 30% in the other direction. Technically this data does not have any outliers, but our absolute loss may treat the 30% as outliers and ignore it altogether, while the squared loss will try to capture that 30%. Both decisions can lead to poor model performance. How do we compromise?

We can do so by using the pseudo-Huber loss. This is the best of both losses: if a data point has a relatively low error, we take the squared loss; if the data point is an outlier, we take the absolute loss. The result is that it reduces the effect of outliers on the model while still being differentiable, and as such it's slightly more complex. The main problem here is that we have an extra hyperparameter to play with. These are the most popular regression-based losses that you see in built-in regressors.

Now for classification losses. In classification, our outputs are obviously the class, but more precisely it's the list of probabilities of belonging to different classes, and we just choose the class with the highest probability, cuz duh. This list is a probability distribution. We compare this to the ground truth, and how we compare it depends on the loss we use.

So, cross-entropy loss. Entropy has its roots in information theory, so I'll explain it from that perspective. Say that there's this weather station, and it sends you a weather forecast at the beginning of each day. It tells you what the weather is on that day using some n bits of information. In its best case, say this information can be packed into as few as 3 bits on average: 2 bits for a sunny day, 4 bits for a rainy day, 3 bits for a partly cloudy day, and so on. The entropy of a distribution is the average number of bits required to convey a piece of information, like today's weather in this case. So the entropy in this example is 3 bits.

But the tower isn't perfect. It's designed by engineers who have flaws themselves. There is some wastage, and it is found that the tower actually sends you 5 bits on average. This is cross-entropy. We are comparing the true average and the satellite's current average: entropy is 3 bits, but cross-entropy is 5 bits. This means that we could have had a system that tells us the weather with just 3 bits, but the system we currently have, our satellite, is using 5 bits to do the same thing. Ideally we want these numbers to be much closer to each other. This 2-bit difference is known as the KL, or Kullback-Leibler, divergence. This little satellite is actually similar to a model that we train to predict the weather in machine learning as a classification problem, and so in many classification problems cross-entropy and KL divergence are often used as loss functions to minimize.

Another loss is the hinge loss, typically used in support vector machines for classification tasks. Minimizing this, we get a boundary that splits the data well and is as far away from every data point as possible; that is, it maximizes the minimum margin from the data points. This loss penalizes data points even if they are correctly labeled, if they lie in this margin. I've made several overly mathematical videos on kernels and SVMs. Check them out if you want to lower your self-esteem.

I'm gonna wrap up this video with a paper discussion. We've taken a rough look at six common losses for classification and regression, but there are far more, some better suited for certain problems. We have a set of points we want to fit a regression line through. Squared loss does it decently well. We add outliers and try to fit it again: doesn't look too great anymore. So we try the pseudo-Huber loss, and this gives us better results. But I'm not satisfied yet, so let's try some other losses. So we have the Welsch loss: results are trash. Geman-McClure loss: it fits this data better. Now Cauchy loss: this fits the data even better. I like this. It's nice that I found the loss function I liked, but I found this by trial and error. Is there a way that I could have just used a loss function without trial and error and somehow arrived at the actual minimum that I wanted?

Turns out that all these losses that I mentioned can be generalized to this equation by setting different values for alpha, which is a shape parameter. How do we add alpha into the mix, though? Maximum likelihood estimation: we maximize the likelihood of the probability distribution, or minimize the negative log-likelihood, so it becomes an adaptive loss. This technique is typically used to derive losses mathematically, and this actually leads to some interesting results. Here are some examples of images when we let a variational autoencoder determine the loss and generate images. They aren't half bad. The idea of an adaptive loss sounds like an amazing idea.

Hope you all have a better idea behind loss functions, the differences between them, and a sprinkle of research on adaptive loss functions, so we can avoid trial and error to determine the most appropriate losses. I have resources in the description below. If you like these videos, please subscribe to keep the lights up in my little apartment, and I will see you soon. Bye!
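
The weather-station example in the transcript can be made concrete with a few lines of NumPy. This is a sketch under assumptions: the two four-outcome distributions below are made up for illustration (they give 1.75 and 2 bits, not the 3 and 5 bits quoted in the video), and the function names are hypothetical.

```python
import numpy as np

def entropy_bits(p):
    # Average number of bits needed under the ideal code for distribution p.
    p = np.asarray(p, dtype=float)
    nz = p > 0  # treat 0 * log(0) as 0
    return float(-np.sum(p[nz] * np.log2(p[nz])))

def cross_entropy_bits(p, q):
    # Average number of bits actually used when events follow p
    # but the code was designed for distribution q.
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    nz = p > 0
    return float(-np.sum(p[nz] * np.log2(q[nz])))

def kl_divergence_bits(p, q):
    # The wasted bits: cross-entropy minus entropy.
    return cross_entropy_bits(p, q) - entropy_bits(p)

# Assumed distributions over four kinds of weather (illustrative only).
true_weather = [0.5, 0.25, 0.125, 0.125]   # what actually happens
station_model = [0.25, 0.25, 0.25, 0.25]   # what the sender assumes

print("entropy:", entropy_bits(true_weather))
print("cross-entropy:", cross_entropy_bits(true_weather, station_model))
print("KL divergence:", kl_divergence_bits(true_weather, station_model))
```

When the true distribution is a one-hot label, its entropy is zero, so minimizing cross-entropy and minimizing KL divergence lead to the same model.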

Original Description

Many animations used in this video came from Jonathan Barron [1, 2]. Give this researcher a like for his hard work! SUBSCRIBE FOR MORE CONTENT!

RESOURCES
[1] Paper on adaptive loss function: https://arxiv.org/abs/1701.03077
[2] CVPR paper presentation: https://www.youtube.com/watch?v=BmNKbnF69eY
[3] Regression Loss Functions: https://alexisalulema.com/2017/12/07/loss-functions-part-1/
[4] Classification Losses: https://alexisalulema.com/2017/12/07/loss-functions-part-1/
[5] ML cheat sheet for loss functions: https://ml-cheatsheet.readthedocs.io/en/latest/loss_functions.html
[6] 7 loss functions with python code: https://www.analyticsvidhya.com/blog/2019/08/detailed-guide-7-loss-functions-machine-learning-python-code/
[7] A Blog for most common Loss Functions: https://towardsdatascience.com/understanding-the-3-most-common-loss-functions-for-machine-learning-regression-23e0ef3e14d3
[8] Modeling the Huber loss: https://www.textbook.ds100.org/ch/10/modeling_abs_huber.html
[9] Notes on Subgradients: https://see.stanford.edu/materials/lsocoee364b/01-subgradients_notes.pdf
[10] Code to get up to speed: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html
[11] What is the difference between KL divergence and Cross Entropy loss: https://stats.stackexchange.com/questions/357963/what-is-the-difference-cross-entropy-and-kl-divergence
[12] A great video explanation of Entropy, Cross Entropy and KL divergence: https://www.youtube.com/watch?v=ErfnhcEV1O8

This video teaches the basics of loss functions in machine learning, including their types, pros, and cons, and introduces adaptive loss functions as a way to avoid trial and error in determining the most suitable loss function.

Key Takeaways

Choose a loss function for a regression problem
Implement a regression model with a chosen loss function
Evaluate the model's performance
Choose a loss function for a classification problem
Implement a classification model with a chosen loss function
Evaluate the model's performance
Consider using adaptive loss functions to avoid trial and error
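
The regression steps in the list above (choose a loss, fit a model, evaluate it) can be sketched with NumPy alone. Everything here is an assumption made for illustration: the synthetic line `y = 2x + 1`, the five injected outliers, the hypothetical helper `fit_line_pseudo_huber`, and its learning rate and step count. It is not the setup used in the video.

```python
import numpy as np

def fit_line_pseudo_huber(x, y, delta=1.0, lr=0.02, steps=20000):
    # Plain gradient descent on the mean pseudo-Huber loss of a line w*x + b.
    w, b = 0.0, 0.0
    for _ in range(steps):
        r = (w * x + b) - y                       # residuals
        g = r / np.sqrt(1.0 + (r / delta) ** 2)   # d(loss)/d(prediction), bounded by delta
        w -= lr * np.mean(g * x)
        b -= lr * np.mean(g)
    return w, b

# Synthetic data: a clean line, then five points pushed far off it.
x = np.linspace(0.0, 10.0, 50)
y_clean = 2.0 * x + 1.0
y = y_clean.copy()
y[-5:] += 30.0

# Step 1-2: fit with squared loss (closed-form least squares) and pseudo-Huber.
w_l2, b_l2 = np.polyfit(x, y, 1)
w_ph, b_ph = fit_line_pseudo_huber(x, y)

# Step 3: evaluate each fit against the clean line on the non-outlier points.
inliers = slice(0, 45)
mae_l2 = np.mean(np.abs(w_l2 * x[inliers] + b_l2 - y_clean[inliers]))
mae_ph = np.mean(np.abs(w_ph * x[inliers] + b_ph - y_clean[inliers]))

print("squared loss slope:", w_l2, "MAE on inliers:", mae_l2)
print("pseudo-Huber slope:", w_ph, "MAE on inliers:", mae_ph)
```

In this setup the squared-loss line is dragged toward the outliers, while the pseudo-Huber fit stays close to the underlying slope of 2, because each outlier's pull on the gradient is capped at `delta`.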

💡 Adaptive loss functions can be used to avoid trial and error in determining the most suitable loss function for a problem.
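
For readers who want to see the shape parameter in action, below is a sketch of the general robust loss as I understand it from the adaptive-loss paper linked in the description (arXiv:1701.03077). The function name, the `c` scale default, and the explicit handling of the limiting cases are my assumptions. The sketch covers only the loss shape for a fixed alpha, not the negative-log-likelihood machinery that lets alpha be learned.

```python
import numpy as np

def general_robust_loss(x, alpha, c=1.0):
    # x: residual(s); alpha: shape parameter; c: scale parameter (> 0).
    z = (np.asarray(x, dtype=float) / c) ** 2
    if alpha == 2:
        return 0.5 * z                      # squared (L2) loss
    if alpha == 0:
        return np.log(0.5 * z + 1.0)        # Cauchy loss
    if alpha == -np.inf:
        return 1.0 - np.exp(-0.5 * z)       # Welsch loss
    a = abs(alpha - 2.0)
    # General form; alpha=1 gives pseudo-Huber, alpha=-2 gives Geman-McClure.
    return (a / alpha) * ((z / a + 1.0) ** (alpha / 2.0) - 1.0)

# The same large residual is penalized less and less as alpha decreases.
for alpha in (2, 1, 0, -2, -np.inf):
    print(alpha, general_robust_loss(3.0, alpha))
```

Lowering alpha makes the loss flatter for large residuals, which is why a single tunable alpha can stand in for trying each named loss by hand.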
