Loss Functions - EXPLAINED!
Key Takeaways
The video explains various loss functions in machine learning, including squared loss, absolute loss, pseudo-Huber loss, cross-entropy loss, and hinge loss, with a discussion on their pros and cons, and introduces the concept of adaptive loss functions.
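As a rough companion to the summary above, the five losses it names can be sketched in a few lines of NumPy. This is a minimal illustration, not code from the video: the function names, the `delta` default, the `eps` clipping constant, and the sample arrays are all assumptions chosen for readability.

```python
import numpy as np

def squared_loss(y_true, y_pred):
    """L2 loss: mean of squared residuals, so large errors dominate."""
    r = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(r ** 2))

def absolute_loss(y_true, y_pred):
    """L1 loss: mean of absolute residuals, so every error counts linearly."""
    r = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(r)))

def pseudo_huber_loss(y_true, y_pred, delta=1.0):
    """Smooth blend: roughly quadratic for small residuals, linear for large ones."""
    r = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(delta ** 2 * (np.sqrt(1.0 + (r / delta) ** 2) - 1.0)))

def cross_entropy_loss(p_true, q_pred, eps=1e-12):
    """Cross-entropy in bits between a true distribution p and a predicted q."""
    p = np.asarray(p_true, dtype=float)
    q = np.clip(np.asarray(q_pred, dtype=float), eps, 1.0)  # avoid log(0)
    return float(-np.sum(p * np.log2(q)))

def hinge_loss(y_true, scores):
    """Hinge loss for labels in {-1, +1}; zero only when y * score >= 1."""
    y = np.asarray(y_true, dtype=float)
    s = np.asarray(scores, dtype=float)
    return float(np.mean(np.maximum(0.0, 1.0 - y * s)))

# Hypothetical data: the last prediction is a large miss (an "outlier" error).
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 14.0])
print("squared     :", squared_loss(y_true, y_pred))
print("absolute    :", absolute_loss(y_true, y_pred))
print("pseudo-Huber:", pseudo_huber_loss(y_true, y_pred, delta=1.0))
```

Running the three regression losses on the same residuals shows how much more the single large miss dominates the squared loss than the other two.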
Full Transcript
What is the best loss function? The age-old question in machine learning. Will we solve this problem today? Nope. But we will talk about some loss functions, their pros and cons, and even discuss a recent paper on adaptive loss functions. That would mean that you don't need to keep trying out different losses to find the one that best suits your needs. Fun stuff.

OK, first let's focus on regression. I've got this data set here. It looks like a line can fit this data. I'll train a linear regression model on it, and I choose to minimize the squared loss, the L2 loss. My curve looks something like this. OK, looks pretty good, and it's also pretty simple. But if I introduce some outliers in this data, my model responds by freaking the hell out and trying to fit those data points better. This happens because that square term scales up the errors on these outliers, so the model really wants to get these obscure points right.

I'll just change the loss function to the absolute difference. My model now treats the outliers like any other data point, so it won't go out of its way for outliers if it means compromising the rest of the model. This might lead to poor predictions from time to time, but if you really don't care about the extreme cases, this will do. Support vector regression uses a variant of this, by the way.

The advantage of the squared error is the ease with which we can compute the gradient during gradient descent. The gradient is not as simple in the absolute error case because of the points where the loss is not differentiable. The mean absolute error isn't optimized through plain gradient descent; it's optimized by computing subgradients instead. It adds a bit more complexity, and I'll add some reading material in the description down below.

We've got two losses: one that loves outliers and another that ignores them. If you think that one doesn't work, you'll just use the other, and that might be fine in most cases. But consider this: our data is about 70% in one direction and 30% in the other direction. Technically this data does not have any outliers, but our absolute loss may treat the 30% data as outliers and ignore it altogether, while the squared loss will try to capture those 30%. Both decisions can lead to poor model performance. How do we compromise?

We can do so by using the pseudo-Huber loss. This is the best of both losses. If a data point has a relatively low error, it behaves like the squared loss. If the data point is an outlier, it behaves like the absolute loss. The result is that it reduces the effect of outliers on the model while still being differentiable, and as such it's slightly more complex. The main problem here is that we have an extra hyperparameter to play with. These are the most popular regression losses that you see in built-in regressors.

Now for classification losses. In classification our output is obviously the class, but more precisely it's a list of probabilities of belonging to different classes, and we just choose the class with the highest probability, cuz duh. This list is a probability distribution. We compare this to the ground truth, and how we compare it depends on the loss we use.

So, cross-entropy loss. Entropy has its roots in information theory, so I'll explain it from that perspective. Say that there's this weather station, and it sends you a weather forecast at the beginning of each day. It tells you what the weather is on that day using some n bits of information. In its best case, say this information can be packed into as few as three bits on average: two bits for a sunny day, four bits for a rainy day, three bits for a partly cloudy day, and so on. The entropy of a distribution is the average number of bits required to convey a piece of information, like today's weather in this case. So the entropy in this example is three bits.

But the tower isn't perfect. It's designed by engineers who have flaws themselves. There is some wastage, and it is found that the tower actually sends you five bits on average. This is cross-entropy. We are comparing the true average and the satellite's current average. Entropy is three bits, but cross-entropy is five bits. This means that we could have had a system that tells us the weather with just three bits, but the system we currently have, our satellite, is using five bits to do the same thing. Ideally we want these numbers to be much closer to each other. This two-bit difference is known as the KL, or Kullback-Leibler, divergence.

This little satellite is actually similar to a model that we train to predict the weather in machine learning as a classification problem. And so, in many classification problems, cross-entropy and KL divergence are often used as loss functions to minimize.

Another loss is the hinge loss, typically used in support vector machines for classification tasks. Minimizing this, we get a boundary that splits the data well and is as far away from every data point as possible; that is, it maximizes the minimum margin from the data points. This loss penalizes data points even if they are correctly labeled, if they lie in this margin. I've made several overly mathematical videos on kernels and SVMs. Check them out if you want to lower your self-esteem.

I'm gonna wrap up this video with a paper discussion. We've taken a rough look at six common losses for classification and regression, but there are far more, some better suited for certain problems. We have a set of points we want to fit a regression line through. Squared loss does it decently well. We add outliers and try to fit it again. Doesn't look too great anymore. So we try the pseudo-Huber loss, and this gives us better results. But I'm not satisfied yet, so let's try some other losses. So we have the Welsch loss: results are trash. Geman-McClure loss: it fits this data better. Now Cauchy loss: this fits the data even better. I like this.

It's nice that I found the loss function I liked, but I found this by trial and error. Is there a way that I could have just used a loss function without trial and error and somehow arrived at the actual minimum that I wanted? Turns out that all these losses that I mentioned can be generalized to this equation by setting different values for alpha, which is a shape parameter. How do we add alpha into the mix, though? Maximum likelihood estimation. We maximize the likelihood of the probability distribution, or minimize the negative log-likelihood, so it becomes an adaptive loss. This technique is typically used to derive losses mathematically, and this actually leads to some interesting results. Here are some examples of images when we let a variational autoencoder determine a loss and generate images. They aren't half bad.

The idea of an adaptive loss sounds amazing. I hope you all have a better idea of loss functions, the differences between them, and a sprinkle of research on adaptive loss functions, so we can avoid trial and error to determine the most appropriate losses. I have resources in the description below. If you like these videos, please subscribe to keep the lights on in my little apartment, and I will see you soon. Bye.
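The generalization described near the end of the transcript, where a single equation with a shape parameter alpha recovers several named losses, can be sketched as follows. The formula is the general robust loss from the adaptive loss paper linked in the description ([1]); the function name `general_robust_loss`, the scale parameter `c` with its default of 1, and the sample residual are assumptions made for illustration, and the limiting cases (alpha = 2, 0, and negative infinity) are handled explicitly because the closed form is undefined there.

```python
import numpy as np

def general_robust_loss(x, alpha, c=1.0):
    """General robust loss of a residual x with shape alpha and scale c.

    Sketch of the formula in arXiv:1701.03077; special-cases the values of
    alpha where the general expression would divide by zero.
    """
    z = (np.asarray(x, dtype=float) / c) ** 2
    if alpha == 2.0:          # limit: scaled squared (L2) loss
        return 0.5 * z
    if alpha == 0.0:          # limit: Cauchy (Lorentzian) loss
        return np.log1p(0.5 * z)
    if alpha == -np.inf:      # limit: Welsch loss
        return 1.0 - np.exp(-0.5 * z)
    b = abs(alpha - 2.0)
    return (b / alpha) * ((z / b + 1.0) ** (alpha / 2.0) - 1.0)

# Hypothetical large residual, to show how alpha controls outlier sensitivity.
residual = 5.0
for name, alpha in [("L2", 2.0), ("pseudo-Huber", 1.0), ("Cauchy", 0.0),
                    ("Geman-McClure", -2.0), ("Welsch", -np.inf)]:
    print(f"{name:14s} alpha={alpha:>5}: {float(general_robust_loss(residual, alpha)):.4f}")
```

Lower values of alpha assign a smaller penalty to the same large residual, which is why they are less sensitive to outliers. This sketch covers only the loss itself: making alpha learnable, as the paper does, also requires the normalizing term of the corresponding probability distribution in the negative log-likelihood, which is not implemented here.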
Original Description
Many animations used in this video came from Jonathan Barron [1, 2]. Give this researcher a like for his hard work!
SUBSCRIBE FOR MORE CONTENT!
RESOURCES
[1] Paper on adaptive loss function: https://arxiv.org/abs/1701.03077
[2] CVPR paper presentation: https://www.youtube.com/watch?v=BmNKbnF69eY
[3] Regression Loss Functions: https://alexisalulema.com/2017/12/07/loss-functions-part-1/
[4] Classification Losses: https://alexisalulema.com/2017/12/07/loss-functions-part-1/
[5] ML cheat sheet for loss functions: https://ml-cheatsheet.readthedocs.io/en/latest/loss_functions.html
[6] 7 loss functions with python code: https://www.analyticsvidhya.com/blog/2019/08/detailed-guide-7-loss-functions-machine-learning-python-code/
[7] A Blog for most common Loss Functions: https://towardsdatascience.com/understanding-the-3-most-common-loss-functions-for-machine-learning-regression-23e0ef3e14d3
[8] Modeling the Huber loss: https://www.textbook.ds100.org/ch/10/modeling_abs_huber.html
[9] Notes on Subgradients: https://see.stanford.edu/materials/lsocoee364b/01-subgradients_notes.pdf
[10] Code to get up to speed: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html
[11] What is the difference between KL divergence and Cross Entropy loss: https://stats.stackexchange.com/questions/357963/what-is-the-difference-cross-entropy-and-kl-divergence
[12] A great video explanation of Entropy, Cross Entropy and KL divergence: https://www.youtube.com/watch?v=ErfnhcEV1O8