Loss Functions - EXPLAINED!

CodeEmporium · Beginner · Research Papers Explained · 6y ago

Key Takeaways

The video explains various loss functions in machine learning, including squared loss, absolute loss, pseudo-Huber loss, cross-entropy loss, and hinge loss, with a discussion on their pros and cons, and introduces the concept of adaptive loss functions.
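As a rough companion to the losses named above, here is a minimal per-sample sketch in plain Python. The function names and the `delta` parameter are illustrative choices, not something taken from the video. The pseudo-Huber form used is the common `delta^2 * (sqrt(1 + (r/delta)^2) - 1)`, and the hinge loss assumes labels in {-1, +1}. Cross-entropy is left out here because it compares whole distributions rather than a single residual.

```python
import math


def squared_loss(y_true, y_pred):
    """L2 loss: grows quadratically, so large residuals dominate."""
    return (y_true - y_pred) ** 2


def absolute_loss(y_true, y_pred):
    """L1 loss: grows linearly, so outliers count like any other point."""
    return abs(y_true - y_pred)


def pseudo_huber_loss(y_true, y_pred, delta=1.0):
    """Roughly quadratic for small residuals, roughly linear for large ones.

    delta is the extra hyperparameter that sets where the switch happens
    (assumed default of 1.0 for illustration).
    """
    r = y_true - y_pred
    return delta ** 2 * (math.sqrt(1.0 + (r / delta) ** 2) - 1.0)


def hinge_loss(label, score):
    """Hinge loss for a label in {-1, +1} and a raw classifier score.

    Correctly classified points still pay a penalty when label * score < 1,
    that is, when they fall inside the margin.
    """
    return max(0.0, 1.0 - label * score)


# Compare how each regression loss reacts to a small and a large residual.
for residual in (0.1, 100.0):
    print(
        residual,
        squared_loss(residual, 0.0),
        absolute_loss(residual, 0.0),
        pseudo_huber_loss(residual, 0.0),
    )
```

For the large residual the squared loss is orders of magnitude bigger than the other two, which is the outlier sensitivity the transcript describes.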

Full Transcript

What is the best loss function? The age-old question in machine learning. Will we solve this problem today? Nope. But we will talk about some loss functions, their pros and cons, and even discuss a recent paper on adaptive loss functions, so that would mean that you don't need to keep trying out different losses to find the best one that suits your needs. Fun stuff.

OK, first let's focus on regression. I've got this data set here. It looks like a line can fit this data. I'll train a linear regression model on it, and I choose to use the squared loss, the L2 loss, and minimize it. My curve looks something like this. OK, looks pretty good, and it's also pretty simple. But if I introduce some outliers in this data, my model responds by freaking the hell out and trying to fit those data points better. This happens because that square term scales up the errors from these outliers, so the model really wants to get these obscure points right.

I'll just change the loss function to the absolute difference. My model now treats the outliers like any other data point, so it won't go out of its way for outliers if it means compromising the rest of the model. This might lead to poor predictions from time to time, but if you really don't care about the extreme cases, this will do. Support vector regression uses this, by the way.

The advantage of the squared error is the ease with which we can compute the gradient for machine learning during gradient descent. This gradient is not as simple in the absolute error case because of the points of discontinuity. The mean absolute error isn't optimized through gradient descent; it's optimized by computing subgradients instead. It adds a bit more complexity, and I'll add some reading material in the description down below.

We've got two losses: one that loves outliers and another that ignores them. If you think that one doesn't work, you'll just use the other, and that might be fine in most cases. But consider this: our data is about, like, 70% in one direction and 30% in the other direction. Technically this data does not have any outliers, but our absolute loss may treat the 30% of the data as outliers and ignore it altogether, while the squared loss will try to capture those 30%. Both decisions can lead to poor model performance.

How do we compromise? We can do so by using the pseudo-Huber loss. This is the best of both losses: if a data point has a relatively low error, we take the squared loss; if the data point is an outlier, we take the absolute loss. The result is that it reduces the effect of outliers on the model while still being differentiable, and as such it's slightly more complex. The main problem here is that we have an extra hyperparameter to play with. These are the most popular regression-based losses that you see in built-in regressors.

Now for classification losses. In classification, our output is obviously the class, but more precisely it's the list of probabilities of belonging to different classes, and we just choose the class with the highest probability, 'cause duh. This list is a probability distribution. We compare this to the ground truth, and how we compare it depends on the loss we use.

So, cross-entropy loss. Entropy has its roots in information theory, so I'll explain it from that perspective. Say that there's this weather station, and it sends you a weather forecast at the beginning of each day, and it tells you what weather it is on that day using some n bits of information. In its best case, say this information can be packed in as low as 3 bits on average: 2 bits for a sunny day, 4 bits for a rainy day, 3 bits for a partly cloudy day, and so on. The entropy of a distribution is the average number of bits required to convey a piece of information, like today's weather in this case. So the entropy in this example is 3 bits.

But the tower isn't perfect. It's designed by engineers who have flaws themselves. There is some wastage, and it is found that the tower actually sends you 5 bits on average. This is cross-entropy: we are comparing the true average and the satellite's current average. Entropy is 3 bits, but cross-entropy is 5 bits. This means that we could have had a system that tells us the weather with just 3 bits, but we have a system currently, that is, our satellite, that is using 5 bits to do the same thing. Ideally we want these numbers to be much closer to each other. This 2-bit difference is known as the KL, or Kullback-Leibler, divergence. This little satellite is actually similar to a model that we train to predict the weather in machine learning as a classification problem, and so in many classification problems cross-entropy and KL divergence are often used as loss functions to minimize.

Another loss is the hinge loss, typically used in support vector machines for classification tasks. Minimizing this, we get a boundary that splits the data well and is as far away from every data point as possible; that is, it maximizes the minimum margin from the data points. This loss penalizes data points even if they are correctly labeled, if they lie in this margin. I've made several overly mathematical videos on kernels and SVMs. Check them out if you want to lower your self-esteem.

I'm gonna wrap up this video with a paper discussion. We've taken a rough look at six common losses for classification and regression, but there are far more, some better suited for certain problems. We have a set of points we want to fit a regression line through. Squared loss does it decently well. We add outliers and try to fit it again: doesn't look too great anymore. So we try the pseudo-Huber loss, and this gives us better results, but I'm not satisfied yet, so let's try some other losses. So we have the Welsch loss: results are trash. Geman-McClure loss: it fits this data better. Now Cauchy's loss: this fits the data even better. I like this.

It's nice that I found the loss function I liked, but I found this by trial and error. Is there a way that I could have just used a loss function without trial and error and somehow arrived at the actual minimum that I wanted? Turns out that all these losses that I mentioned can be generalized to this equation by setting different values for alpha, which is a shape parameter. How do we add alpha into the mix, though? Maximum likelihood estimation: we maximize the likelihood of the probability distribution, or minimize the negative log-likelihood, so it becomes an adaptive loss. This technique is typically used to derive losses mathematically, and this actually leads to some interesting results. Here are some examples of images when we let a variational autoencoder determine a loss and generate images. They aren't half bad. The idea of an adaptive loss sounds like an amazing idea.

Hope you all have a better idea behind loss functions, the differences between them, and a sprinkle of research on adaptive loss functions, so we can avoid trial and error to determine the most appropriate losses. I have resources in the description below. If you like these videos, please subscribe to keep the lights on in my little apartment, and I will see you soon. Bye!
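The weather-station explanation in the transcript boils down to one identity: cross-entropy equals entropy plus KL divergence. The sketch below checks that numerically in plain Python. The two distributions are made up for illustration and do not reproduce the 3-bit and 5-bit figures quoted in the video; the function names are likewise assumptions.

```python
import math


def entropy_bits(p):
    """Average bits per message under the ideal code for distribution p."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)


def cross_entropy_bits(p, q):
    """Average bits used when outcomes follow p but the code was built for q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)


def kl_divergence_bits(p, q):
    """Extra bits wasted per message by coding for q instead of p."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical true weather distribution (sunny, cloudy, rainy, snowy).
true_dist = [0.5, 0.25, 0.125, 0.125]
# Hypothetical model that wrongly assumes all four are equally likely.
model_dist = [0.25, 0.25, 0.25, 0.25]

h = entropy_bits(true_dist)
ce = cross_entropy_bits(true_dist, model_dist)
kl = kl_divergence_bits(true_dist, model_dist)

print("entropy:", h)
print("cross-entropy:", ce)
print("KL divergence:", kl)
```

With these made-up numbers the entropy comes to 1.75 bits, the cross-entropy to 2 bits, and the gap of 0.25 bits is the KL divergence. Because the entropy of the true distribution does not depend on the model, minimizing cross-entropy and minimizing KL divergence move the model in the same direction.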

Original Description

Many animations used in this video came from Jonathan Barron [1, 2]. Give this researcher a like for his hard work! SUBSCRIBE FOR MORE CONTENT!

RESOURCES
[1] Paper on adaptive loss function: https://arxiv.org/abs/1701.03077
[2] CVPR paper presentation: https://www.youtube.com/watch?v=BmNKbnF69eY
[3] Regression Loss Functions: https://alexisalulema.com/2017/12/07/loss-functions-part-1/
[4] Classification Losses: https://alexisalulema.com/2017/12/07/loss-functions-part-1/
[5] ML cheat sheet for loss functions: https://ml-cheatsheet.readthedocs.io/en/latest/loss_functions.html
[6] 7 loss functions with python code: https://www.analyticsvidhya.com/blog/2019/08/detailed-guide-7-loss-functions-machine-learning-python-code/
[7] A Blog for most common Loss Functions: https://towardsdatascience.com/understanding-the-3-most-common-loss-functions-for-machine-learning-regression-23e0ef3e14d3
[8] Modeling the Huber loss: https://www.textbook.ds100.org/ch/10/modeling_abs_huber.html
[9] Notes on Subgradients: https://see.stanford.edu/materials/lsocoee364b/01-subgradients_notes.pdf
[10] Code to get up to speed: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html
[11] What is the difference between KL divergence and Cross Entropy loss: https://stats.stackexchange.com/questions/357963/what-is-the-difference-cross-entropy-and-kl-divergence
[12] A great video explanation of Entropy, Cross Entropy and KL divergence: https://www.youtube.com/watch?v=ErfnhcEV1O8

This video teaches the basics of loss functions in machine learning, including their types, pros, and cons, and introduces adaptive loss functions as a way to avoid trial and error in determining the most suitable loss function.

Key Takeaways
  1. Choose a loss function for a regression problem
  2. Implement a regression model with a chosen loss function
  3. Evaluate the model's performance
  4. Choose a loss function for a classification problem
  5. Implement a classification model with a chosen loss function
  6. Evaluate the model's performance
  7. Consider using adaptive loss functions to avoid trial and error
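
The first three steps in the list above can be sketched on the simplest possible regression model, one that predicts a single constant. For that model the squared loss is minimized by the mean and the absolute loss by the median, so no optimizer is needed. The data, the function names, and the choice to evaluate on the inlier points only are all assumptions made for this illustration.

```python
import statistics


def fit_constant(targets, loss):
    """Return the constant prediction that minimizes the chosen total loss."""
    if loss == "squared":
        return statistics.mean(targets)    # minimizer of the sum of squared errors
    if loss == "absolute":
        return statistics.median(targets)  # minimizer of the sum of absolute errors
    raise ValueError(f"unknown loss: {loss}")


def mean_absolute_error(targets, prediction):
    """Average absolute distance between the targets and one prediction."""
    return sum(abs(t - prediction) for t in targets) / len(targets)


# Hypothetical data: four well-behaved points plus one outlier.
inliers = [1.0, 2.0, 3.0, 4.0]
data = inliers + [100.0]

for loss in ("squared", "absolute"):
    prediction = fit_constant(data, loss)
    error_on_inliers = mean_absolute_error(inliers, prediction)
    print(loss, prediction, error_on_inliers)
```

On this toy data the squared-loss fit is dragged to 22.0 by the single outlier, while the absolute-loss fit stays at 3.0, which matches the behavior described in the transcript.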
💡 Adaptive loss functions can be used to avoid trial and error in determining the most suitable loss function for a problem.
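
To make the shape parameter concrete, here is a sketch of the general loss as I understand it from the paper linked in the description [1], written in plain Python. The function name and the default scale `c=1.0` are illustrative assumptions. Only the fixed-alpha loss is shown; the adaptive part, where alpha is learned by minimizing a negative log-likelihood, is not implemented here.

```python
import math


def general_loss(x, alpha, c=1.0):
    """General robust loss for residual x, shape alpha, and scale c.

    alpha = 2 gives the squared (L2) loss, alpha = 1 a pseudo-Huber style loss,
    alpha = 0 the Cauchy loss, alpha = -2 the Geman-McClure loss, and
    alpha = -inf the Welsch loss. The three singular cases are handled
    explicitly as limits of the general formula.
    """
    z = (x / c) ** 2
    if alpha == 2.0:
        return 0.5 * z
    if alpha == 0.0:
        return math.log(0.5 * z + 1.0)
    if alpha == float("-inf"):
        return 1.0 - math.exp(-0.5 * z)
    b = abs(alpha - 2.0)
    return (b / alpha) * ((z / b + 1.0) ** (alpha / 2.0) - 1.0)


# Lower alpha means a large residual is penalized less and less.
for alpha in (2.0, 1.0, 0.0, -2.0, float("-inf")):
    print(alpha, general_loss(10.0, alpha))
```

Sweeping alpha from 2 down to negative infinity moves smoothly from a loss that chases outliers to one that almost ignores them, which is why treating alpha as a learnable parameter can replace manual trial and error.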
