Reconciling modern machine learning and the bias-variance trade-off

Yannic Kilcher · Advanced · 📐 ML Fundamentals · 6y ago

Skills: ML Maths Basics 90% · Supervised Learning 80% · ML Pipelines 70% · Unsupervised Learning 60%

Key Takeaways

The video discusses the bias-variance trade-off in machine learning, challenging the classic view of generalization and overfitting, and explores how overparameterized functions can lead to increased smoothness and improved generalization performance. It covers topics such as random Fourier features, kernel machines, and the interpolation threshold.

Full Transcript

Hi there. Today we're looking at "Reconciling modern machine learning and the bias-variance trade-off" by Mikhail Belkin et al. This paper struck me as interesting at ICML when I heard a talk by Mikhail Belkin, and it is very interesting in terms of what it proposes about modern machine learning.

So what's the problem? They contrast what they call classical machine learning, and how to understand machine learning in terms of bias-variance trade-offs, with modern machine learning, for example deep neural networks, which have very different properties. The best way to describe it is probably with an example.

Let's say we have four data points. Here is a coordinate system in two dimensions: one, two, three, four data points. We want to fit a function from X to Y, where Y is our target, so it's a kind of regression problem. Let's say we have just one parameter with which to describe our function. Probably the best thing we could do is something like this: a line, where the only parameter is the slope of that line. So our model would be this one line, and it would pass basically through the data and describe the data fairly well, as you can see.

If we have two parameters, we can introduce, for example, a bias term and not have the line pass through the origin. So for this line here, we now have the bias, which is the distance to this point, as well as the slope of the line as parameters. Two parameters. And if you look at this line, it describes the data a bit better than before; it passes through the center of the data.

Now if we go to three or four parameters, let's go to four parameters: it's well known that if I have the same number of parameters as I have data points, I can actually fit the data perfectly. How to do this? It will be like an order-four polynomial. Let's see if I can draw an order-four polynomial. It needs to go... okay, well, no, that's more than order four. In any case, I can actually fit the data perfectly.

Now let's contrast all of these functions. What is the data distribution, probably? If I fill in the rest of the data that is not in our training set, it may be something like this. So which of these functions generalizes well to this general data, the unseen data? The first function is actually doing okay, and the second function is doing even better, as we saw. So if we add a parameter to the first function, it gets better, but if we then add more parameters, it gets worse. This is taught in current machine learning classes as the phenomenon of overfitting: here, the function that has the most parameters actually doesn't fit well.

What is troubling now is that if you think of things like neural networks, modern architectures, they oftentimes have even more parameters than there are data points in the data set. So they can fit the training data perfectly and still have spare room, spare capacity, and these models actually generalize fairly well. So this paper asks what's going on here, and what they propose is the following picture.

Here we have the classical view of machine learning. On the x-axis is the complexity of H, where H is the model class, the class of all the models you could fit. For example, if it were every linear model with one parameter, which was our first model, it would be somewhere here, where the complexity is one. Then here we'd have a complexity of two, where we added a parameter, then three parameters and four parameters. And this is what we saw at the beginning. With one parameter we had some training risk (risk here is simply another term for loss), so some training loss. Then as we added a parameter, the training loss decreased, and the test loss on the unseen data also decreased, so it got better on the test data as well. But then as we added more parameters, the model was able to fit the training data better and better, going to almost zero risk here, while on unseen data the performance actually got worse again. Again, this is what we teach as overfitting.

These authors propose that this picture is incomplete; namely, the picture actually looks like this, and all we've done so far is look at the left-hand side. There is a peak here, at what is called the interpolation threshold, and the interpolation threshold is roughly at the point where you have as many parameters as you have data points. After the interpolation threshold, if you give even more parameters, the training risk of course stays low, because you can fit the training data perfectly from the interpolation threshold onward, but the test risk actually decreases again. This is really interesting. Let me just preempt this and say it is not due to regularization. It's not because people regularize their models or anything like this. In any case, regularization would actually move you toward a lower complexity of your model class, because if you regularize, you're no longer able to fit certain models as easily or converge to them.

So they propose that this is happening, they give some reasons why it might be happening, and they give some evidence that it is happening. Here is the evidence. For example, this is a random Fourier features classifier. What are random Fourier features? They describe them here. If you have a data point x, you push it through a function, or rather through many of them: you sample capital N of these vectors v, and for each vector v you take the inner product with x and take the exponential function of it, and then you aggregate them. These are the random Fourier features, and these are the weights that you learn. So this is basically a linear classifier, not of the original features but of intermediary features, which are fixed for a given random seed. The good thing is that you can decide how many intermediary features you want. The other good thing is that if you let N go to infinity, this actually becomes an infinite-dimensional kernel machine: a kernel SVM with a Gaussian kernel, which operates in an infinite-dimensional space. If you don't go as far, it's just an approximation to that. So it's a cool model where you can choose how many parameters you want, which makes it a perfect model to explore this phenomenon.

So what are they doing? They take MNIST and apply this model. On the x-axis is the number of random Fourier features that they construct, and here you can see the mean squared error on the test set. As you can see, at the beginning the error goes down, as proposed. Here is probably the sweet spot of classical machine learning. After that you start to overfit, the error goes up again, and there's a giant peak. And then it goes down again. As you saw, the peak here is at 10,000. I think they do it with a subset of MNIST, if I remember correctly, and around 10,000 is exactly the number of data points they use, or that multiplied by the number of classes; I don't remember correctly. In any case, at this number you have roughly the same number of parameters as data points, and after that the test error decreases again. As you give more and more features, every single classifier on this line is able to fit the training data perfectly, but they successively get less and less error on the test set.

You can see it approaches this dotted line here, which is what you get if you perfectly solve the infinite-dimensional problem, that is, if you actually use a kernel SVM to solve this problem. You can see this gives you a lower bound. So this shows really nicely that the random Fourier features classifier approximates the kernel SVM as you go higher and higher with capital N. It is really interesting that this actually happens in practice.

What they also see is this, when they look at the norm of the solution. Ideally they want to use the norm in the Hilbert space, but they can't because it's hard to compute, so a proxy for this is simply the norm of the weight vector that you learn. As you add more parameters, the norm of the solution of course first goes up, because you have more parameters, you fit each of them, and they each have some value. It goes up and peaks at the interpolation threshold, where you have a really high-norm solution. After that, the norm of the solution goes down again, and again it approaches the norm of the perfectly solved kernel machine.

That's extremely interesting, and it is part of the explanation they give for why this is happening, namely the following. If you have too many parameters, what you might do, with the correct inductive bias, is find a low-norm solution. And what does a low-norm solution mean? A low-norm solution means a relatively simple function. So as you add parameters, your model is better and better able to find a simple function that describes the training data: simple not in terms of having fewer parameters, but simple in terms of how it moves between the training points.

Imagine the training data from before again, and imagine the polynomial we drew before that fits it perfectly. Whereas if I have many, many more parameters, I can do something like this: here I grab this point, here I grab this one, something like this. It moves smoothly between the training data. It has many parameters, because there are many, many squiggles here, but it's a low-norm solution. The low norm will cause the solution to be smooth, whereas a high-norm solution that perfectly interpolates the training data would look something like this.

So the authors say that if your inductive bias is able to find a low-norm solution that perfectly fits the training data, then that solution will generalize well. And it turns out that modern architectures tend to find low-norm solutions if you train them, for example, with SGD. So the combination of many parameters and low-norm solutions will give you a smooth function, and the smoothness of the function is the thing that generalizes to unseen data, because the smoothness ensures that everything in between the data points will be nicely interpolated.

All right, so that's the perspective. They go on from these random Fourier features to neural networks. What they do here is train a neural network on MNIST with one hidden layer, so there are two weight layers. They increase the number of hidden nodes in the hidden layer, and as they increase this, the training and test error go down. The training error continues to go down, the test error goes up until the interpolation threshold, and then the test error drops again while the training error continues to be almost zero. They do the same thing with decision trees and random forests and show the exact same thing: there is this interpolation threshold after which the test error drops even though the training error is almost zero.

To me this is really remarkable. In the appendix they have many more experiments where they show this phenomenon happening on different data sets and with different architectures, here random ReLU features and so on. It gives a new perspective on generalization and why our models generalize so well.

They finally conclude with why this has not been seen yet, and they give some nice reasons. For example, take models where you can choose the complexity, such as random Fourier features: these were originally proposed as an approximation to kernel machines for when you have too many data points and don't want to compute as many features, so they're basically only ever used in the regime where the classical paradigm holds. Neural networks, on the other hand, are often simply made super large. They say the peak that they show is very localized. If you increase the size of your neural network, maybe you try one at this size, this size, this size, and this size, and all you then see is a downward trajectory. You miss the peak, which leads to the impression that bigger neural networks simply perform better.

So I found this interesting, and I hope you did as well. Definitely check out more of this group's work. That was it for now. Have a nice day.
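The four-point example from the start of the transcript can be sketched in a few lines of NumPy. This is only an illustration: the data values, the assumed "true" trend `0.5 + x`, and the helper name `design` are all made up here and do not come from the video or the paper. The four-parameter model is written as a cubic, since a cubic has exactly four coefficients.

```python
import numpy as np

# Hypothetical training set: four noisy samples of an assumed trend y = 0.5 + x.
x_train = np.array([0.5, 1.5, 2.5, 3.5])
y_train = np.array([1.4, 1.7, 3.4, 3.8])

# Unseen data: a dense grid on the same assumed trend (noise-free for simplicity).
x_test = np.linspace(0.0, 4.0, 41)
y_test = 0.5 + x_test


def design(x, powers):
    """Design matrix with one column x**p per power p (one coefficient each)."""
    return np.stack([x ** p for p in powers], axis=1)


# Each model is described by the powers of x it is allowed to use.
models = {
    "slope only (1 parameter)": [1],
    "slope + bias (2 parameters)": [0, 1],
    "cubic (4 parameters)": [0, 1, 2, 3],
}

results = {}
for name, powers in models.items():
    # Ordinary least squares fit on the four training points.
    coef, *_ = np.linalg.lstsq(design(x_train, powers), y_train, rcond=None)
    train_mse = np.mean((design(x_train, powers) @ coef - y_train) ** 2)
    test_mse = np.mean((design(x_test, powers) @ coef - y_test) ** 2)
    results[name] = (train_mse, test_mse)
    print(f"{name:28s} train MSE = {train_mse:.4f}   test MSE = {test_mse:.4f}")
```

With these made-up numbers, the training error shrinks as parameters are added and is essentially zero for the cubic, while the test error should be lowest for the two-parameter line and highest for the cubic. That is the classical left-hand side of the curve described in the video.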

Original Description

It turns out that the classic view of generalization and overfitting is incomplete! If you add parameters beyond the number of points in your dataset, generalization performance might increase again due to the increased smoothness of overparameterized functions.

Abstract: The question of generalization in machine learning---how algorithms are able to learn predictors from a training sample to make accurate predictions out-of-sample---is revisited in light of the recent breakthroughs in modern machine learning technology. The classical approach to understanding generalization is based on bias-variance trade-offs, where model complexity is carefully calibrated so that the fit on the training sample reflects performance out-of-sample. However, it is now common practice to fit highly complex models like deep neural networks to data with (nearly) zero training error, and yet these interpolating predictors are observed to have good out-of-sample accuracy even for noisy data. How can the classical understanding of generalization be reconciled with these observations from modern machine learning practice? In this paper, we bridge the two regimes by exhibiting a new "double descent" risk curve that extends the traditional U-shaped bias-variance curve beyond the point of interpolation. Specifically, the curve shows that as soon as the model complexity is high enough to achieve interpolation on the training sample---a point that we call the "interpolation threshold"---the risk of suitably chosen interpolating predictors from these models can, in fact, be decreasing as the model complexity increases, often below the risk achieved using non-interpolating models. The double descent risk curve is demonstrated for a broad range of models, including neural networks and random forests, and a mechanism for producing this behavior is posited.

Authors: Mikhail Belkin, Daniel Hsu, Siyuan Ma, Soumik Mandal

https://arxiv.org/abs/1812.11118
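For readers who want the random Fourier features model from the video in symbols, here is a rough formalization. The notation is an assumption on my part (the training pairs $(x_i, y_i)$, the sample size $n$, and the standard normal sampling of the $v_k$ are how I recall the setup, not something stated in the text above), so treat it as a sketch rather than the paper's exact statement.

```latex
% Model class with N random features; the vectors v_k are sampled once and then fixed.
h(x) = \sum_{k=1}^{N} a_k \, \phi(x; v_k),
\qquad
\phi(x; v) = e^{\sqrt{-1}\,\langle v, x \rangle},
\qquad
v_k \sim \mathcal{N}(0, I)

% Learned weights: least squares on the n training pairs (x_i, y_i).
\hat{a} \in \arg\min_{a} \; \frac{1}{n} \sum_{i=1}^{n} \bigl( h(x_i) - y_i \bigr)^2

% Past the interpolation threshold (N > n) many minimizers reach zero training
% risk; the low-norm choice discussed in the video picks the smallest one.
\hat{a} = \arg\min_{a} \; \lVert a \rVert_2
\quad \text{subject to} \quad
h(x_i) = y_i \ \text{ for } i = 1, \dots, n
```

The norm $\lVert \hat{a} \rVert_2$ in the last line is the "norm of the weight vector" that the video describes as a proxy for the Hilbert space norm.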



The video challenges the classic view of generalization and overfitting, and explores how overparameterized functions can lead to increased smoothness and improved generalization performance. It covers topics such as random Fourier features, kernel machines, and the interpolation threshold. By watching this video, learners can gain a deeper understanding of the bias-variance trade-off and how to apply overparameterization to improve generalization.

Key Takeaways

Understand the bias-variance trade-off and its implications for machine learning
Apply random Fourier features to create a linear classifier with adjustable complexity
Visualize the bias-variance trade-off using a mean squared error plot
Train models with overparameterization to improve generalization
Evaluate the effect of overparameterization on generalization using the interpolation threshold
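The takeaways above can be tried out on a small scale. The sketch below is not the paper's experiment: it swaps MNIST for a synthetic regression problem, uses the real-valued cosine variant of random Fourier features instead of the complex exponential, and all names (`target`, `features`, `widths`) and constants are hypothetical choices made for illustration. It fits the minimum-norm least-squares solution for several feature counts and records training error, test error, and weight norm.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the data (assumption): noisy regression in 3 dimensions.
n_train, n_test, dim = 40, 500, 3


def target(x):
    """Assumed ground-truth function, chosen only for illustration."""
    return np.sin(x.sum(axis=1))


x_train = rng.normal(size=(n_train, dim))
y_train = target(x_train) + 0.1 * rng.normal(size=n_train)
x_test = rng.normal(size=(n_test, dim))
y_test = target(x_test)

# Feature counts to sweep; the interpolation threshold is at N = n_train = 40.
widths = [5, 10, 20, 40, 80, 400, 2000]
max_features = widths[-1]

# Sample the largest feature set once; smaller models reuse its first N columns.
freqs = rng.normal(size=(dim, max_features))           # random vectors v_k
phases = rng.uniform(0.0, 2.0 * np.pi, max_features)   # random phases b_k


def features(x, n_features):
    """Real-valued random Fourier features cos(<v_k, x> + b_k) for k < n_features."""
    return np.cos(x @ freqs[:, :n_features] + phases[:n_features])


train_mse, test_mse, weight_norm = {}, {}, {}
for n_features in widths:
    phi_train = features(x_train, n_features)
    # lstsq gives the least-squares fit; when the system is underdetermined
    # (N > n_train) it returns the minimum-norm interpolating solution.
    weights, *_ = np.linalg.lstsq(phi_train, y_train, rcond=None)
    train_mse[n_features] = np.mean((phi_train @ weights - y_train) ** 2)
    test_mse[n_features] = np.mean((features(x_test, n_features) @ weights - y_test) ** 2)
    weight_norm[n_features] = np.linalg.norm(weights)
    print(f"N = {n_features:5d}  train MSE = {train_mse[n_features]:.2e}  "
          f"test MSE = {test_mse[n_features]:.2e}  ||a|| = {weight_norm[n_features]:.2e}")
```

From N = 40 onward the training error should be numerically zero, and because the feature sets are nested, the weight norm cannot grow as more features are added past the threshold. Whether the test error shows a clear peak near N = 40 depends on the seed and noise level, so plotting `test_mse` against `widths` on a log-scaled x-axis is the easiest way to check.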

💡 Overparameterized functions can lead to increased smoothness and improved generalization performance, challenging the classic view of generalization and overfitting.
