How optimization for machine learning works, part 1

Brandon Rohrer · Intermediate · 📐 ML Fundamentals · 7y ago

Key Takeaways

Explains optimization for machine learning using an error function

Full Transcript

Optimization is a fancy word for finding the best way. We can see how it works if we take a look at drinking tea. There's a best temperature for tea. If your tea is too hot, it'll scald your tongue and you won't be able to taste anything for days. If it's lukewarm, it's entirely unsatisfying. And there's this sweet spot in the middle where it's comfortably hot, warming you from the inside out, all the way down your throat and radiating through your belly. This is the ideal temperature for tea.

This happy medium is what we try to find in optimization. That's what Goldilocks was looking for when she tried Papa Bear's bed and found it too hard, tried Mama Bear's bed and found it too soft, and then tried Baby Bear's bed and found it to be just right. Finding how to get things just right turns out to be a very common problem. Mathematicians and computer scientists love it because it's very specific, it's well formulated, you know when you've got it right, and you can compare your solution against others to see who got it right faster.

When a computer scientist tries to find the right temperature for tea, the first thing they do is flip the problem upside down. Instead of trying to maximize tea drinking enjoyment, they try to minimize suffering while drinking tea. The result's the same, and the math works out in the same way. It's not that all computer scientists are pessimists, but just that most optimization problems are naturally described in terms of costs (money, time, resources) rather than benefits. In math, it's convenient to make all your problems look the same before you work out a solution, so you can solve it just the one time. In machine learning, this is often called an error function, because error is the undesirable thing, the suffering, that's being minimized. It can also be called a cost function or a loss function or an energy function, but they all mean pretty much the same thing.

There are a handful of ways to go about finding the best temperature for serving tea. The most obvious is just to look at the curve and pick the lowest point. Unfortunately, we don't actually know what the curve is when we start out. This is implicit in the optimization problem. But we can make use of our original idea and just measure the curve. We can prepare a cup of tea at a given temperature, serve it, and ask our unwitting test subject how they enjoyed it. Then we can repeat this process for every temperature across the whole range that we care about. By the time we're done with this, we do know what the whole curve looks like, and then we can just pick the temperature for which our tea drinker reported the most enjoyment, or the least suffering. This way of finding the best tea temperature is called exhaustive search. It's straightforward and effective, but it might take a while. If our time is limited, it's worth it to check out a few other methods.

If you imagine that our tea suffering curve is actually a physical bowl, then we could easily find the bottom by dropping a marble in and letting it roll until it stops. This is the intuition behind gradient descent: literally, going downhill. To use gradient descent, we start at an arbitrary temperature. Before beginning, we don't know anything about our curve, so we make a random guess, and we brew a cup of tea at that temperature and see how well our tea drinker likes it. From there, the next step is to figure out which direction is downhill and which is up. To figure this out, we choose a direction, again arbitrarily, and we choose a new temperature a very small distance away. Let's say we choose to the left, cooler temperatures. Then we brew up another cup of tea at this slightly lower temperature and see whether or not it's better than the first. We discover that it's actually inferior; our tea drinker likes it less. Now we know that downhill is to the right, that we need to make our next cup warmer to make it better. We take a larger step in the direction of warmer tea, brew up a new cup, and start the process over again.

And then we repeat this. We do it over and over until we get to the very best temperature for tea. The steeper the slope, the larger the step we can take, and we'll know that we're all done when we take a small step away and get the exact same level of enjoyment for our tea drinker. This can only happen at the bottom of the bowl, where it's flat and there is no downhill.

There are lots of gradient descent methods. Most of them are clever ways to measure the slope as efficiently as possible and to get to the bottom of the bowl in as few steps as possible. They're all tricks to brew as few cups of tea as we can get away with. They use different tricks to avoid completely calculating the slope, or to choose a step size that is as large as can be gotten away with, but the underlying intuition is the same.

One of the tricks to find the bottom of the bowl in fewer steps is to use not just the slope but also curvature when deciding how big of a step to take. As the marble starts to roll down the side of the bowl, is the slope getting steeper? If so, then the bottom is probably still far away. Take a big step. Or is the slope getting shallower and starting to bottom out? If so, the bottom is probably getting closer. Take smaller steps. Now, curvature, this slope of the slope, or Hessian, to give it its rightful name, can be very helpful if you're trying to take as few steps as possible. However, it can also be much more expensive to compute. This is a trade-off that comes up a lot in optimization. We end up choosing between the number of steps we have to take and how hard it is to compute where the next step should be.

Like a lot of math problems, the more assumptions you're able to make, the better the solution you can come up with. Unfortunately, when working with real data, a lot of those assumptions don't always apply. There are a lot of ways that this drop-a-marble approach can fail.

If there's more than one valley for a marble to roll into, we might miss the deepest one. Each of these little bowls or valleys is called a local minimum. We are interested in finding the global minimum, the deepest of all the bowls, the lowest of all the local minima. Imagine that we're testing our tea temperatures on a hot day. It may be that once the tea becomes cold enough, it makes a great iced tea, which is even more popular, but we could never find that out by gradient descent alone.

Also, if the error function is not smooth, there are lots of places a marble could get stuck. This could happen if our tea drinker's enjoyment was heavily impacted by passing trains, for instance. The periodic occurrence of trains could introduce a wiggle into our data.

If the error function you're trying to optimize makes discrete jumps, that presents a challenge too. Marbles don't roll down stairs well. This could happen, for example, if our tea drinkers have to rate their enjoyment on a 10-point scale.

If the error function is mostly a plateau but has a bottom that's narrow and deep, then the marble is unlikely to find it. Perhaps our tea drinkers are very finicky and absolutely despise all tea that is anything but perfect.

All of these occur in real machine learning optimization problems. If we suspect that our tea satisfaction curve has any of these tricky characteristics, we can always fall back to exhaustive search. Unfortunately, exhaustive search takes an extremely long time for a lot of problems. But luckily for us, there's a middle ground. There's a set of methods that's tougher than gradient descent. They go by names like genetic algorithms, evolutionary algorithms, and simulated annealing. They take longer to compute and they take more steps, but they don't break nearly so easily. Each has its own quirks, but one characteristic that most of them share is a randomness to their steps and jumps. This helps them discover the deepest valleys of the error function, even when they're harder to find.

Optimization algorithms that rely on gradient descent are kind of like Formula 1 race cars. They're extremely fast and efficient, but they require a very well-behaved track, or error function, to work well. A poorly placed speed bump can wreck it. The more robust methods are like four-wheel drive pickup trucks. They don't go nearly as fast, but they can handle a lot more variability in the terrain. And exhaustive search is like traveling on foot. You can get absolutely anywhere, but it may take you a really long time. They're each invaluable in different situations.

Now that we've talked about how optimization works, click the link below to join me for part two of this series, where we step through an example of optimization in a machine learning model.
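
The exhaustive search and gradient descent procedures described in the transcript can be sketched in a few lines of Python. This is only an illustration, not code from the video: the `suffering` curve (a smooth bowl with its bottom at 65 °C), the temperature range, the probe distance, and the step-size multiplier `rate` are all made-up assumptions. The slope is estimated here from two nearby cups, one slightly cooler and one slightly warmer, which is one of several reasonable ways to do it.

```python
def suffering(temp_c):
    """Hypothetical error function: a smooth bowl, lowest at 65 degrees C."""
    return (temp_c - 65.0) ** 2 / 100.0


def exhaustive_search(error_fn, low, high, step):
    """Brew a cup at every temperature on a grid and keep the best one."""
    n_intervals = int(round((high - low) / step))
    temps = [low + i * step for i in range(n_intervals + 1)]
    best_temp = min(temps, key=error_fn)
    return best_temp, len(temps)  # best temperature, cups brewed


def gradient_descent(error_fn, start, probe=0.01, rate=25.0,
                     tolerance=1e-6, max_steps=1000):
    """Walk downhill, taking larger steps where the slope is steeper."""
    temp = start
    cups = 0
    for _ in range(max_steps):
        # Two cups a small distance apart give an estimate of the slope.
        slope = (error_fn(temp + probe) - error_fn(temp - probe)) / (2 * probe)
        cups += 2
        if abs(slope) < tolerance:
            break  # flat: no downhill left, so this is the bottom of the bowl
        temp -= rate * slope  # step size scales with the steepness
    return temp, cups


grid_temp, grid_cups = exhaustive_search(suffering, 40.0, 90.0, 0.5)
gd_temp, gd_cups = gradient_descent(suffering, start=40.0)
print(f"exhaustive search: {grid_temp:.2f} C after {grid_cups} cups")
print(f"gradient descent:  {gd_temp:.2f} C after {gd_cups} cups")
```

On this well-behaved bowl both approaches land at about the same temperature, and gradient descent brews fewer cups. The value of `rate` matters: too large and the steps overshoot the bottom, too small and the walk takes many more cups.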
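
The idea of using curvature as well as slope can be illustrated with Newton's method in one dimension, a standard technique that divides the slope by the curvature to pick the step size. The transcript does not name a specific method, so treat this as a swapped-in example rather than the approach from the video. The `suffering` curve and the probe distance are again hypothetical, and both the slope and the curvature (the "slope of the slope") are estimated from three nearby cups.

```python
def suffering(temp_c):
    """Hypothetical error function: a smooth bowl, lowest at 65 degrees C."""
    return (temp_c - 65.0) ** 2 / 100.0


def newton_descent(error_fn, start, probe=0.1, tolerance=1e-6, max_steps=50):
    """Use slope and curvature together to choose each step."""
    temp = start
    steps = 0
    for _ in range(max_steps):
        # Three cups: slightly cooler, at the current guess, slightly warmer.
        left = error_fn(temp - probe)
        here = error_fn(temp)
        right = error_fn(temp + probe)
        slope = (right - left) / (2 * probe)
        curvature = (right - 2 * here + left) / probe ** 2
        if abs(slope) < tolerance:
            break  # flat, so we are at the bottom
        if curvature <= 0:
            # Not bowl-shaped here; a Newton step could head uphill.
            raise ValueError("curvature is not positive at this temperature")
        temp -= slope / curvature  # Newton step
        steps += 1
    return temp, steps


newton_temp, newton_steps = newton_descent(suffering, start=40.0)
print(f"Newton's method: {newton_temp:.2f} C after {newton_steps} step(s)")
```

Because this particular bowl is exactly quadratic, a single step lands essentially at the bottom. The price is an extra cup per step to measure curvature, and in problems with many parameters the curvature becomes a full matrix, which is where the extra expense mentioned in the transcript comes from.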
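
The local-minimum failure from the hot-day example can also be reproduced with a small sketch. The two-valley `suffering` curve below is invented for illustration: a deep valley near 5 °C (iced tea) and a shallower one near 65 °C (hot tea). It does not implement the randomized methods named in the transcript; it only contrasts gradient descent with exhaustive search.

```python
import math


def suffering(temp_c):
    """Hypothetical hot-day curve: deep valley at 5 C, shallower one at 65 C."""
    iced = 8.0 * math.exp(-(((temp_c - 5.0) / 10.0) ** 2))
    hot = 5.0 * math.exp(-(((temp_c - 65.0) / 10.0) ** 2))
    return 10.0 - iced - hot


def gradient_descent(error_fn, start, probe=0.01, rate=5.0,
                     tolerance=1e-6, max_steps=1000):
    """Roll downhill from the starting temperature until the curve is flat."""
    temp = start
    for _ in range(max_steps):
        slope = (error_fn(temp + probe) - error_fn(temp - probe)) / (2 * probe)
        if abs(slope) < tolerance:
            break
        temp -= rate * slope
    return temp


def exhaustive_search(error_fn, low, high, step):
    """Try every temperature on a grid and return the best one."""
    n_intervals = int(round((high - low) / step))
    temps = [low + i * step for i in range(n_intervals + 1)]
    return min(temps, key=error_fn)


from_warm = gradient_descent(suffering, start=80.0)  # rolls into the hot-tea valley
from_cool = gradient_descent(suffering, start=20.0)  # rolls into the iced-tea valley
everywhere = exhaustive_search(suffering, 0.0, 100.0, 0.5)
print(f"gradient descent from 80 C: {from_warm:.1f} C")
print(f"gradient descent from 20 C: {from_cool:.1f} C")
print(f"exhaustive search:          {everywhere:.1f} C")
```

Starting warm, the marble settles in the hot-tea valley and never learns that iced tea scores better. Which valley gradient descent finds depends entirely on where it starts, while exhaustive search finds the deeper valley at the cost of many more cups.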

Original Description

Part of the End-to-End Machine Learning School course library at http://e2eml.school
See these concepts used in an End to End Machine Learning project: https://end-to-end-machine-learning.teachable.com/p/polynomial-regression-optimization/
Watch the rest of the How Optimization Works series: https://end-to-end-machine-learning.teachable.com/p/building-blocks-how-optimization-works/
