How optimization for machine learning works, part 1

Brandon Rohrer · Intermediate · 📐 ML Fundamentals · 7y ago

Key Takeaways

This video introduces the basics of optimization for machine learning, including cost functions, exhaustive search, gradient descent, and more robust methods that add randomness to their steps.

Full Transcript

Optimization is a fancy word for finding the best way. We can see how it works if we take a look at drinking tea. There's a best temperature for tea. If your tea is too hot, it'll scald your tongue and you won't be able to taste anything for days. If it's lukewarm, it's entirely unsatisfying. And there's this sweet spot in the middle where it's comfortably hot, warming you from the inside out, all the way down your throat and radiating through your belly. This is the ideal temperature for tea.

This happy medium is what we try to find in optimization. That's what Goldilocks was looking for when she tried Papa Bear's bed and found it too hard, tried Mama Bear's bed and found it too soft, and then tried Baby Bear's bed and found it to be just right. Finding how to get things just right turns out to be a very common problem. Mathematicians and computer scientists love it because it's very specific, it's well formulated, you know when you've got it right, and you can compare your solution against others to see who got it right faster.

When a computer scientist tries to find the right temperature for tea, the first thing they do is flip the problem upside down. Instead of trying to maximize tea drinking enjoyment, they try to minimize suffering while drinking tea. The result's the same, and the math works out in the same way. It's not that all computer scientists are pessimists, but just that most optimization problems are naturally described in terms of costs (money, time, resources) rather than benefits. In math, it's convenient to make all your problems look the same before you work out a solution, so you can solve it just the one time. In machine learning, this is often called an error function, because error is the undesirable thing, the suffering, that's being minimized. It can also be called a cost function, or a loss function, or an energy function, but they all mean pretty much the same thing.

There are a handful of ways to go about finding the best temperature for serving tea. The most obvious is just to
look at the curve and pick the lowest point. Unfortunately, we don't actually know what the curve is when we start out. This is implicit in the optimization problem. But we can make use of our original idea and just measure the curve. We can prepare a cup of tea at a given temperature, serve it, and ask our unwitting test subject how they enjoyed it. Then we can repeat this process for every temperature across the whole range that we care about. By the time we're done with this, we do know what the whole curve looks like, and then we can just pick the temperature for which our tea drinker reported the most enjoyment, or the least suffering. This way of finding the best tea temperature is called exhaustive search. It's straightforward and effective, but it might take a while. If our time is limited, it's worth it to check out a few other methods.

If you imagine that our tea suffering curve is actually a physical bowl, then we could easily find the bottom by dropping a marble in and letting it roll until it stops. This is the intuition behind gradient descent: literally, going downhill. To use gradient descent, we start at an arbitrary temperature. Before beginning, we don't know anything about our curve, so we make a random guess, and we brew a cup of tea at that temperature and see how well our tea drinker likes it. From there, the next step is to figure out which direction is downhill and which is up. To figure this out, we choose a direction, again arbitrarily, and we choose a new temperature a very small distance away. Let's say we choose to the left, cooler temperatures. Then we brew up another cup of tea at this slightly lower temperature and see whether or not it's better than the first. We discover that it's actually inferior; our tea drinker likes it less. Now we know that downhill is to the right, that we need to make our next cup warmer to make it better. We take a larger step in the direction of warmer tea, brew up a new cup, and start the process over again. And then we repeat this. We do it over and
over until we get to the very best temperature for tea. The steeper the slope, the larger the step we can take. And we'll know that we're all done when we take a small step away and get the exact same level of enjoyment for our tea drinker. This can only happen at the bottom of the bowl, where it's flat and there is no downhill.

There are lots of gradient descent methods. Most of them are clever ways to measure the slope as efficiently as possible and to get to the bottom of the bowl in as few steps as possible. They're all tricks to brew as few cups of tea as we can get away with. They use different tricks to avoid completely calculating the slope, or to choose a step size that is as large as can be gotten away with, but the underlying intuition is the same.

One of the tricks to find the bottom of the bowl in fewer steps is to use not just the slope, but also curvature, when deciding how big of a step to take. As the marble starts to roll down the side of the bowl, is the slope getting steeper? If so, then the bottom is probably still far away. Take a big step. Or is the slope getting shallower and starting to bottom out? If so, the bottom is probably getting closer. Take smaller steps. Now curvature, this slope of the slope, or Hessian, to give it its rightful name, can be very helpful if you're trying to take as few steps as possible. However, it can also be much more expensive to compute. This is a trade-off that comes up a lot in optimization. We end up choosing between the number of steps we have to take and how hard it is to compute where the next step should be.

Like a lot of math problems, the more assumptions you're able to make, the better the solution you can come up with. Unfortunately, when working with real data, a lot of those assumptions don't always apply. There are a lot of ways that this drop-a-marble approach can fail. If there's more than one valley for a marble to roll into, we might miss the deepest one. Each of these little bowls or valleys is called a local minimum. We are interested
in finding the global minimum, the deepest of all the bowls, the lowest of all the local minima. Imagine that we're testing our tea temperatures on a hot day. It may be that once the tea becomes cold enough, it makes a great iced tea, which is even more popular. But we could never find that out by gradient descent alone.

Also, if the error function is not smooth, there are lots of places a marble could get stuck. This could happen if our tea drinker's enjoyment was heavily impacted by passing trains, for instance. The periodic occurrence of trains could introduce a wiggle into our data. If the error function you're trying to optimize makes discrete jumps, that presents a challenge too. Marbles don't roll down stairs well. This could happen, for example, if our tea drinkers have to rate their enjoyment on a 10-point scale. If the error function is mostly a plateau but has a bottom that's narrow and deep, then the marble is unlikely to find it. Perhaps our tea drinkers are very finicky and absolutely despise all tea that is anything but perfect.

All of these occur in real machine learning optimization problems. If we suspect that our tea satisfaction curve has any of these tricky characteristics, we can always fall back to exhaustive search. Unfortunately, exhaustive search takes an extremely long time for a lot of problems. But luckily for us, there's a middle ground. There's a set of methods that's tougher than gradient descent. They go by names like genetic algorithms, evolutionary algorithms, and simulated annealing. They take longer to compute and they take more steps, but they don't break nearly so easily. Each has its own quirks, but one characteristic that most of them share is a randomness to their steps and jumps. This helps them discover the deepest valleys of the error function, even when they're harder to find.

Optimization algorithms that rely on gradient descent are kind of like Formula 1 race cars. They're extremely fast and efficient, but they require a very well-behaved track to work well, or error
function to work well. A poorly placed speed bump can wreck it. The more robust methods are like four-wheel-drive pickup trucks. They don't go nearly as fast, but they can handle a lot more variability in the terrain. And exhaustive search is like traveling on foot. You can get absolutely anywhere, but it may take you a really long time. They're each invaluable in different situations.

Now that we've talked about how optimization works, click the link below to join me for part two of this series, where we step through an example of optimization in a machine learning model.
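
To make the tea example concrete, here is a minimal Python sketch of the two basic approaches from the transcript: exhaustive search and gradient descent. None of this code comes from the video. The `suffering` curve (a simple bowl with its bottom at 60 C), the function names, and the parameter values are all assumptions made up for illustration. The slope is estimated from two cups brewed a small distance either side of the current temperature (a central difference), which is one reasonable way to do it rather than the specific procedure in the video.

```python
def suffering(temp_c):
    """Hypothetical tea 'suffering' curve: a smooth bowl, lowest at 60 C."""
    return (temp_c - 60.0) ** 2 / 100.0


def exhaustive_search(cost, low, high, step):
    """Try every temperature on a grid and keep the one with the lowest cost."""
    n_cups = int(round((high - low) / step)) + 1
    temps = [low + i * step for i in range(n_cups)]
    return min(temps, key=cost), n_cups


def gradient_descent(cost, start, learning_rate=25.0, probe=0.01,
                     tolerance=1e-4, max_steps=1000):
    """Walk downhill, estimating the slope from two nearby cups of tea."""
    temp = start
    n_cups = 0
    for _ in range(max_steps):
        # Brew one cup slightly warmer and one slightly cooler to estimate
        # the slope (assumed central-difference estimate).
        slope = (cost(temp + probe) - cost(temp - probe)) / (2.0 * probe)
        n_cups += 2
        if abs(slope) < tolerance:
            break  # flat ground: no downhill left, so stop
        # Steeper slope means a bigger step, always in the downhill direction.
        temp -= learning_rate * slope
    return temp, n_cups


if __name__ == "__main__":
    grid_temp, grid_cups = exhaustive_search(suffering, 40.0, 100.0, 0.1)
    gd_temp, gd_cups = gradient_descent(suffering, start=90.0)
    print(f"exhaustive search: {grid_temp:.2f} C after {grid_cups} cups")
    print(f"gradient descent:  {gd_temp:.2f} C after {gd_cups} cups")
```

On a smooth single bowl like this one, both methods land near the same temperature, but gradient descent brews far fewer cups. The `learning_rate` and `tolerance` values were picked to suit this particular made-up curve; a different curve would likely need different settings.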
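
The more robust, randomized methods mentioned in the transcript can be sketched in the same spirit. Below is a bare-bones simulated annealing loop in Python, run on a hypothetical two-valley curve with a hot-tea valley near 60 C and a deeper iced-tea valley near 5 C. The curve, the function names, the cooling schedule, and every parameter value are illustrative assumptions, not details from the video. A downhill-only method started at a hot temperature would settle in the hot-tea valley; the random jumps here give the search a chance of reaching the iced-tea valley as well.

```python
import math
import random


def two_valley_suffering(temp_c):
    """Hypothetical curve: a hot-tea valley at 60 C, a deeper iced-tea valley at 5 C."""
    hot = (temp_c - 60.0) ** 2 / 100.0
    iced = (temp_c - 5.0) ** 2 / 100.0 - 2.0
    return min(hot, iced)


def simulated_annealing(cost, start, low, high, n_steps=5000,
                        jump_size=10.0, start_heat=2.0, seed=0):
    """Random jumps that sometimes accept worse tea, less often as 'heat' falls."""
    rng = random.Random(seed)  # seeded so that runs are repeatable
    current = best = start
    for step in range(n_steps):
        # Linear cooling schedule (an assumption); the small constant
        # keeps the heat strictly positive.
        heat = start_heat * (1.0 - step / n_steps) + 1e-9
        # Propose a random jump, clamped to the allowed temperature range.
        candidate = min(high, max(low, current + rng.gauss(0.0, jump_size)))
        change = cost(candidate) - cost(current)
        # Always accept improvements; accept worse cups with a probability
        # that shrinks as the change gets larger or the heat gets lower.
        if change <= 0 or rng.random() < math.exp(-change / heat):
            current = candidate
        # Remember the best temperature seen so far.
        if cost(current) < cost(best):
            best = current
    return best


if __name__ == "__main__":
    best_temp = simulated_annealing(two_valley_suffering, start=90.0,
                                    low=0.0, high=100.0)
    print(f"best temperature found: {best_temp:.1f} C, "
          f"suffering {two_valley_suffering(best_temp):.2f}")
```

This uses many more cups than gradient descent would on a smooth bowl, which matches the trade-off described in the transcript: more steps and more computation in exchange for a search that is harder to trap in a shallow valley.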

Original Description

Part of the End-to-End Machine Learning School course library at http://e2eml.school
See these concepts used in an End to End Machine Learning project: https://end-to-end-machine-learning.teachable.com/p/polynomial-regression-optimization/
Watch the rest of the How Optimization Works series: https://end-to-end-machine-learning.teachable.com/p/building-blocks-how-optimization-works/

