Speed Up Your Code with Loop Hoisting : Data Science Code

ritvikmath · Intermediate ·📐 ML Fundamentals ·6y ago

Key Takeaways

This video demonstrates the use of loop hoisting to speed up code in data science, specifically in image processing tasks such as calculating mean and standard deviation for each row of pixels and normalizing pixels. The technique is applied using Python with libraries like numpy and matplotlib.

Full Transcript

alright everybody in this video we're gonna continue our code speed-up series with a technique called loop hoisting now loop hoisting is not a technique that's very difficult to understand but I still think that it's something that even the most experienced software engineers can overlook from time to time so it's gonna be important to talk about it a little bit first of all let's give a context so we're not just talking about coding for coding sake let's say we're doing some image processing and we have this image which is 700 pixels wide by 700 pixels tall and if some little creature right here now our goal is to do some kind of normalization on this image and by normalization we mean that let's say we're talking about a specific pixel let's say the one that's pointed to by the red arrow so some pixel that's part of the creatures hand may be now this pixel belongs to a row a row of pixels and this row of pixels has a mean and a standard deviation let's say the mean of the row of pixels is mu let's say the standard deviation of the row pixels is Sigma and let's say that the value of the pixel that we care about right now is X now as you might see coming we're going to normalize it or standardize it by just doing X minus the mean divided by the standard deviation so you'll remember this formula from your statistics classes as the z-score or normalization pretty simple to understand okay and we want to do this for every single pixel in the image so for any given pixel that pixel belongs to a row we're going to take the mean of the pixels in that row we're going to take the standard deviation of pixels in that row and we're going to do this simple transformation to normalize the entire image so hopefully the idea is clear let's look at this code which proposed and let's see if it's correct and if it is correct let's see if it's efficient or if we can make it faster so let me explain the code as I've written it we have a double for loop one is looping through all of the rows so the rows given my I for row in 1 2 all the way to the 700th row the inner for loop loops over all of the columns given some rows so J from 1 to 700 again so these two for loops together just iterate over every pixel in the entire image now for any given pixel we have a couple a little bit of work to do we'll calculate the mean of the row that the pixel belongs to so this my image at index I is basically saying give me the I throw of my image and take the mean of it store that mean in this variable called row mean do the same thing for standard deviation for the I throw of the image take the standard deviation and store it in a variable called row SD row standard deviation and then simply for this given pixel just take it subtract the mean of its row divided by the standard deviation of its row and store that transform value back into the pixel and that should solve my problem right it seems reasonable it seems like we're doing exactly what we outline mathematically but when we stop and think about the code and think about where these two statements are living right now compared to where they should actually be living we start to see some problems there's definitely an efficiency concern which you might have seen already but there's also first and foremost a correctness concern this code isn't actually going to work the way we think it's going to work and why is that just to look at that let's look at a given row let's say we're dealing with some row of pixels and of course there's 700 pixels in that row for the first pixel in that row there's actually not a problem because we're going to come in to these two for loops we're gonna calculate the mean of this row we're gonna calculate the standard deviation of this row and we're gonna reset the first pixel as its transformed value no problem now what happens with the second pixel with the second pixel we come back inside these two for loops we recompute the mean of the row and the standard deviation of the row but notice these values have changed they're not the original mean and standard deviation of that row before we did anything they've changed slightly because since we changed that first pixel we've edited these values ever so slightly which means that the new normalized value of that second pixel isn't going to be exactly correct and the problem just gets worse and worse and worse as we move to further pixels down this row we're basically taking into account all the transformed values in this row from before rather than the original values which we wanted to do and so that as we progress through this row we're just getting more and more wrong results as we go forward so that's the correctness issue and the efficiency issue also is very tied to this the efficiency issue is this we come inside these two four loops and we compute a mean and we compute a standard deviation now notice things both are just functions of AI which makes sense right because we're saying the mean of the row and the mean of the standard deviation is known once you know what the row is you don't need to know what a column you're in according to what we're trying to do here so why is it that we're calculating these 700 times when you all you need to know is the row as soon as you know which eye you're in you can go ahead and compute these two guys and then use those two values for every single column that you're in in fact by doing it this way not only are we wrong in our final result we're also very inefficient because we're calculating row mean and rows standard deviation 700 times we're basically 700 times for each row working even though we should just calculate at one time so here's where the hoisting part comes in we're gonna take these two statements here and hoist them out to this level of the two for loops which means right after we know which row are in right after we pick RI we're gonna calculate the mean and standard deviation of that row and use those values inside this part of the for loop which still is nested two layers deep and so that solves our efficiency problem because we're not calculating something 700 times anymore we're just calculating it one time per row and it also solves our correctness issue because now when we do this transformation we're using the row mean and rows standard deviation that were calculated at the very beginning as soon as we knew which row where you're in so we are not updating these values over time they're fixed once we know the row and then we use that value over over and over again so now we're doing a true normalization rather than this problematic normalization from before so hopefully that was easier easy for you to see how hoisting some set of statements from one level of a nested for loop to some level above or some levels above if you have like three or four nested for-loops can really help speed up your code and also solve these correctness issues that are not always easy to spot but don't take my word for it let's go ahead and look at the code and see what kind of speed ups we can achieve alright so let's take a code look at how loop hoisting can help us speed up our code I'm going to be using Python again for this analysis we're doing some image processing type stuff so kind of exciting let me go through the imports so we're going to import matplotlib dot image as MP IMG this helps us read and write images numpy helps deal with lists time is going to help us measure the time of each one and matplotlib top high plot as plc helps us show images again so first thing we'll read in this image called mouse jpg we show the image so this is the image we'll be working on let's look at the dimensions it's a 700 700 pixels by 700 pixels by 3 channels so if we're not familiar with how images work this picture is 700 pixels tall or long 700 pixels wide and each pixel has three values red green and blue and different combinations of those values give you the different colors you're seeing in the image so it's not too important that you understand that but it is cool so attempt one remember we're trying to normalize each pixel relative to its row so we go ahead and create a copy of this image so we don't act on the original one we start the timer and we see this double for loop as we did before so for I in range 700 so going over all the rows going over all the columns with our j and arrange 700 we get the mean and standard deviation pixel values of that row so we get row mean is equal to the mean of this rows pixels so my IMG with the subscript I means giving the entire row so that's 700 pixels of any given row in the image we get the mean of that rows pixels we also get the standard deviation of that rows pixels why do I add this point 0 0 1 the reason is because if the standard deviation is 0 when I try to divide by it I'm going to have a bad time so we just give it a little bit of extra boost and then we scaled this pixel by its rows mean value so we say that in the copy over image that new pixel at IJ is going to be the original pixel at IJ minus the row mean divided by the standard deviation of the row so you can think of this is z-score kind of thing in your stats class or just a normalization type thing however you want to think about it so we do that for all of the pixels we measure how long it took and this run took about 56 seconds to get its work done now remember there's a big inefficiency here when we calculate the row mean and rows standard deviation they're only functions of I yet I'm calculating them for every single J even when is fake so in the first iteration when I is 0 then J goes from 0 to 699 I'm calculating row mean in row standard deviation this 700 times even though I only had to calculate it one time once my I was 0 so that's where we're going to speed it up and just before I go to the speed-up if we show what that image looks like it looks like this so it looks kind of like a negative maybe so attempt 2 is normalized each pixel relative to its row again the only change we make is we take those two lines which were happening in every iteration of these two for loops and move it out one layer so that it's only happening once based on when I changes so if I is zero we calculate the row mean and row standard deviation for the zeroeth row or the first row I guess and then we calculate this my image copy IJ using those values for every single J from 0 to 699 that's the only change we made we calculate the time again we see that we went from almost a minute to just about three seconds so we're getting a crazy speed up here as well and we showed the image we see it's the exact same image we got before so the correctness is true in both cases it's just that in one case it took us about 20 times faster all right so this is just a quick idea of how loop hoisting can help speed up your code and until next time

Original Description

How to use loop hoisting to speed up your code!
Watch on YouTube ↗ (saves to browser)
Sign in to unlock AI tutor explanation · ⚡30

Playlist

Uploads from ritvikmath · ritvikmath · 0 of 60

← Previous Next →
1 Math Team Update
Math Team Update
ritvikmath
2 Single Variable Calculus Volume of a Sphere - Proof 1
Single Variable Calculus Volume of a Sphere - Proof 1
ritvikmath
3 Single Variable Calculus Volume of a Sphere - Proof 2
Single Variable Calculus Volume of a Sphere - Proof 2
ritvikmath
4 Multivariable Calculus Volume of a Sphere Proof - Triple Integrals
Multivariable Calculus Volume of a Sphere Proof - Triple Integrals
ritvikmath
5 Multivariable Calculus Volume of a Sphere Proof - Double Integrals
Multivariable Calculus Volume of a Sphere Proof - Double Integrals
ritvikmath
6 The Euclidian Algorithm
The Euclidian Algorithm
ritvikmath
7 Proving the Chain Rule
Proving the Chain Rule
ritvikmath
8 Proving the Fundamental Theorem of Calculus Part 1
Proving the Fundamental Theorem of Calculus Part 1
ritvikmath
9 Proving the Fundamental Theorem of Calculus Part 2
Proving the Fundamental Theorem of Calculus Part 2
ritvikmath
10 Math Puzzle - Poison Perplexity
Math Puzzle - Poison Perplexity
ritvikmath
11 Math Puzzle - Poison Perplexity - Solution
Math Puzzle - Poison Perplexity - Solution
ritvikmath
12 Expected Value and Variance of Continuous Random Variables (Calculus)
Expected Value and Variance of Continuous Random Variables (Calculus)
ritvikmath
13 Expected Value and Variance of Discrete Random Variables (No Calculus)
Expected Value and Variance of Discrete Random Variables (No Calculus)
ritvikmath
14 Array Method
Array Method
ritvikmath
15 Complex Power Series and their Derivatives
Complex Power Series and their Derivatives
ritvikmath
16 Distributions - Intro
Distributions - Intro
ritvikmath
17 The Poisson Distribution
The Poisson Distribution
ritvikmath
18 The Bernoulli Distribution
The Bernoulli Distribution
ritvikmath
19 The Binomial Distribution
The Binomial Distribution
ritvikmath
20 The Continuous Uniform Distribution
The Continuous Uniform Distribution
ritvikmath
21 The Geometric Distribution
The Geometric Distribution
ritvikmath
22 The Triangular Distribution
The Triangular Distribution
ritvikmath
23 The Exponential Distribution
The Exponential Distribution
ritvikmath
24 The Borel Distribution + Notes on Poisson Distribution
The Borel Distribution + Notes on Poisson Distribution
ritvikmath
25 The Gamma Distribution
The Gamma Distribution
ritvikmath
26 The Normal Distribution
The Normal Distribution
ritvikmath
27 The Laplace Distribution
The Laplace Distribution
ritvikmath
28 The Chi - Squared Distribution
The Chi - Squared Distribution
ritvikmath
29 Overfitting
Overfitting
ritvikmath
30 Vector Norms
Vector Norms
ritvikmath
31 Truths Behind the Titanic : K-Nearest Neighbor
Truths Behind the Titanic : K-Nearest Neighbor
ritvikmath
32 The Mathematics of Breakups
The Mathematics of Breakups
ritvikmath
33 Sillyfish
Sillyfish
ritvikmath
34 Finding Optimal Paths - Dynamic Programming
Finding Optimal Paths - Dynamic Programming
ritvikmath
35 HowToDataScience : Scraping Twitter Data
HowToDataScience : Scraping Twitter Data
ritvikmath
36 Decision Trees
Decision Trees
ritvikmath
37 Perceptron
Perceptron
ritvikmath
38 Naive Bayes
Naive Bayes
ritvikmath
39 K-Nearest Neighbor
K-Nearest Neighbor
ritvikmath
40 Evaluating Machine Learning Models
Evaluating Machine Learning Models
ritvikmath
41 Decision Tree Pruning
Decision Tree Pruning
ritvikmath
42 K-Means Clustering
K-Means Clustering
ritvikmath
43 Gaussian Mixture Model
Gaussian Mixture Model
ritvikmath
44 Data Science - Fuzzy Record Matching
Data Science - Fuzzy Record Matching
ritvikmath
45 Time Series Talk : Autocorrelation and Partial Autocorrelation
Time Series Talk : Autocorrelation and Partial Autocorrelation
ritvikmath
46 Time Series Talk : Autoregressive Model
Time Series Talk : Autoregressive Model
ritvikmath
47 Time Series Talk : Moving Average Model
Time Series Talk : Moving Average Model
ritvikmath
48 Time Series Talk : ARMA Model
Time Series Talk : ARMA Model
ritvikmath
49 Time Series Talk : ARCH Model
Time Series Talk : ARCH Model
ritvikmath
50 Time Series Talk : White Noise
Time Series Talk : White Noise
ritvikmath
51 Time Series Talk : Stationarity
Time Series Talk : Stationarity
ritvikmath
52 Time Series Talk : ARIMA Model
Time Series Talk : ARIMA Model
ritvikmath
53 Time Series Talk : Lag Operator
Time Series Talk : Lag Operator
ritvikmath
54 Time Series Talk : What is Seasonality ?
Time Series Talk : What is Seasonality ?
ritvikmath
55 Time Series Talk : Seasonal ARIMA Model
Time Series Talk : Seasonal ARIMA Model
ritvikmath
56 So ... What Actually is a Matrix ? : Data Science Basics
So ... What Actually is a Matrix ? : Data Science Basics
ritvikmath
57 Derivative of a Matrix : Data Science Basics
Derivative of a Matrix : Data Science Basics
ritvikmath
58 Basics of PCA (Principal Component Analysis) : Data Science Concepts
Basics of PCA (Principal Component Analysis) : Data Science Concepts
ritvikmath
59 Eigenvalues & Eigenvectors : Data Science Basics
Eigenvalues & Eigenvectors : Data Science Basics
ritvikmath
60 The Covariance Matrix : Data Science Basics
The Covariance Matrix : Data Science Basics
ritvikmath

This video teaches how to use loop hoisting to speed up code in data science, specifically in image processing tasks, and demonstrates its application using Python with libraries like numpy and matplotlib. Loop hoisting can significantly improve the efficiency and correctness of the code. By applying this technique, data scientists can optimize their code for better performance.

Key Takeaways
  1. Iterate over every pixel in the image using two nested for loops
  2. Calculate the mean of each row of pixels
  3. Calculate the standard deviation of each row of pixels
  4. Normalize each pixel by subtracting mean and dividing by standard deviation
  5. Hoist the calculation of mean and standard deviation of a row to the outer level of the nested for loops
  6. Use the fixed row mean and standard deviation values for normalization in each iteration
💡 Loop hoisting can significantly improve the efficiency and correctness of the code by reordering loops and reducing redundant calculations.

Related AI Lessons

Up next
Learn Deep Learning by Hand (Beginner's Guide - Part 1)
Thu Vu
Watch →