Speed Up Your Code with Loop Hoisting : Data Science Code

ritvikmath · Intermediate ·📐 ML Fundamentals ·6y ago

Skills: ML Maths Basics90%

Key Takeaways

This video demonstrates the use of loop hoisting to speed up code in data science, specifically in image processing tasks such as calculating mean and standard deviation for each row of pixels and normalizing pixels. The technique is applied using Python with libraries like numpy and matplotlib.

Full Transcript

alright everybody in this video we're gonna continue our code speed-up series with a technique called loop hoisting now loop hoisting is not a technique that's very difficult to understand but I still think that it's something that even the most experienced software engineers can overlook from time to time so it's gonna be important to talk about it a little bit first of all let's give a context so we're not just talking about coding for coding sake let's say we're doing some image processing and we have this image which is 700 pixels wide by 700 pixels tall and if some little creature right here now our goal is to do some kind of normalization on this image and by normalization we mean that let's say we're talking about a specific pixel let's say the one that's pointed to by the red arrow so some pixel that's part of the creatures hand may be now this pixel belongs to a row a row of pixels and this row of pixels has a mean and a standard deviation let's say the mean of the row of pixels is mu let's say the standard deviation of the row pixels is Sigma and let's say that the value of the pixel that we care about right now is X now as you might see coming we're going to normalize it or standardize it by just doing X minus the mean divided by the standard deviation so you'll remember this formula from your statistics classes as the z-score or normalization pretty simple to understand okay and we want to do this for every single pixel in the image so for any given pixel that pixel belongs to a row we're going to take the mean of the pixels in that row we're going to take the standard deviation of pixels in that row and we're going to do this simple transformation to normalize the entire image so hopefully the idea is clear let's look at this code which proposed and let's see if it's correct and if it is correct let's see if it's efficient or if we can make it faster so let me explain the code as I've written it we have a double for loop one is looping through all of the rows so the rows given my I for row in 1 2 all the way to the 700th row the inner for loop loops over all of the columns given some rows so J from 1 to 700 again so these two for loops together just iterate over every pixel in the entire image now for any given pixel we have a couple a little bit of work to do we'll calculate the mean of the row that the pixel belongs to so this my image at index I is basically saying give me the I throw of my image and take the mean of it store that mean in this variable called row mean do the same thing for standard deviation for the I throw of the image take the standard deviation and store it in a variable called row SD row standard deviation and then simply for this given pixel just take it subtract the mean of its row divided by the standard deviation of its row and store that transform value back into the pixel and that should solve my problem right it seems reasonable it seems like we're doing exactly what we outline mathematically but when we stop and think about the code and think about where these two statements are living right now compared to where they should actually be living we start to see some problems there's definitely an efficiency concern which you might have seen already but there's also first and foremost a correctness concern this code isn't actually going to work the way we think it's going to work and why is that just to look at that let's look at a given row let's say we're dealing with some row of pixels and of course there's 700 pixels in that row for the first pixel in that row there's actually not a problem because we're going to come in to these two for loops we're gonna calculate the mean of this row we're gonna calculate the standard deviation of this row and we're gonna reset the first pixel as its transformed value no problem now what happens with the second pixel with the second pixel we come back inside these two for loops we recompute the mean of the row and the standard deviation of the row but notice these values have changed they're not the original mean and standard deviation of that row before we did anything they've changed slightly because since we changed that first pixel we've edited these values ever so slightly which means that the new normalized value of that second pixel isn't going to be exactly correct and the problem just gets worse and worse and worse as we move to further pixels down this row we're basically taking into account all the transformed values in this row from before rather than the original values which we wanted to do and so that as we progress through this row we're just getting more and more wrong results as we go forward so that's the correctness issue and the efficiency issue also is very tied to this the efficiency issue is this we come inside these two four loops and we compute a mean and we compute a standard deviation now notice things both are just functions of AI which makes sense right because we're saying the mean of the row and the mean of the standard deviation is known once you know what the row is you don't need to know what a column you're in according to what we're trying to do here so why is it that we're calculating these 700 times when you all you need to know is the row as soon as you know which eye you're in you can go ahead and compute these two guys and then use those two values for every single column that you're in in fact by doing it this way not only are we wrong in our final result we're also very inefficient because we're calculating row mean and rows standard deviation 700 times we're basically 700 times for each row working even though we should just calculate at one time so here's where the hoisting part comes in we're gonna take these two statements here and hoist them out to this level of the two for loops which means right after we know which row are in right after we pick RI we're gonna calculate the mean and standard deviation of that row and use those values inside this part of the for loop which still is nested two layers deep and so that solves our efficiency problem because we're not calculating something 700 times anymore we're just calculating it one time per row and it also solves our correctness issue because now when we do this transformation we're using the row mean and rows standard deviation that were calculated at the very beginning as soon as we knew which row where you're in so we are not updating these values over time they're fixed once we know the row and then we use that value over over and over again so now we're doing a true normalization rather than this problematic normalization from before so hopefully that was easier easy for you to see how hoisting some set of statements from one level of a nested for loop to some level above or some levels above if you have like three or four nested for-loops can really help speed up your code and also solve these correctness issues that are not always easy to spot but don't take my word for it let's go ahead and look at the code and see what kind of speed ups we can achieve alright so let's take a code look at how loop hoisting can help us speed up our code I'm going to be using Python again for this analysis we're doing some image processing type stuff so kind of exciting let me go through the imports so we're going to import matplotlib dot image as MP IMG this helps us read and write images numpy helps deal with lists time is going to help us measure the time of each one and matplotlib top high plot as plc helps us show images again so first thing we'll read in this image called mouse jpg we show the image so this is the image we'll be working on let's look at the dimensions it's a 700 700 pixels by 700 pixels by 3 channels so if we're not familiar with how images work this picture is 700 pixels tall or long 700 pixels wide and each pixel has three values red green and blue and different combinations of those values give you the different colors you're seeing in the image so it's not too important that you understand that but it is cool so attempt one remember we're trying to normalize each pixel relative to its row so we go ahead and create a copy of this image so we don't act on the original one we start the timer and we see this double for loop as we did before so for I in range 700 so going over all the rows going over all the columns with our j and arrange 700 we get the mean and standard deviation pixel values of that row so we get row mean is equal to the mean of this rows pixels so my IMG with the subscript I means giving the entire row so that's 700 pixels of any given row in the image we get the mean of that rows pixels we also get the standard deviation of that rows pixels why do I add this point 0 0 1 the reason is because if the standard deviation is 0 when I try to divide by it I'm going to have a bad time so we just give it a little bit of extra boost and then we scaled this pixel by its rows mean value so we say that in the copy over image that new pixel at IJ is going to be the original pixel at IJ minus the row mean divided by the standard deviation of the row so you can think of this is z-score kind of thing in your stats class or just a normalization type thing however you want to think about it so we do that for all of the pixels we measure how long it took and this run took about 56 seconds to get its work done now remember there's a big inefficiency here when we calculate the row mean and rows standard deviation they're only functions of I yet I'm calculating them for every single J even when is fake so in the first iteration when I is 0 then J goes from 0 to 699 I'm calculating row mean in row standard deviation this 700 times even though I only had to calculate it one time once my I was 0 so that's where we're going to speed it up and just before I go to the speed-up if we show what that image looks like it looks like this so it looks kind of like a negative maybe so attempt 2 is normalized each pixel relative to its row again the only change we make is we take those two lines which were happening in every iteration of these two for loops and move it out one layer so that it's only happening once based on when I changes so if I is zero we calculate the row mean and row standard deviation for the zeroeth row or the first row I guess and then we calculate this my image copy IJ using those values for every single J from 0 to 699 that's the only change we made we calculate the time again we see that we went from almost a minute to just about three seconds so we're getting a crazy speed up here as well and we showed the image we see it's the exact same image we got before so the correctness is true in both cases it's just that in one case it took us about 20 times faster all right so this is just a quick idea of how loop hoisting can help speed up your code and until next time

Original Description

How to use loop hoisting to speed up your code!

Watch on YouTube ↗ (saves to browser)

Sign in to unlock AI tutor explanation · ⚡30

Playlist

Uploads from ritvikmath · ritvikmath · 0 of 60

← Previous Next →

Math Team Update

Math Team Update

Single Variable Calculus Volume of a Sphere - Proof 1

Single Variable Calculus Volume of a Sphere - Proof 1

Single Variable Calculus Volume of a Sphere - Proof 2

Single Variable Calculus Volume of a Sphere - Proof 2

Multivariable Calculus Volume of a Sphere Proof - Triple Integrals

Multivariable Calculus Volume of a Sphere Proof - Triple Integrals

Multivariable Calculus Volume of a Sphere Proof - Double Integrals

Multivariable Calculus Volume of a Sphere Proof - Double Integrals

The Euclidian Algorithm

The Euclidian Algorithm

Proving the Chain Rule

Proving the Chain Rule

Proving the Fundamental Theorem of Calculus Part 1

Proving the Fundamental Theorem of Calculus Part 1

Proving the Fundamental Theorem of Calculus Part 2

Proving the Fundamental Theorem of Calculus Part 2

Math Puzzle - Poison Perplexity

Math Puzzle - Poison Perplexity

Math Puzzle - Poison Perplexity - Solution

Math Puzzle - Poison Perplexity - Solution

Expected Value and Variance of Continuous Random Variables (Calculus)

Expected Value and Variance of Continuous Random Variables (Calculus)

Expected Value and Variance of Discrete Random Variables (No Calculus)

Expected Value and Variance of Discrete Random Variables (No Calculus)

Complex Power Series and their Derivatives

Complex Power Series and their Derivatives

Distributions - Intro

Distributions - Intro

The Poisson Distribution

The Poisson Distribution

The Bernoulli Distribution

The Bernoulli Distribution

The Binomial Distribution

The Binomial Distribution

The Continuous Uniform Distribution

The Continuous Uniform Distribution

The Geometric Distribution

The Geometric Distribution

The Triangular Distribution

The Triangular Distribution

The Exponential Distribution

The Exponential Distribution

The Borel Distribution + Notes on Poisson Distribution

The Borel Distribution + Notes on Poisson Distribution

The Gamma Distribution

The Gamma Distribution

The Normal Distribution

The Normal Distribution

The Laplace Distribution

The Laplace Distribution

The Chi - Squared Distribution

The Chi - Squared Distribution

Truths Behind the Titanic : K-Nearest Neighbor

Truths Behind the Titanic : K-Nearest Neighbor

The Mathematics of Breakups

The Mathematics of Breakups

Finding Optimal Paths - Dynamic Programming

Finding Optimal Paths - Dynamic Programming

HowToDataScience : Scraping Twitter Data

HowToDataScience : Scraping Twitter Data

K-Nearest Neighbor

K-Nearest Neighbor

Evaluating Machine Learning Models

Evaluating Machine Learning Models

Decision Tree Pruning

Decision Tree Pruning

K-Means Clustering

K-Means Clustering

Gaussian Mixture Model

Gaussian Mixture Model

Data Science - Fuzzy Record Matching

Data Science - Fuzzy Record Matching

Time Series Talk : Autocorrelation and Partial Autocorrelation

Time Series Talk : Autocorrelation and Partial Autocorrelation

Time Series Talk : Autoregressive Model

Time Series Talk : Autoregressive Model

Time Series Talk : Moving Average Model

Time Series Talk : Moving Average Model

Time Series Talk : ARMA Model

Time Series Talk : ARMA Model

Time Series Talk : ARCH Model

Time Series Talk : ARCH Model

Time Series Talk : White Noise

Time Series Talk : White Noise

Time Series Talk : Stationarity

Time Series Talk : Stationarity

Time Series Talk : ARIMA Model

Time Series Talk : ARIMA Model

Time Series Talk : Lag Operator

Time Series Talk : Lag Operator

Time Series Talk : What is Seasonality ?

Time Series Talk : What is Seasonality ?

Time Series Talk : Seasonal ARIMA Model

Time Series Talk : Seasonal ARIMA Model

So ... What Actually is a Matrix ? : Data Science Basics

So ... What Actually is a Matrix ? : Data Science Basics

Derivative of a Matrix : Data Science Basics

Derivative of a Matrix : Data Science Basics

Basics of PCA (Principal Component Analysis) : Data Science Concepts

Basics of PCA (Principal Component Analysis) : Data Science Concepts

Eigenvalues & Eigenvectors : Data Science Basics

Eigenvalues & Eigenvectors : Data Science Basics

The Covariance Matrix : Data Science Basics

The Covariance Matrix : Data Science Basics

This video teaches how to use loop hoisting to speed up code in data science, specifically in image processing tasks, and demonstrates its application using Python with libraries like numpy and matplotlib. Loop hoisting can significantly improve the efficiency and correctness of the code. By applying this technique, data scientists can optimize their code for better performance.

Key Takeaways

Iterate over every pixel in the image using two nested for loops
Calculate the mean of each row of pixels
Calculate the standard deviation of each row of pixels
Normalize each pixel by subtracting mean and dividing by standard deviation
Hoist the calculation of mean and standard deviation of a row to the outer level of the nested for loops
Use the fixed row mean and standard deviation values for normalization in each iteration

💡 Loop hoisting can significantly improve the efficiency and correctness of the code by reordering loops and reducing redundant calculations.

🔒 Pro feature: Ask AI to explain this lesson →

More on: ML Maths Basics

View skill →

Coding the GARCH Model : Time Series Talk

Coding the GARCH Model : Time Series Talk

Important Steps I Have Followed To Improve My Data Science Skills- Sharing My Experience

Important Steps I Have Followed To Improve My Data Science Skills- Sharing My Experience

Learn Python FAST for Beginners 🚀#coding #conditionals #loops #functions

Learn Python FAST for Beginners 🚀#coding #conditionals #loops #functions

ChethanAIChronicles

“Hello, world” from scratch on a 6502 — Part 1

“Hello, world” from scratch on a 6502 — Part 1

PCA (Principal Component Analysis) in Python - Machine Learning From Scratch 11 - Python Tutorial

PCA (Principal Component Analysis) in Python - Machine Learning From Scratch 11 - Python Tutorial

ROC and AUC in R

ROC and AUC in R

StatQuest with Josh Starmer

Related AI Lessons

Stop Overfitting With Basically One Line of Code

Learn to prevent overfitting with a simple code tweak and understand the difference between Ridge and Lasso regression

Stop Overfitting With Basically One Line of Code

Learn to prevent overfitting in machine learning models with a simple code tweak and understand the difference between Ridge and Lasso regression

Medium · Machine Learning

Why Your Python Loops Are Creating the Wrong Functions

Learn why Python loops create functions with the same value and how to fix it using default argument capture and factory functions

Answer Calculator: Step-by-Step Math Help

Learn to use an Answer Calculator for step-by-step math help, making it a valuable tool for late-night studying or work

Learn Deep Learning by Hand (Beginner's Guide - Part 1)