Data Science Gamechanger?
Key Takeaways
The video explores the newly released cuML accelerated scikit-learn, demonstrating its potential as a gamechanger in data science by comparing CPU and GPU training times and benchmarks.
Full Transcript
Scikit-learn is like the OG of data science Python libraries. It's a really comprehensive library that covers a lot of different algorithms and ways of pre-processing and splitting your data for evaluation. And it has these really cool pipelines that you can build where you transform, then fit models, and evaluate them. Because of that, it's usually the first thing people use to teach different algorithms to beginners in data science. And it's written in Python, and Python's not really known for being the fastest of programming languages. So if you're training really big models, some of the algorithms can be pretty slow. So I was pretty interested when I saw that Nvidia came out with a GPU-accelerated version of scikit-learn. Now, this is different than cuML, which is Nvidia's own library for running algorithms on GPUs. It's supposedly a no-code-change implementation of scikit-learn. So you can keep all your scikit-learn code the same and just add one line of code, and everything will run on the GPU. And as you probably already know, GPUs are super fast at doing highly parallel computation, which a lot of these algorithms are. So, I'm going to put it to the test and try running the algorithms with this GPU acceleration, see how fast things are, and then give my take on whether I think this is a gamechanger or not.
So, this is the blog post about the scikit-learn version. But if we scroll down here and look just at this line of code, I'm pretty sure this is all that's needed to make your scikit-learn code run on a GPU. First, we do need to install RAPIDS on our machine, and I basically just copied this command. I created a conda environment, which you can see here that I have loaded. It's basically just a Python environment where I'm storing all these packages that I'm installing. So if I install these here, I can see they're all already installed. I think I also need to pip install scikit-learn.
The most annoying thing about scikit-learn is that when you import it, it's sklearn, but when you pip install it, it's scikit-learn. I'm also pip installing JupyterLab, and I'm going to start it up by running jupyter lab. Now, I'm going to be basing some of my tests on the official getting started notebook, but I'm going to be running this locally. So, we're going to just go line by line. Let's start by doing some imports, which all work, so they are installed correctly. I can also run nvidia-smi. This lets me see that I do have a GPU on this machine. Also, in my terminal I really like nvtop, because then you can see the GPU usage over time. So, let's load this up and we'll switch over to it when we run stuff on the GPU.
Now, they have in their example this data set that I'm going to download. And just to see the size, let's do df.shape. And we can see it's over half a million rows of data. So, it's pretty decently sized. And if I do a df.info() on it, it's not too huge: about a quarter of a gigabyte in size in memory.
Let's go ahead and train a model on the CPU. We're going to do this by just doing a train test split with 20% of the data held out for validation. And then let's go ahead and time training this random forest classifier from sklearn using my CPU. One thing to keep in mind: my video was kind of stuttering there because it was using a lot of CPU when it trained this model. Now, I am running a machine with 64 threads, so this is a pretty big, CPU-heavy machine. If I just run it again, let's see what the CPU usage looks like. So all that green meant that each of the threads was really being maxed out to train this model. So we're comparing against a pretty strong CPU-based machine, and it took 29 seconds. Now, we're just doing this for demonstration purposes, so it's not too important to run the accuracy score.
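The CPU baseline described above can be sketched roughly as follows. This is not the notebook from the video: the synthetic data from make_classification stands in for the downloaded data set, and n_estimators=50 and n_jobs=-1 are assumptions chosen so the example runs quickly and uses all available CPU threads.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the video's data set (assumption: the real one
# has over half a million rows; this one is kept small on purpose).
X, y = make_classification(n_samples=5_000, n_features=20, random_state=0)

# Hold out 20% of the rows for validation, as described in the transcript.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# n_jobs=-1 asks scikit-learn to use every available CPU thread (assumption).
clf = RandomForestClassifier(n_estimators=50, n_jobs=-1, random_state=0)

# Time only the fit call, which is the step being compared in the video.
start = time.perf_counter()
clf.fit(X_train, y_train)
elapsed = time.perf_counter() - start

print(f"Trained on {X_train.shape[0]} rows in {elapsed:.2f} s")
```

The timing you get will depend on your machine and data size, so it will not match the 29 seconds reported in the video.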
Now, if you're doing something like grid search, where you're trying to find the best parameters for the model, you might be running this hundreds or thousands of times to really tune it. 30 seconds doesn't sound like a lot, but when you're adding that up over a grid search, it could actually be hours of time that you're taking to run and train this model.
Okay, the moment of truth. Let's try adding this special magic command. And of course it doesn't work. Okay, so I'm back. It turned out I did not have the CUDA toolkit installed, which I guess is important to run CUDA, cuML-type stuff. And another thing I realized is that I need to run this before doing the imports of the libraries, or it won't work. So now I did this cuml.accel. You can see it's installed the accelerator for sklearn and it's initialized the accelerator. And now I can run the same data processing code as before, but this time when I train my classifier, it should run on the GPU. Look at that: 1 second and it was done training. So pretty impressive. That is a pretty good amount of speedup.
I do think it's important to note that they don't necessarily have every single algorithm implemented in GPU acceleration, but for the ones that they do, they have some benchmarks out there that you can look at. So if you're doing something like k-nearest neighbors or ridge regression, you're not going to see a huge speedup by jumping to a GPU. But with something like the random forest classifier that we saw, you can get pretty good benefits from running on a GPU.
So, what are the big takeaways? I think it's pretty cool that you can run some of these algorithms on a GPU with just one line of code change. That's pretty slick. I don't necessarily think this is going to be a huge deal if you're training small models every so often.
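The ordering point above (enable the accelerator first, import scikit-learn afterwards) can be sketched like this. In a notebook the one-line form is the magic command %load_ext cuml.accel. The script below uses the programmatic form cuml.accel.install() instead, and it assumes that falling back to plain CPU scikit-learn is acceptable when cuML is not installed. The gpu_enabled flag and the small synthetic data are illustrative, not from the video.

```python
# Enable the accelerator BEFORE anything from sklearn is imported.
try:
    import cuml.accel

    cuml.accel.install()
    gpu_enabled = True
except ImportError:
    # cuML is not installed here, so the code below runs on the CPU unchanged.
    gpu_enabled = False

# Only now import scikit-learn; the code from here on is ordinary sklearn.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic data set (assumption: stands in for the video's data).
X, y = make_classification(n_samples=2_000, n_features=10, random_state=0)

clf = RandomForestClassifier(n_estimators=20, random_state=0)
clf.fit(X, y)

print("Accelerator enabled:", gpu_enabled)
print("Predictions for first 5 rows:", clf.predict(X[:5]))
```

This sketch only catches a missing cuML package. A missing CUDA toolkit, like the one hit in the video, may surface as a different error and would need separate handling.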
But where I do think it's kind of a big deal is if you're doing some optimization for hyperparameters, like you're trying to figure out what the best parameters are for your model and you're going to run it on a thousand different parameters to see which one's best. Then having this type of speedup is pretty helpful. It also seems like the clustering algorithms really benefit from having the GPU, which makes sense because those can be highly parallelized and they often take a long time to run on CPU as it is. So definitely worth checking out. If you have a GPU on your machine, then you can just run this pretty easily. Otherwise, if you're running in a Google Colab notebook and it has a GPU available to it, then you can just run it there for free. So, I'm interested to hear what you guys think. Let me know in the comments.
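To put rough numbers on the hyperparameter search argument, here is a back-of-the-envelope estimate. The 29-second and 1-second fit times are the ones reported in the video. The parameter grid and the 5-fold cross-validation are hypothetical, and the estimate assumes every fit takes the same time, which real grids will not respect.

```python
from itertools import product

# Hypothetical search grid (not from the video): 4 * 4 * 3 = 48 combinations.
param_grid = {
    "n_estimators": [100, 200, 500, 1000],
    "max_depth": [5, 10, 20, None],
    "min_samples_leaf": [1, 2, 5],
}
cv_folds = 5  # assumed number of cross-validation folds

# Each combination is fitted once per fold.
n_combinations = len(list(product(*param_grid.values())))
n_fits = n_combinations * cv_folds

# Per-fit timings reported in the video for the random forest example.
cpu_seconds_per_fit = 29
gpu_seconds_per_fit = 1

cpu_hours = n_fits * cpu_seconds_per_fit / 3600
gpu_minutes = n_fits * gpu_seconds_per_fit / 60

print(f"{n_fits} fits: about {cpu_hours:.1f} h on CPU, {gpu_minutes:.0f} min on GPU")
```

Under these assumptions the search is 240 fits, which works out to a little under two hours on the CPU versus about four minutes on the GPU.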
Original Description
In this video we look at the newly released cuML accelerated scikit-learn. What do you think? Will you be using the GPU accelerated version of scikit-learn?
#sklearn #datascience
Timeline:
00:00 Intro
01:30 Env Setup
02:34 Imports and Data
03:15 CPU Training
04:38 GPU Training
05:34 Benchmarks
06:06 My Take
Links to my stuff:
* Youtube: https://youtube.com/@robmulla?sub_confirmation=1
* Discord: https://discord.gg/HZszek7DQc
* Twitch: https://www.twitch.tv/RobCodesLIVE
* Twitter: https://twitter.com/Rob_Mulla
* Kaggle: https://www.kaggle.com/robikscube