Data Science Gamechanger?

Rob Mulla · Beginner · 📰 AI News & Updates · 1y ago

Key Takeaways

The video explores the newly released cuML accelerated scikit-learn, demonstrating its potential as a gamechanger in data science by comparing CPU and GPU training times and benchmarks.

Full Transcript

Scikit-learn is like the OG of data science Python libraries. It's a really comprehensive library that covers a lot of different algorithms and ways of pre-processing and splitting your data for evaluation. And it has these really cool pipelines that you can build where you transform and then fit models and evaluate them. Because of that, it's usually the first thing people use to teach different algorithms to beginners in data science. And it's written in Python, and Python's not really known for being the fastest of programming languages. So if you're training really big models, some of the algorithms can be pretty slow.

So I was pretty interested when I saw that Nvidia came out with a GPU-accelerated version of scikit-learn. Now this is different than cuML, which is Nvidia's own library for running algorithms on GPUs. It's supposedly a no-code-change implementation of scikit-learn. So you can keep all your scikit-learn code the same and just add one line of code and everything will run on the GPU. And as you probably already know, GPUs are super fast at doing highly parallel computation, which a lot of these algorithms are. So, I'm going to put it to the test and try running the algorithms with this GPU acceleration, see how fast things are, and then give my take on whether I think this is a gamechanger or not.

So, this is the blog post about the scikit-learn version. But if we scroll down here and look just at this line of code, I'm pretty sure this is all that's needed to make your scikit-learn code run on a GPU. First, we do need to install RAPIDS on our machine, and I basically just copied this command. I created a conda environment, which you can see here that I have loaded. It's basically just a Python environment where I'm storing all these packages that I'm installing. So if I install these here, I can see they're all already installed. I think I also need to pip install scikit-learn.
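A quick way to confirm the environment before going further is to check that the packages can be found. The sketch below is only an illustration, not something shown in the video: the helper names `is_importable` and `environment_report` are made up here, and the list uses import names (scikit-learn is imported as `sklearn`, the RAPIDS package as `cuml`, JupyterLab as `jupyterlab`).

```python
import importlib.util

# Import names, not pip names. These three are the packages the video installs;
# the helper names below are hypothetical.
REQUIRED_MODULES = ("sklearn", "cuml", "jupyterlab")


def is_importable(module_name: str) -> bool:
    """Return True if a top-level module can be located, without importing it."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # find_spec can raise for broken or half-initialised modules.
        return False


def environment_report(modules=REQUIRED_MODULES) -> dict:
    """Map each import name to whether it was found in this environment."""
    return {name: is_importable(name) for name in modules}


if __name__ == "__main__":
    for name, found in environment_report().items():
        print(f"{name:12s} {'found' if found else 'MISSING'}")
```

Using `find_spec` on top-level names only locates the packages, so nothing from scikit-learn is actually loaded by running the check.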
Most annoying thing about scikit-learn is that when you import it, it's sklearn. When you pip install it, it's scikit-learn. I'm also pip installing JupyterLab. I'm going to start it up by running jupyter lab. Now, I'm going to be basing some of my tests on the official getting started notebook, but I'm going to be running this locally. So, we're going to just go line by line. Let's start by doing some imports, which all work, so they are installed correctly. I can also do nvidia-smi. This lets me see that I do have a GPU on this machine. Also, in my terminal I really like nvtop because then you can see the GPU usage over time. So, let's load this up and we'll switch over to it when we run stuff on the GPU.

Now, they have in their example this data set that I'm going to download. And just to see the size, let's do df.shape. And we can see it's over half a million rows of data. So, it's pretty decently sized. And if I do a df.info() on it, it's not too huge: about a quarter of a gigabyte in size in memory.

Let's go ahead and do a CPU training model. And we're going to do this by just doing a train test split with 20% of the data held out for validation. And then let's go ahead and time training this random forest classifier from sklearn using my CPU. One thing to keep in mind: my video was kind of stuttering there because it was using a lot of CPU when it trained this model. Now, I am running a machine with 64 threads, so this is a pretty big, CPU-heavy machine. If I just run it again, let's see what the CPU usage looks like. So all that green meant that each of the threads was really being maxed out to train this model. So we're comparing against a pretty strong CPU-based machine, and it took 29 seconds. Now, we're just doing this for demonstration purposes, so it's not too important to run the accuracy score.
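The CPU baseline described here can be sketched roughly as follows. This is not the notebook from the video: the dataset is a synthetic stand-in generated with `make_classification`, the sizes are arbitrary, `n_jobs=-1` is my assumption for how all the threads were used, and the helper names `timed` and `cpu_baseline` are hypothetical.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds) measured on a wall clock."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def cpu_baseline(n_samples=20_000, n_features=20, seed=0):
    """Fit a random forest on the CPU and report how long the fit took."""
    # Imports live inside the function so scikit-learn is only loaded
    # when the benchmark is actually called.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    # Synthetic stand-in for the video's dataset (an assumption).
    X, y = make_classification(
        n_samples=n_samples, n_features=n_features, random_state=seed
    )
    # Hold out 20% of the rows for validation, as in the transcript.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    # n_jobs=-1 asks scikit-learn to use every available CPU core.
    clf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=seed)
    _, fit_seconds = timed(clf.fit, X_train, y_train)
    return {
        "fit_seconds": fit_seconds,
        "accuracy": clf.score(X_test, y_test),
        "n_train": len(X_train),
        "n_test": len(X_test),
    }
```

Calling `cpu_baseline()` returns the fit time along with the held-out accuracy; the absolute number will depend heavily on the machine and the data, so it should not be compared directly with the 29 seconds quoted above.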
Now, if you're doing something like grid search, where you're trying to find the best parameters for the model, you might be running this hundreds or thousands of times to really tune it. And 30 seconds doesn't sound like a lot, but when you're adding that up over a grid search, it could actually be hours of time that you're taking to run and train this model.

Okay, the moment of truth. Let's try adding this special magic command. And of course it doesn't work. Okay, so I'm back. So it turned out I did not have the CUDA toolkit installed, which I guess is important to run CUDA, cuML type stuff. And another thing I realized is that I need to run this before doing the imports of the libraries or it won't work. So now I did this cuml.accel. You can see it's installed the accelerator for sklearn. It's initialized the accelerator. And now I can run the same data processing code as before, but this time when I train my classifier, it should run on the GPU. Look at that. 1 second and it was done training. So pretty impressive. That is a pretty good amount of speed up.

I do think it's important to note that they don't necessarily have every single algorithm implemented in GPU acceleration, but for the ones that they do, they have some benchmarks out there that you can look at. So if you're doing something like K nearest neighbor or ridge regression, you're not going to see a huge speed up by jumping to a GPU. But with something like the random forest classifier that we saw, you can get pretty good benefits from running on a GPU.

So, what are the big takeaways? I think it's pretty cool that you can run some of these algorithms on a GPU with just one line of code change. That's pretty slick. I don't necessarily think this is going to be a huge deal if you're training small models every so often.
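The ordering point (turn the accelerator on first, import scikit-learn second) can be illustrated outside a notebook as well. In a notebook the one-liner is the IPython magic `%load_ext cuml.accel`. The sketch below instead uses `cuml.accel.install()`, which I believe is the programmatic equivalent in recent cuML releases; treat that call as an assumption to check against the installed version. The wrapper `enable_gpu_acceleration` is a hypothetical name.

```python
import sys


def enable_gpu_acceleration() -> bool:
    """Try to switch on cuml.accel; return True only if it was installed."""
    if "sklearn" in sys.modules:
        # Per the transcript, loading the accelerator after the imports
        # does not work, so do not even try in that case.
        print("sklearn is already imported; restart and enable the accelerator first")
        return False
    try:
        import cuml.accel  # provided by the RAPIDS cuML package

        cuml.accel.install()  # assumed programmatic form of %load_ext cuml.accel
        return True
    except Exception as exc:  # missing package, missing CUDA toolkit, no GPU, ...
        print(f"GPU acceleration unavailable, staying on CPU: {exc}")
        return False


gpu_enabled = enable_gpu_acceleration()

# Only after this point should the scikit-learn imports appear, for example:
# from sklearn.ensemble import RandomForestClassifier
```

Falling back to the CPU when the accelerator cannot be loaded keeps the rest of the script unchanged either way, which is the point of the no-code-change design.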
But where I do think it's kind of a big deal is if you're doing some optimization for hyperparameters. Say you're trying to figure out what the best parameters are for your model and you're going to run it on a thousand different parameter combinations to see which one's best; then having this type of speed up is pretty helpful. It also seems like the clustering algorithms really benefit from having the GPU, which makes sense because those can be highly parallelized and they often take a long time to run on CPU as it is.

So definitely worth checking out. If you have a GPU on your machine, then you can just run this pretty easily. Otherwise, if you're running in a Google Colab notebook and it has a GPU available to it, then you can just run it there for free. So, I'm interested to hear what you guys think. Let me know in the comments.
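To put rough numbers on the hyperparameter search argument, here is a back-of-the-envelope sketch. The parameter grid is hypothetical, the fits are assumed to run one after another, and the per-fit times simply reuse the 29 seconds (CPU) and 1 second (GPU) reported in the video for a single fit; a real search would also multiply by the number of cross-validation folds.

```python
def count_candidates(param_grid: dict) -> int:
    """Number of combinations in a grid that maps parameter names to value lists."""
    total = 1
    for values in param_grid.values():
        total *= len(values)
    return total


def search_hours(param_grid: dict, seconds_per_fit: float, cv_folds: int = 1) -> float:
    """Estimated wall-clock hours if every fit runs sequentially."""
    return count_candidates(param_grid) * cv_folds * seconds_per_fit / 3600


# Hypothetical random forest grid: 10 x 10 x 10 = 1000 combinations.
grid = {
    "n_estimators": list(range(100, 1100, 100)),
    "max_depth": list(range(2, 22, 2)),
    "min_samples_leaf": list(range(1, 11)),
}

cpu_hours = search_hours(grid, seconds_per_fit=29)
gpu_hours = search_hours(grid, seconds_per_fit=1)

if __name__ == "__main__":
    print(f"candidates: {count_candidates(grid)}")
    print(f"CPU estimate: {cpu_hours:.1f} hours")
    print(f"GPU estimate: {gpu_hours * 60:.1f} minutes")
```

Under these assumptions a thousand fits is about eight hours on the CPU against roughly a quarter of an hour on the GPU, which is the gap the takeaway is pointing at.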

Original Description

In this video we look at the newly released cuML accelerated scikit-learn. What do you think? Will you be using the GPU accelerated version of scikit-learn? #sklearn #datascience

Timeline:
00:00 Intro
01:30 Env Setup
02:34 Imports and Data
03:15 CPU Training
04:38 GPU Training
05:34 Benchmarks
06:06 My Take

Links to my stuff:
* Youtube: https://youtube.com/@robmulla?sub_confirmation=1
* Discord: https://discord.gg/HZszek7DQc
* Twitch: https://www.twitch.tv/RobCodesLIVE
* Twitter: https://twitter.com/Rob_Mulla
* Kaggle: https://www.kaggle.com/robikscube

This video introduces cuML accelerated scikit-learn and its potential to accelerate data science workflows. It covers environment setup, data imports, CPU and GPU training, and benchmarks. By watching this video, viewers can learn how to leverage GPU acceleration in their machine learning projects.

Key Takeaways
  1. Set up a suitable environment for cuML accelerated scikit-learn
  2. Import necessary libraries and load data
  3. Train models using CPU and GPU acceleration
  4. Compare benchmarks and evaluate performance
💡 GPU acceleration can significantly speed up machine learning workflows, making it a gamechanger for data science projects.

Chapters (7)

0:00 Intro
1:30 Env Setup
2:34 Imports and Data
3:15 CPU Training
4:38 GPU Training
5:34 Benchmarks
6:06 My Take