Scaling Python Analytics: NVIDIA cuPyNumeric and Legate Boost for HPC | NVIDIA GTC D.C.
Key Takeaways
Scales Python analytics workflows using NVIDIA cuPyNumeric and Legate Boost for HPC
Full Transcript
Hello everyone, and thank you for coming to Scaling Python Analytics. We're going to look at some pretty interesting stuff: how to scale your NumPy scripts, your pandas scripts, and your XGBoost training for machine learning. My name is Daniel. I'm a technical program manager on the CUDA Python team. I work on the many Python libraries we have for scaling CUDA, from single-node core CUDA kernels up to thousands of GPUs, and that is what we're looking at today: distributed analytics.
By the end of the talk you'll know how to take an existing NumPy script and run it not only on a single CPU or a single GPU, but on a single node with multiple GPUs, with basically no code changes, and on many nodes and many GPUs, up to tens of thousands of GPUs if that's something you have access to these days. The second thing you'll learn is how to use Legate DataFrame and Legate Boost to scale ETL and machine learning analytics pipelines. Hopefully the names give you an idea of what they are: data frames on top of Legate, and boosting algorithms on top of Legate. We'll look at what Legate is in a minute. The Legate Python libraries let you scale, like I said, from a single CPU. Mac, Windows, whatever you have locally, you can develop there.
You can take that same script and scale it to a single GPU, for example a DGX Spark, the little golden new thing that we're selling, or any graphics GPU, or one of the data center GPUs like an H100. If you have a DGX machine with several of those, you can use them too, and if you have a data center or a supercomputer, you can use the same libraries and the same code with basically no code changes, just a small change in how you deploy it. We'll look at all these details. You keep doing the same things you do today, like data cleaning, feature engineering, linear algebra with NumPy, Monte Carlo simulations, optimization, and equation solving. You have the same APIs, and you can scale them to terabytes and terabytes of data. That's the important thing.
So I wanted to ask: how many lines of code do you think you need to change today, without any of the things I'm going to show you, to take a NumPy script and run it on one GPU? That's a pretty common question. You can actually do that pretty easily on a single GPU using CuPy. You change the NumPy import to CuPy, and on a single GPU that has a really good chance of working.
What happens if you want multiple GPUs? CuPy is a single-GPU library, so you can't leverage it for that. You might need to start manually chunking the data you read from your Parquet files or HDF5 files, and that's not easy. And what if you don't have multiple GPUs and only want to scale to multiple CPUs? Today you can maybe solve that with Dask, where you have to start using Dask DataFrames, or with Spark and Spark DataFrames. That really changes the workflow you have today, and maybe those aren't the systems you most like working with.
You like to write plain, native Python code, and you don't want to deal with distributed-systems interfaces. The important thing is that going from CPU to GPU is actually pretty easy these days. For boosting algorithms you can use XGBoost, which has really good support for that; we'll look at it later. For data frames, the RAPIDS team has cudf.pandas, which is basically a drop-in replacement for pandas, and now for Polars as well. The RAPIDS team also has many ways to scale that to single-node multi-GPU, but there isn't a story yet for distributed, multi-node work.
So what if I told you that you can drastically reduce the number of lines of code you need to change? You still need to think about how the system is deployed, of course; we can't solve everything with a magic wand. But the actual logic, the mental model you have to keep in mind, is the same one you use for a single CPU with all the Python libraries in the PyData ecosystem.
If there's one thing you take out of this session, it's that you can take a NumPy script and change the import. Instead of import numpy as np, you write import cupynumeric as np, and most of the time it's going to run. If I want to run it on my Mac, for example, I just do python my_program.py and it runs on a single CPU, totally fine. If I have a server with four GPUs, I change the execution command: I do legate --gpus 4, and it uses the four GPUs in my system. Same script, nothing really changed, and it started to leverage the GPUs for computing.
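As a minimal sketch of the import swap described here, the script below tries cupynumeric first and falls back to NumPy so the same file still runs on a machine without it. The fallback, the trapezoid-rule example, and the file name my_program.py are illustrative choices, not something shown in the talk.

```python
# my_program.py (hypothetical file name)
try:
    import cupynumeric as np  # NumPy-compatible arrays on top of Legate
    BACKEND = "cupynumeric"
except ImportError:
    import numpy as np  # fallback so the sketch runs anywhere
    BACKEND = "numpy"


def integrate_x_squared(n_points=1001):
    """Trapezoid-rule estimate of the integral of x**2 over [0, 1]."""
    x = np.linspace(0.0, 1.0, n_points)
    y = x * x
    dx = 1.0 / (n_points - 1)
    # Average neighbouring samples, scale by the step, and add everything up.
    return float(((y[:-1] + y[1:]) * 0.5 * dx).sum())


if __name__ == "__main__":
    print(BACKEND, integrate_x_squared())
```

Following the commands quoted in the talk, python my_program.py would run this on the CPU, and legate --gpus 4 my_program.py would run the same file on four GPUs.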
And if I have multiple nodes, I basically just run the same command on many nodes, and it takes care of communicating between them and of all the data shuffling, data loading, and partitioning. You don't have to think about any of that. In this case I have an example running with Slurm: srun -n 10, so 10 nodes (you can add more flags if you want), then the legate command with the GPUs flag and my program. In this case it runs on 80 GPUs.
Importantly, you don't need to use GPUs. We also have CPU support, so you can, for example, distribute over 10 CPU nodes as well. That's something we don't advertise as much, but I think it's pretty cool. A lot of people don't have access to thousands of GPUs, but sometimes they do have access to thousands of servers, for example in AWS. That's something we can scale on as well, and once you get access to GPUs you can use them for everything, or for just one part of your pipeline. People usually do that for the machine learning part, the XGBoost-style training, using Legate Boost.
Before we continue with the other libraries, I want to make clear what these libraries are and what they are not, because they sound complicated and people tend to get a little confused. cuPyNumeric is not a drop-in replacement for NumPy, so don't expect everything to work 100% the same, especially when you go distributed. On a single node, things mostly stay the same. You can use matplotlib, you can pass arrays to scikit-learn, and we have decent support for the array API even if it's not complete. But when you go distributed, because of the way the distributed engine works, you won't be able to just pass a cuPyNumeric array to matplotlib and plot it. That's not going to work 100%, because it's a distributed array.
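A minimal sketch of handing a result back to host-side NumPy before giving it to a library that expects plain arrays. It assumes that numpy.asarray accepts a cuPyNumeric array and gathers it into host memory; that behaviour and the helper name column_means_on_host are assumptions for illustration, not something stated in the talk. The fallback to NumPy is only there so the sketch runs without cuPyNumeric installed.

```python
import numpy as np  # host-side NumPy, used for the hand-off

try:
    import cupynumeric as cn  # second import under its own name
except ImportError:
    cn = np  # fallback so the sketch runs without cuPyNumeric


def column_means_on_host(rows, cols):
    """Compute with `cn`, then return a plain NumPy array for host-only code."""
    data = cn.arange(rows * cols, dtype="float64").reshape(rows, cols)
    means = data.mean(axis=0)
    # Assumption: np.asarray() gathers a cuPyNumeric array into host memory.
    return np.asarray(means)


host = column_means_on_host(4, 3)
print(type(host).__name__, host.tolist())
```

The array returned here is small, which is the usual situation for plotting: reduce on the distributed side first, then convert only the summary.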
You're not going to be able to plot that natively. You have to convert it to a NumPy array, which is something a lot of people do. But cuPyNumeric does implement the NumPy API. The high-level APIs that people really use day to day, we implement all of those, so most of the time you just change the import, if that's what you want to do. I actually recommend keeping both: keep the NumPy import, import cuPyNumeric under another name such as cn or cnp, and use both in the program you're developing.
It is also not CuPy. The names are similar, cuPyNumeric and CuPy, but as I said before, CuPy is single-GPU and it's a completely different project. CuPy is not owned by NVIDIA. cuPyNumeric is an NVIDIA project; we support its development and the whole stack below it. Like I said, it has array API compatibility. That's a Python and NumPy API standard that many of the libraries built on NumPy, like scikit-learn and others, use as the way to pass data from NumPy to the compute they do. We don't have full compatibility with that yet, but we will in the future.
Legate DataFrame is not like the drop-in replacements from the RAPIDS team that some of you might know. It's not like cudf.pandas and it's not like cudf-polars. It is its own implementation. It does use cuDF behind the scenes, of course, but it's an implementation of distributed data frame compute that works on CPU and works on GPU using cuDF. Right now we also have a Polars API. The default API is more like cuDF, and it's more verbose. It's not the nicest one, but now we have a Polars implementation as well. We'll look at an example.
Legate Boost is boosting algorithms implemented on top of Legate. Hopefully that's kind of clear.
You can think of it as an XGBoost or LightGBM replacement that is multi-CPU, multi-GPU, and every combination of those. All the libraries in Legate are CPU and GPU native. Again, that's really important. We talk a lot about GPUs because that's where you get a lot of the performance improvement, but like I said, this also works on CPU, and I think that's kind of nice.
You get to keep your NumPy, pandas or Polars, or XGBoost mental model. You're basically executing every single line of code, so execution happens the way you expect. It's not like Spark, which is lazily evaluated: that can sometimes optimize a little more, but the execution isn't what you'd expect from a regular Python script.
It is Python, which is important, because some people get confused. There are other languages that claim to be Python but actually are not. They tend to say, "we optimize for CPU and GPU and distributed," but then they have a different syntax and only mimic some of the Python APIs. This is an actual Python library that you install, and it uses the same Python runtime you already use. You install it from conda or PyPI as you would any other library. Of course there's a C++ library behind it, many libraries actually, but it's actual Python, not like the other languages out there.
The other important thing is that it's optimized for scale. If you have a dataset that you know will never grow beyond a couple of gigabytes and you can process it entirely on your workstation, you might not need this. It's going to run, we support that, and we do a lot of examples that way, but it's optimized for scale. It's optimized for thousands of GPUs. You don't have to get to thousands, but many-GPU scale is where it shines. I mentioned this briefly.
I can start on my laptop with a single CPU. That's where I do my local development: import cuPyNumeric, load my HDF5 files, do my data munging, or import Polars and read my Parquet files, maybe developing against a small subset of the data. If I have a DGX machine, say because I got lucky in the GTC DC raffle, I can run it there. I have, for example, two 4090s that NVIDIA provided to me in my local workstation, so I have two GPUs and I can distribute work between them very easily. That's the second part, accelerated work, actually using GPUs. Then when you go to multi-GPU servers, like a DGX machine with many H100s or Grace Hoppers, you can run the same code with a slightly different execution command. And if you have access to a supercomputer, you can scale, like I've said many times, to hundreds and thousands of GPUs and hundreds or thousands of nodes.
Before we go to the other libraries, let's talk a little about how the stack works. Hopefully most of the time you won't need to see any of this, but things happen, and every now and then you'll see a Legion error or a Realm error, in the same way that with Spark you're going to see a JVM error and there's nothing you can do about it. Once you go to distributed, multi-node runs, you're going to see these kinds of things.
Legate is our composability layer for developing all the libraries. cuPyNumeric, Legate DataFrame, and the rest are built on top of Legate. Legate uses Legion for parallelism. Legion is the task-based execution engine, similar to Spark in that it works with task graphs. Realm is the layer that actually executes.
It takes the chunks of data it is handed and executes them on either CPUs or GPUs, so it's agnostic to those things. And again, we can target heterogeneous architectures, from single nodes to supercomputers.
Then we have many libraries implemented on top of Legate: cuPyNumeric, Legate Sparse for SciPy-style sparse arrays, Legate DataFrame, and Legate I/O, which is very important because it covers how you read and write many files and file formats.
A few more details, though I don't want to spend much time here, on how the tasks actually get executed. You have your script, you import these libraries, and you write your code. Legate then takes care of converting those API calls into tasks, and the Legate runtime optimizes them. You've seen this a million times with Spark and Dask. There's a small difference here: we don't optimize the whole graph. We optimize basically layer by layer, for example from the A nodes to the B and C nodes on the slide, because we're actually executing as the program goes. That means we sometimes can't do the most optimal thing for some data movement, but we can sometimes get around that with query optimizations, from Polars for example. So there are many ways to optimize these things, and you can give Legate hints about the layout of your data to help it.
Okay, so Legate DataFrame. Hopefully the name gives away what it is: a data frame implementation on top of Legate. This is the base implementation of the API. At the very beginning I import the Parquet read and Parquet write functions, because I'm going to read Parquet files and write Parquet files.
I import a join operation, and I import some binary operations and date manipulation operations. What I'm doing here is a join of two Parquet tables, so it's very simple. This API mimics the cuDF API a little. Things like the binary operator and the extract-timestamp-components call are not the nicest APIs, but like I mentioned, we now have a Polars API.
This is the same code I had before. Because it's just a join, it doesn't change much, but you can imagine that as queries get more complex the Polars API is nicer to use. I import regular Polars and then I import our Polars execution engine, which basically hooks into it. At the end, the only thing I change is that instead of calling collect on the processed data directly, I call the Legate version of collect, so that Legate executes that particular query.
We don't support the full Polars API, but we support a lot of it. We're able to run all but two of the TPC queries, and once we can run all of them we'll probably publish a blog post announcing that, with some benchmark numbers. But today you can already do a lot with this API. It gets a little trickier with the streaming functionality, which is the out-of-core functionality I'll talk about later, but we'll eventually implement everything and do more benchmarks, like TPC-H and TPC-DS.
Again, very important: the code stays exactly the same. You only change the way you call it. In local development I just do python program.py and it runs on my CPU, like it would with pandas. The same script name appears in the two middle boxes on the slide. If I only have one GPU, I can do legate --gpus 1. Actually, I can just leave the flag off and it will take all the GPUs it finds and run my program on them.
I can also say, for example, "only use 50% of my GPU's VRAM." If the data doesn't fit it will crash, but it will respect that limit. Same thing for single-node multi-GPU: you don't have to change anything, just legate --gpus 8, and it will use all the GPUs to distribute the compute. So if your data doesn't fit on one GPU, you can use more GPUs. When you get to a multi-node cluster, you need something to launch the job. I have an example for Slurm, but we support other things, like Kubernetes; we'll talk about that at the very end. I can tell it to run on 10 nodes with 8 GPUs each, 80 GPUs in total, and it will execute the code on those. Very nice.
One thing I don't mention here, and probably should have, is multi-CPU support. Again, if you don't have GPUs, you can run multi-CPU, and I think that's pretty cool. As far as I know there is currently no open-source version of distributed Polars for CPUs. I think they have one in the private offering they have in the cloud, but in open source there is no way to distribute Polars across CPUs. It's one of the nice things we don't usually talk about much, but you can scale that pretty quickly using Legate DataFrame. In some ways you can also think of it as a distributed Apache Arrow execution engine, which is pretty cool. Everything here is open source, of course, and you can download it today.
I mentioned that if I restrict the VRAM on my GPU and I have too much data to load, it's going to crash. That's not ideal, but it happens. There are a lot of large datasets.
Most of the time I don't have enough GPU memory to load those datasets. That's where the streaming mode, or out-of-core mode, comes in. If you're familiar with Dask and Spark, it works in a similar way. The code inside stays the same. The only thing I need to change is to wrap it in a Python context manager, a with block with a scope and a parallel policy, where I enable streaming and over-decompose my computation, in this case by 64. If I have very big Parquet tables, it chunks those tables, respecting the row groups and so on, does the join, and at the end writes an HDF5 file. Chunk by chunk it reads the Parquet, does whatever it needs to do the join correctly (sometimes it has to recompute things), and writes an HDF5 file representing the array, which I can load later in Legate Boost. I'll show that in just a moment.
That is how you can process a lot of data on a single GPU or on multiple GPUs. The more GPU memory you have the better, but if you don't have enough to load your terabytes and terabytes of Parquet files, you can use this streaming mode. It's new functionality that we developed with some of our customers who do this.
One caveat: the closer your setup is to a regular CPU deployment, the less benefit you'll get. Like I said, these libraries are optimized for scale. They're optimized for loading as much data as you can into GPU memory and executing very quickly on it.
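To make the chunk-by-chunk idea concrete, here is a conceptual sketch in plain Python. It is not the Legate DataFrame API or its implementation: it only shows the general shape of an out-of-core join, where the large table is read in fixed-size chunks and each joined chunk is produced before the next one is read. The function names and the sample rows are made up for illustration.

```python
from itertools import islice


def read_in_chunks(rows, chunk_size):
    """Yield lists of at most chunk_size rows (a stand-in for row groups)."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


def streaming_join(big_rows, small_rows, chunk_size):
    """Inner-join (key, value) rows chunk by chunk, so only one chunk of
    the big table is held in memory at a time."""
    lookup = {}
    for key, value in small_rows:
        lookup.setdefault(key, []).append(value)
    for chunk in read_in_chunks(big_rows, chunk_size):
        joined = [(k, v, w) for k, v in chunk for w in lookup.get(k, [])]
        yield joined  # a real pipeline would write this chunk to disk


orders = [(1, "a"), (2, "b"), (3, "c"), (1, "d"), (4, "e")]
customers = [(1, "x"), (3, "y")]
chunks = list(streaming_join(iter(orders), customers, chunk_size=2))
print(chunks)
```

This sketch keeps the small table fully in memory, which is the simplest case. Joins where neither side fits are harder, which is likely why the talk mentions that some work occasionally has to be recomputed.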
But the last run I did processed, I think, 17 billion rows of Parquet. I don't remember 100% whether it was one GPU or eight GPUs, but it was a single node. There's no way that fits on a single node, and I was able to process it in 10 minutes. That's pretty impressive, and it included training an XGBoost model, not just doing the join. So it was both things.
Okay, Legate Boost. Hopefully the name gives away what it is: boosting algorithms implemented on Legate. Let's go back to the question I asked at the very beginning about how many lines of code. You don't need to read this code; it's intentionally small. How many lines of code do you need to change today to scale XGBoost training? XGBoost is the most popular library, and it's one we support at NVIDIA through the RAPIDS team. On the left we have the single-GPU case, and that's super easy. You just add the parameter device="cuda" and it will use the GPU. If you have multiple GPUs, you write cuda: followed by the index of the device you want to target, cuda:0, cuda:1. That's still single GPU.
What if your data doesn't fit on your GPU, like we saw before with streaming? That's what the XGBoost external memory feature does, which I think they introduced in 3.0.
You have to write an iterator, chunk the files you're reading, and pass the chunks to XGBoost. It definitely works; I just had to write a lot more code. I have "..." in many places here, so it's not even the full code, but I had to add many lines compared with my first example. That's for streaming, or out-of-core, and it's still single GPU. If I want multi-GPU on a single node, I need to add even more code. Some of that code isn't shown here either, so you'd add these lines and more that I didn't complete. The way this works in XGBoost is basically multiprocessing, and there are a lot of examples of it. If you want to go multi-node, you can use something like Dask, but with Dask one of the requirements today is that the input you feed to XGBoost has to fit in distributed memory, so there is no out-of-core. You can see there are many ways to do this in XGBoost, and you need to learn all of them, develop all of them, and debug all of them.
In Legate Boost there's only one way to do it, and the only thing you change is how you execute it. This is the only code you need, and it basically mimics the scikit-learn API. I generate some random data and pass it to the regressor. I don't show all the variables here, but that's the idea: import Legate Boost and pass the data in.
I have some benchmark numbers on scaling. This is weak scaling of just Legate Boost. We have 20 million columns per GPU, and we go from 8 GPUs to 1,024 GPUs. You can see that for most base models the weak scaling is roughly linear.
At the end it kind of explodes, which makes sense because of all the communication happening on the cluster. But again, the only thing you change is how you call the code. Single CPU: python with the training script. Single GPU: legate --gpus 1. Single node, multi-GPU: --gpus 2. For the multi-node version you can use Slurm, which is the one I use the most, but you can do it in many ways. You just call the script multiple times and it communicates and works out of the box.
Then I have a little code here showing what a full analytics pipeline looks like. The libraries are nice on their own, but they're also composable; they're all developed to work together. For example, I can load data using Legate DataFrame (this is the join I've shown a couple of times, so you don't have to read it again), convert the result to cuPyNumeric arrays, and pass those arrays to Legate Boost. It's the same way I could read with pandas, convert to NumPy, and pass the NumPy arrays to XGBoost. We want to match the mental model that people know and like from Python and the PyData ecosystem today.
Same thing for a staged pipeline. Say I have a lot of data that I only need to preprocess once, and then I want to train multiple XGBoost or Legate Boost models on it. I can use Legate DataFrame like I showed to read the data, join, filter, group by, whatever I want, write HDF5 files as an intermediate output, and then do multiple training passes and train many, many models.
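The launch commands quoted throughout the talk can be collected in one small helper. This is only a sketch: the command spellings (python, legate --gpus N, srun -n N) are reconstructed from the spoken transcript and should be checked against the Legate documentation, and the script name training.py and both function names are hypothetical. Multi-node CPU runs are left out because the talk does not spell out the flags for them.

```python
def launch_command(script, gpus=0, nodes=1):
    """Build the launch line for a script, following the talk's examples."""
    if gpus == 0:
        if nodes > 1:
            # The talk mentions multi-node CPU runs but not the flags for them.
            raise ValueError("multi-node CPU flags are not given in the talk")
        return f"python {script}"  # plain local CPU run
    cmd = f"legate --gpus {gpus} {script}"
    if nodes > 1:
        cmd = f"srun -n {nodes} {cmd}"  # Slurm example from the talk
    return cmd


def total_gpus(gpus_per_node, nodes=1):
    """GPUs used overall when every node contributes gpus_per_node."""
    return gpus_per_node * nodes


for gpus, nodes in [(0, 1), (1, 1), (2, 1), (8, 10)]:
    print(launch_command("training.py", gpus, nodes), "->", total_gpus(gpus, nodes), "GPUs")
```

The last case reproduces the arithmetic from the talk: 10 nodes with 8 GPUs each is 80 GPUs, with the training script itself unchanged across all four launches.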
Okay, deployment. I keep saying the code stays the same, but you do have to think about deployment, about how you actually distribute the execution. Deploying all of this multi-node requires MPI for now. That's a dependency we use for the very early initialization of the processes. We're working on removing it, and that will come in the future.
I showed Slurm, and a lot of people tell me they love Slurm. Actually, I don't think anybody has ever told me they love Slurm. People use Kubernetes, and they want to run in the cloud. So we have Docker containers, Python packages, conda packages, everything, so you can build this yourself and deploy it. On Kubernetes today, for example, you need the MPI Operator, which is not ideal, and I get it. That's why we're trying to remove that dependency. Eventually we'll have more of an IP-and-port, scheduler-and-client architecture. Right now you need MPI, mostly because most of our customers are HPC customers who already have it, so it's not that hard for them.
You can use this today in the cloud. Coiled is a cloud deployment company. They started with Dask deployments but do many things now, and you can run this using their MPI command, so you can do coiled mpi run legate. They basically have a VM with all the dependencies, they take care of setting it up, and it runs in your AWS account. I know a lot of people run in the cloud, and you can do this. One nice thing: I actually ran multi-CPU, multi-node Polars this way. I'm super excited about it, because I always wanted a multi-CPU thing for data frames that was easy to use. That runs in the cloud today using Coiled, and everything else works too, of course, Legate Boost and so on.
Okay, we had a couple of other sessions. We talked a lot about cuPyNumeric in the previous session, so if you missed that, it has more detail on cuPyNumeric and the use cases we've been able to optimize with it. We had the CUDA Python kernel training, so if you were there, thank you. I think you can catch those later as well. We have Connect with the Experts, I think right after this session, so I'll be there if you have more questions about this or any CUDA Python questions. We also had the CUDA 13 features session, which I think was today as well. If you missed it, it's really good for seeing all the new core CUDA work.
How do you get started today? You can just conda install it. That's the easiest way. You add a channel, so -c legate, and all the libraries are there: cuPyNumeric, Legate Boost. It's the easiest way because it includes the CUDA toolkit and everything else in that environment. We also have PyPI packages, so you can pip install nvidia-cupynumeric and it will be there. Again, you can just run with legate --gpus 1, so if you have one computer with one GPU you can try this, and let me know how it goes. We have documentation for all the libraries as well.
I just want to invite everyone to join us in accelerating Python computing. We believe it's really important to take the ecosystem that people like, where people are running a huge number of jobs these days with the libraries they like, and accelerate it across the whole broad spectrum: single CPU, multi-CPU, GPU, multi-GPU, multi-node multi-GPU, and all the combinations you can think of. So try it, and join us on this journey. Thank you.
>> Hey, quick question on the out-of-core training, or out-of-core processing.
>> Yeah.
>> Is this only supported on the Grace Hopper and Grace Blackwell architectures, or on regular Blackwells too?
>> Yeah, the regular ones. You don't need Grace Hopper to do that. I think unified memory helps a lot if you have something like that, but you don't require one of those. I do this on my regular x86 chips and GPUs.
>> Cool.
>> Thank you, Daniel, for such a wonderful session today.
Original Description
Learn how to seamlessly scale your Python data analytics workflows from laptop to supercomputer using NVIDIA cuPyNumeric and Legate Boost. We'll demonstrate how cuPyNumeric implements the NumPy API and allows you to scale your workloads across multiple nodes and GPUs, while Legate Boost enables distributed computing for popular analytics libraries.
Speakers:
Daniel Rodriguez, Sr. Technical Product Manager, NVIDIA
Watch more: https://www.nvidia.com/en-us/on-demand/