NVAITC Webinar: Efficient Data Loading using DALI

NVIDIA Developer · Intermediate ·📰 AI News & Updates ·5y ago

Key Takeaways

The video discusses using the NVIDIA Data Loading Library (DALI) to accelerate deep learning applications by implementing efficient data loading pipelines, and demonstrates how to use DALI with various deep learning frameworks such as PyTorch, MXNet, and TensorFlow.

Full Transcript

Hi, good morning, and welcome to this session. I'm Giuseppe Fiameni and I will present the NVIDIA Data Loading Library. I'm the lead engineer of the NVIDIA AI Technology Center in Italy, and I will introduce how to improve the data loading pipeline of your deep learning workflows and why this topic has become so important over the years.

Deep learning owes its success only partially to the architecture of the networks. The other reason is the availability of high-quality data. To achieve higher accuracy, networks rely on large datasets, also to avoid overfitting, which is the tendency of a network to model the training dataset too well and lack generalization. Furthermore, application domains that do not have access to much data, such as medical imaging, need to use data augmentation techniques to enhance the size and the quality of their datasets.

However, working with very large datasets is not that simple, as different challenges may exist. Take this typical deep learning workflow: it starts from the data stored on the hard drive, continues with the preprocessing phase, and eventually concludes with the training. In this case the goal of the preprocessing is to constantly feed the model with data. But this workflow has changed a lot over the years. Initially, before the advent of GPUs, most of the time was spent in the training phase. The preprocessing part was not a problem, as CPUs were relatively slow while processing data. The availability of accelerators, and the capability to move data from the CPU to the GPU via CUDA, dramatically changed the scenario and provided a significant boost in performance. From then on, the main factor determining overall performance was the time spent moving the batch to the model. Under these new conditions, the overall performance is linked to moving the data from the hard drive to the model fast enough to keep the GPUs busy. In this case you can see two GTX 580s, exactly the cards that were used to train AlexNet.

The situation became even more critical when networks could be trained across multiple GPUs or multiple nodes. With respect to the preprocessing phase, the training of a large network became shorter and shorter, to the point that the batch preprocessing time became a big bottleneck of the entire workflow. Consider also that normal pipelines include several steps. Here I'm representing two very common networks, AlexNet and ResNet-50, and you can see that there are a lot of tasks in your data loading pipeline to manipulate your data before training the network. You might go through a resizing of the image, an augmentation of the colors, a cropping, a mirroring. These steps usually take place on the CPU, so in most cases we have a complex I/O pipeline.

If you take those pipelines and networks and start doing some performance tests, although the models are capable of scaling linearly across multiple GPUs, what you get in practice is less performance. You don't really reach the peak performance, because you are behind the capability of the system: the systems are not efficient enough to feed the GPUs with enough data.

One of the main issues connected to this bottleneck is the ratio between the CPUs and the GPUs. On the left you can see a DGX-1 with V100 GPUs, and on the right a DGX-2 with 16 V100 GPUs. Just to give you an example, the ratio between CPU cores and GPUs is 5 in the case of the DGX-1 and 3 in the case of the DGX-2. This means that you have fewer CPU cores per GPU to build and fill the data pipeline. This is something you really have to take into consideration while training your networks, especially if you have a large dataset and a deep neural network.

To overcome the bottlenecks due to data loading, NVIDIA developed an open-source library called DALI. It is production-ready software that comes to the rescue for this kind of issue. The NVIDIA DALI library sits between the training data, so your storage, and the model. The goal of the library is to make the data loading pipeline efficient enough to keep the GPU busy and feed the data to the model. DALI can be used seamlessly within different deep learning frameworks, from MXNet to TensorFlow.

Before DALI was introduced, this was the data loading pipeline: the loading of the data, the decoding, the resizing, the augmentation, and then the training. The images go along the top path: you have the JPEG, then the decode, then the raw data, the resize, and the augmentation. These tasks were performed on the CPU, while the labels could go directly from the loader to the model. With the introduction of DALI, some of these tasks can also be executed on the GPU, and this has two main effects. The first is that you keep the data directly in GPU memory, where the training runs. The second is that some activities, such as resizing and augmentation, can be executed on the GPU without moving data back and forth between the CPU and the GPU.

As I was saying, it's very easy to integrate DALI into the most common deep learning frameworks, and I will show you some code snippets on how to do that. We will use PyTorch as the reference framework, which is the framework we are using for the NVAITC toolkit. This is a common data pipeline that you have to create. You just tell the system where the data is, what you want to do with the data, which kind of augmentation you want to apply, and whether you want to resize, crop, or do any other manipulation. Then, at the end, you define what the data pipeline is.

So there is an initialization phase where you define how the pipeline should be structured and which steps it should go through. You have to feed the pipeline with a file reader. Then you have to say whether you want padding, how to decode the data, for instance with the image decoder, and whether you want to resize, crop, or normalize, or maybe change the saturation of the channels of your images. Then you define the graph, just to tell the pipeline in which sequence the steps should come: first you read, of course, then you do the decoding, the resizing, and then the cropping. While initializing the pipeline you can also tell DALI: please perform this task on the GPU, if possible, rather than on the CPU. You can also ask for a mixed execution, so you start on the CPU and conclude on the GPU.

After defining the graph, you can build the pipeline. At this step DALI simply checks whether what you specified in the initialization phase can really be implemented: whether the data is in the right location, whether you can really crop, whether all the functionality you are asking for is available. You build the pipeline, and basically you tell DALI that the pipe is ready to be fed and to provide data to the model.

As I was saying at the beginning, you can define different pipeline operators. The first is the input of your data, and the CPUs play a very important role here, because reading the data from the hard drive into main memory is a task performed by the CPU. Then you can decode the data, for instance from JPEG format to raw, and this is an operation that can be executed partially on the CPU and partially on the GPU. While loading files from the hard drive and decoding them, you can tell DALI to resize the data. In this case we are using a triangular interpolation to resize. Then you do some manipulations and transformations, in this case cropping, mirroring, and normalization of the colors, for instance, or a coin flip.

Finally, you run the pipeline. What you basically do is create a pipe, attach an iterator to that pipe, and start iterating over the batches in your training loop. As soon as you iterate, DALI loads the data from the storage and feeds the model.

DALI also works in a multi-GPU environment. What does that mean? When you read the data from the storage, you have to tell DALI where to place each specific batch, and you can do that using the sharding technique. For instance, if you are using Horovod, each task takes a device ID to be identified within the pool of GPUs you are using. You can tell DALI: I am device number one, for instance, give me shard number one, and tell it how many shards there are. As you can see, if you have a dataset you might have four different shards, because, for instance, you are using four GPUs to run the training. According to the colors, each shard is assigned to a specific GPU. In this case shard zero, in green, is placed on GPU number zero, and so on: one to one, two to two, three to three. This is a technique to tell DALI how to distribute the data across your resources. Of course, you can implement more advanced techniques for data distribution. Sharding is the common, basic one.

What is also important with DALI is that you can do prefetching: you can start prefetching data while the training of your network is going ahead. Why is this possible? This is the common sequence of steps that you go through every epoch: batch preprocessing, the forward pass, the backward pass, the gradient reduction, and the weights update. Say this is step number zero, the first step you go through. While you are doing the forward and backward propagation, you can start prefetching data for the next step, for step x plus one. This is something DALI is capable of handling: it can start fetching the data from the storage to feed the next step. This is possible because the backward pass is executed on the GPU, while the initial step, for instance the loading of the data, can be executed on the CPUs, which at that specific moment are idle. What happens within DALI is that it creates different buffers, whose size you can tune, to build the prefetching queues. Training one, training two, and training three are the steps of your training loop. You can start batching for the next steps and create a sort of parallel pipeline for prefetching data.

But DALI can not only read images from the disk and batch them into tensors for the model. You can also perform various augmentations on the data to improve the results. I'm showing you a paper here that compared the performance across three different networks: Inception-v4, ResNet, and DenseNet. As you can see, the gray and orange areas show the accuracy you get with little or no augmentation, but if you use augmentation with both the training and the testing, you see the accuracy increase. The blue is the augmentation including training and test data, the orange is the augmentation with the training only, and the gray is without any augmentation steps. So augmentation is a technique that is very important, especially in those domains where not so many images or data are available, such as the medical field.

DALI supports different augmentations that you can build into your pipeline. In this case I'm presenting the random shuffle and the padding of the last batch. The random shuffle basically shuffles the data, given your batch size, as soon as you read it from the storage, and the shuffling is performed using a dedicated buffer. The pad last batch is a technique used when the whole training dataset cannot simply be divided by the number of shards or by the number of iterations. To avoid DALI creating empty buffers, with padding it is capable of duplicating data to fill up all the shards that you have. It is a sort of data augmentation technique that can also improve the accuracy of your model. On this slide you can see all the features DALI supports for augmentation. The list includes the sphere augmentation, the saturation, the color, the rotation, the jitter filter, the brightness, the flipping, and the contrast. These are all features that you can enable automatically while loading the data.

In terms of performance, this is a training of ResNet-50 across different boxes. Of course, the more data you have, the better performance DALI delivers, in the sense that if you have a small model with a not-so-big dataset, DALI will not provide any significant boost with respect to a normal loader such as torchvision. But the more data you have, and especially the more files you have to read from the system, the more benefit you will get from DALI. In this case you can see how many images per second you can get from the hard drive with a native pipeline, so without using DALI, and with DALI. The first two bars on the left are the DGX-2, where you have 16 GPUs available with a lower ratio between CPU and GPU. From the data loading point of view this is a very bad condition, because you don't have so many CPUs available to feed the pipeline. But in reality, if you use the DALI library you can overcome this limitation, and you can implement all the prefetching techniques and the mixed pipelines to run tasks across the CPU and the GPUs seamlessly. You can really get a factor of two in performance as a speedup.

Just to give you a recap of what DALI does: it's a fast data processing library for accelerating deep learning. It can create a pipeline to accelerate the way data is moved from the storage to the model. It can combine different tasks, including data augmentation, and of course it supports both the CPU and the GPUs, because you can decide which tasks should be executed on the CPU rather than on the GPUs. But the more tasks you are able to bring onto the GPUs, the faster you will go. It's flexible, in the sense that it supports different configurations and different operators. You can also create your own custom operators. For instance, if you have a dedicated dataset with a specific binary format, you can develop your own operator in C++, because DALI provides a C++ API to get the data from the storage. This is a very important extension point. It natively supports different data formats: RecordIO, TFRecord for TensorFlow, for instance, COCO, HEVC, and JPEG of course. And there is an API in Python and C++ to create and run your pipeline, but also to create your own specific data loader.

There is a lot of material available online, with a lot of examples that you can profit from: how to read the data in various frameworks, how to create custom operators, how to build the pipeline, and, more importantly, how to combine the multi-GPU aspect with the data loading and preprocessing phase. This is what we did in our NVIDIA AI Technology Center toolkit, which will be made available. We tried to combine all these features together and show how they can really bring a performance boost to your workloads. With this slide my presentation ends. Many thanks for listening, and please reach out with any questions. Thank you very much.
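
The sharding and last-batch padding described in the talk can be illustrated with a small pure-Python sketch. This is a simplified model for intuition only, not DALI's implementation: the function name `shard_indices` and its parameters are hypothetical, and the boundary formula (floor of `dataset_size * shard_id / num_shards`) is an assumption chosen to give near-equal contiguous shards.

```python
import math


def shard_indices(dataset_size, num_shards, shard_id, pad_last_batch=False):
    """Return the sample indices one shard (one GPU) would read.

    Hypothetical helper: splits the dataset into contiguous, near-equal
    ranges and, if requested, pads short shards by repeating their last
    sample so that every shard ends up with the same length.
    """
    start = dataset_size * shard_id // num_shards
    end = dataset_size * (shard_id + 1) // num_shards
    indices = list(range(start, end))
    if pad_last_batch and indices:
        # Every shard is padded up to the size of the largest shard.
        target = math.ceil(dataset_size / num_shards)
        indices += [indices[-1]] * (target - len(indices))
    return indices


if __name__ == "__main__":
    # 10 samples over 4 GPUs do not divide evenly, so padding matters.
    for gpu in range(4):
        print(gpu, shard_indices(10, 4, gpu, pad_last_batch=True))
```

With padding enabled, every GPU iterates over the same number of samples, so no worker runs out of data before the others in a synchronized multi-GPU training step.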

Original Description

Learn how to accelerate DL applications by implementing efficient data loading pipelines through the DALI library. Learn more: https://developer.nvidia.com/DALI


Learn how to use the NVIDIA Data Loading Library (DALI) to accelerate deep learning applications by implementing efficient data loading pipelines, and discover how to integrate DALI with various deep learning frameworks.

Key Takeaways
  1. Define the data pipeline with various operators and steps
  2. Tell the system where the data is and what you want to do with the data
  3. Apply augmentation and resize or crop images
  4. Define the graph to tell the pipeline in which sequence the steps should come
  5. Build the pipeline and check if it can be implemented
  6. Create a pipe and attach an iterator to it
  7. Start iterating over the batch in the training loop
💡 DALI can perform tasks such as decoding, resizing, and augmentation on the GPU, reducing the need to move data back and forth between the CPU and GPU, resulting in significant performance boosts for deep learning applications.
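
The steps above can be sketched with DALI's Python API roughly as follows. This is a minimal sketch under stated assumptions, not the code shown in the webinar: it assumes a recent `nvidia-dali` release, an NVIDIA GPU, and a `data_dir` laid out with one subdirectory per class. The function names, the 256/224 sizes, and the ImageNet mean and standard deviation values are illustrative choices. The DALI imports are deferred into the functions so the module can be loaded on a machine without DALI installed.

```python
# Illustrative normalization constants (ImageNet statistics on a 0-255 scale);
# an assumption, not values taken from the talk.
MEAN = [0.485 * 255, 0.456 * 255, 0.406 * 255]
STD = [0.229 * 255, 0.224 * 255, 0.225 * 255]


def build_train_pipeline(data_dir, batch_size=64, num_threads=4,
                         device_id=0, shard_id=0, num_shards=1):
    """Define the graph (read, decode, resize, crop) and build the pipeline."""
    from nvidia.dali import pipeline_def, fn, types  # requires nvidia-dali

    @pipeline_def
    def train_pipeline():
        # Step 1: file reader, with shuffling, sharding and last-batch padding.
        jpegs, labels = fn.readers.file(
            file_root=data_dir, random_shuffle=True,
            shard_id=shard_id, num_shards=num_shards,
            pad_last_batch=True, name="Reader")
        # Step 2: "mixed" decoding starts on the CPU and finishes on the GPU.
        images = fn.decoders.image(jpegs, device="mixed",
                                   output_type=types.RGB)
        # Step 3: resize on the GPU with triangular interpolation.
        images = fn.resize(images, resize_shorter=256,
                           interp_type=types.INTERP_TRIANGULAR)
        # Step 4: crop, random mirror (coin flip) and normalize in one operator.
        images = fn.crop_mirror_normalize(
            images, dtype=types.FLOAT, output_layout="CHW",
            crop=(224, 224), mean=MEAN, std=STD,
            mirror=fn.random.coin_flip(probability=0.5))
        return images, labels

    pipe = train_pipeline(batch_size=batch_size, num_threads=num_threads,
                          device_id=device_id)
    pipe.build()  # DALI validates the graph and allocates resources here
    return pipe


def make_iterator(pipe):
    """Attach a PyTorch iterator to a built pipeline."""
    from nvidia.dali.plugin.pytorch import DALIGenericIterator, LastBatchPolicy
    return DALIGenericIterator(pipe, ["data", "label"], reader_name="Reader",
                               last_batch_policy=LastBatchPolicy.PARTIAL)


def batches(data_dir):
    """Yield (images, labels) tensors for use inside a training loop."""
    loader = make_iterator(build_train_pipeline(data_dir))
    for batch in loader:
        yield batch[0]["data"], batch[0]["label"]
```

In a multi-GPU job, each process would pass its own `device_id` and `shard_id` along with the total `num_shards`, matching the sharding scheme described in the talk.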
