NVIDIA Nsight Feature Spotlight: GPU Trace

NVIDIA Developer · Intermediate ·📰 AI News & Updates ·6y ago

Key Takeaways

The video demonstrates the use of NVIDIA Nsight's GPU Trace feature for profiling and optimizing graphics applications, with a focus on collecting GPU performance data and analyzing frame-level execution.

Full Transcript

Welcome to the GPU Trace feature spotlight. NVIDIA Nsight Graphics is a feature-rich tool that developers can use to debug and profile their graphics applications. In addition to providing a frame debugger, a frame profiler, and the ability to save frames as a C++ capture, there is a powerful new low-level profiler: GPU Trace.

Let's talk about GPU Trace. GPU Trace is a tool that profiles live applications and gives a breakdown of the utilization of the various GPU units throughout the frame execution. It currently supports DX12 and Vulkan applications on Windows and Linux. GPU Trace takes advantage of a special single-pass counter capability. This capability was made possible by the Turing architecture, which is required in order to use it.

So how does GPU Trace work on our RTX architecture? GPUs are very complex and comprise many different hardware units that each have a specialized purpose. On NVIDIA GPUs there are performance monitor components for each major hardware unit, known as PMs. These PMs give us a good indication of each unit's throughput and utilization. While this ability has been there for a while, in Turing GPUs our architecture team expanded the capability, and we can now collect more of this data in a single frame. GPU Trace leverages this capability and collects data with minimal intervention in the application's execution, which makes it a low-overhead, non-intrusive, powerful profiler. So from the GPU side, GPU Trace tracks unit utilization and throughput. From the application side, we track synchronization objects, barriers, and the queue each command was executed on. Add to these the perf markers, and you've got a very accurate overview of the frame execution on the GPU and a breakdown of the GPU unit utilization throughout the frame duration. You have the option of profiling a single frame or multiple consecutive frames.

Let's get familiarized with GPU Trace. Once you've installed Nsight Graphics, the best practice is to create a new project so that all relevant settings are saved for later use. In the connection dialog, choose the GPU Trace activity and set the application executable path, the command-line arguments, and the environment variables, if applicable. You can set the number of frames that you want to profile. Let's leave the metric set at the default Throughput setting. We recommend keeping vsync off for real-time profiling and running with Lock Clocks to Base checked; this will enable you to better compare traces from different runs. Click Launch GPU Trace and the application will be launched. It is recommended to run your application in full-screen mode. Once the application is ready, press the F11 hotkey. From a remote machine, click the Generate GPU Trace Capture button to create a new trace.

In the trace file there are three areas of interest: the Timeline view, the metrics and information tabs, and the Ranges table. In the Timeline view you can see synchronization objects, barriers, actions, markers, and metrics information. The Summary tab shows the top throughput information, or you can switch to the Metrics tab to quickly search for a specific metric. The Ranges table summarizes all ranges by type and correlates the information with both the Timeline view and the Metrics tab. It is also possible to add user ranges. This information is stored in the trace file, which can later be shared with others.

Here is a trace of Wolfenstein: Youngblood before it was released. This title uses Vulkan for its graphics API. Let's observe this trace and examine it according to the Peak-Performance-Percentage analysis method, also known as the P3 method. The first thing to notice is GPU Active, which indicates the percentage of cycles in which the graphics or compute engine was active. If it is lower than 95 percent, it indicates that the GPU was fully idle more than 5 percent of the time, and hence it is recommended to switch first to Nsight Systems to see what on the CPU side is limiting performance. In this example, GPU Active for the frame is 99%, so we should continue with GPU Trace.

Let's examine the TraceRays range. GPU Active is 100 percent, so next we should observe the unit throughputs. The top unit is VRAM throughput, which is only 30 percent. That is very low and may indicate that performance is latency-limited by VRAM. To address that, we should reduce VRAM accesses by either increasing cache hit rates or using smaller texture formats. Note that on all NVIDIA GPUs, all VRAM traffic goes through the L2 cache, so a breakdown of what requests are made to VRAM from the L2 cache can really help to understand which changes are best for overcoming the VRAM limiter. In the Throughput metrics mode we do not have this information. The way to obtain this data is through the Advanced Mode metrics.

Let's examine Advanced Mode. So we know which range is limiting and which unit has a low throughput, but we're still not sure what to change in order to fix the issue. This is why we have Advanced Mode in GPU Trace. In this mode we capture several frames, each time with a different metric set. The additional counters collected give us a better indication of why the unit is performing so poorly. To activate it, simply choose the Advanced Mode metric set. Keep in mind that this is a longer operation, but you can also change the metric set while the application is running, so there is no need to relaunch the application.

Let's see what we discover when we switch to Advanced Mode in our Wolfenstein: Youngblood example. To capture a new trace using the Advanced Mode metrics, we need to open the connection dialog and set the metric set to Advanced Mode. If you kept your application running after the previous capture, you can also switch configs while the application is running and save the time of relaunching it. You may notice that this operation takes longer. Make sure not to move in the game, or freeze it if you can.

Let's observe the results. Here is the Advanced Mode trace of the game. We immediately notice additional sections in the Summary tab [section names unclear]. Since we saw that the VRAM throughput is low, we want to better understand the L2 cache breakdown. It can give us an indication of what we need to change in our application. The metrics that give this type of information are the L2 sectors metrics family. Those metrics show the proportion of L2 sectors per unit. From the given results in this example, the top one is the L2 sectors L1TEX read metric. This value means that 84.8 percent of the bytes transferred through the L2 cache originated from L1TEX reads. So we know that the best way to reduce the number of VRAM accesses is to reduce the number of read bytes requested by L1TEX. By observing the hit rates, we see that the L1TEX sector hit rate is 75 percent and the L2 read hit rate from L1TEX is 49.8%. This poor L2 hit rate implies that the L1TEX reads are thrashing the L2 cache, which typically happens because the working set size of the simultaneously executing L1TEX reads is much greater than the L2 cache size.

Fixing the issue: it turns out that the hit shaders of this ray tracing workload were fetching all 2D textures with the mip level hard-coded to zero. A well-known way to reduce the L2 working set size of 2D texture fetches is to use mipmapping, because only the mip levels that are accessed are resident in L2, and coarser levels occupy fewer bytes. Mipmaps were already populated by the engine, so all we needed to do was replace the hard-coded mip level 0 with a dynamic mip level. More information on the technique can be found in the blog.

Here is the trace taken after the fix. A good way to do a before-and-after comparison is by launching the Trace Compare tool. The easiest way to launch it is to choose the two files you would like to compare, right-click, and select Trace Compare. You can also specify which frames you would like to compare in case you traced multiple frames. The tool shows the frames one above the other and correlates the timelines, so if you select a specific marker, it automatically selects the corresponding marker in the other frame. In the metrics pane you can see the values and the absolute delta.

Back to our Wolfenstein example. In this example we have reduced the time of the TraceRays marker by 12%. The L2 read hit rate from L1TEX improved greatly, from 50% to 83%, and the L1TEX sector hit rate also improved slightly.

In conclusion, if you'd like to understand the performance limiters of a frame, you can use GPU Trace for that. Once you've determined that you are not CPU limited, you can use Advanced Mode to apply the P3 method and derive the main performance limiters of that workload. Thank you for watching the GPU Trace feature spotlight. You can download the latest Nsight Graphics from the NVIDIA Developer site and also visit the useful links below.
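The first two checks of the P3 method described in the transcript can be sketched as a small decision helper. This is only an illustration of the reasoning, not part of Nsight Graphics: the `RangeMetrics` container and the `p3_first_steps` function are hypothetical names, the values are typed in by hand from a trace, and the L2 and SM numbers below are invented placeholders (only GPU Active at 100% and VRAM throughput at 30% come from the video).

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class RangeMetrics:
    """Percentages copied by hand from one GPU Trace range (hypothetical container)."""
    gpu_active_pct: float
    unit_throughput_pct: Dict[str, float]  # unit name -> throughput in percent


def p3_first_steps(m: RangeMetrics, active_threshold_pct: float = 95.0) -> str:
    """Suggest the next step from GPU Active and the top unit throughput."""
    # Step 1: a GPU that is idle more than ~5% of the time points to a CPU-side limiter.
    if m.gpu_active_pct < active_threshold_pct:
        return "GPU idle: check the CPU side in Nsight Systems first"
    # Step 2: otherwise, find the unit with the highest throughput in this range.
    top_unit, top_pct = max(m.unit_throughput_pct.items(), key=lambda kv: kv[1])
    return f"GPU busy: top unit is {top_unit} at {top_pct:.0f}%, inspect it in Advanced Mode"


# TraceRays range from the video; L2 and SM values are made-up placeholders.
trace_rays = RangeMetrics(
    gpu_active_pct=100.0,
    unit_throughput_pct={"VRAM": 30.0, "L2": 25.0, "SM": 20.0},
)
print(p3_first_steps(trace_rays))
```

A low top throughput, as in this range, is what led the presenters to suspect a latency limit and to move on to the Advanced Mode metrics.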

Original Description

Check out our latest feature spotlight on GPU Trace, a new frame-level profiler for graphics applications within NVIDIA Nsight. With GPU Trace on #RTX, developers can collect GPU performance data in a single pass. - Download the latest version of Nsight Graphics: https://nvda.ws/3deeXrP - Visit Louis Bavoil’s blog for the Peak-Performance-Percentage Analysis: https://nvda.ws/2zQvLro

This video teaches developers how to use the GPU Trace feature of NVIDIA Nsight Graphics to profile and optimize graphics applications, with a focus on collecting GPU performance data and analyzing frame-level execution. By following the steps outlined in the video, developers can identify the main performance limiters of a frame and verify the effect of a fix by comparing traces. The video is particularly useful for developers working with DX12 and Vulkan applications on Windows and Linux with a Turing GPU.

Key Takeaways
  1. Create a new project in Nsight Graphics
  2. Choose the GPU Trace activity and set the application executable path, command-line arguments, and environment variables
  3. Set the number of frames to profile
  4. Launch the application and click the Generate GPU Trace Capture button (or press F11)
  5. Analyze the trace data in the Timeline view, the Summary and Metrics tabs, and the Ranges table
  6. Check GPU Active, then observe the unit throughputs and find the top unit
  7. Choose the Advanced Mode metric set to activate Advanced Mode
  8. Capture frames, each with a different metric set
  9. Reduce VRAM accesses by increasing cache hit rates or using smaller texture formats
💡 Replacing a hard-coded mip level 0 with a dynamic mip level reduced the L2 working set in the example: the L2 read hit rate from L1TEX rose from about 50% to 83%, the L1TEX sector hit rate improved slightly, and the TraceRays marker time dropped by 12%.
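
The mipmapping takeaway rests on simple arithmetic: each coarser mip level of a 2D texture has a quarter of the texels of the level before it, so fetching from coarser levels touches far fewer bytes. The sketch below illustrates this and restates the L2 hit rates from the video as miss rates. The function names, the 2048 x 2048 texture size, and the 4 bytes per texel are assumptions chosen for illustration; only the 49.8% and 83% hit rates come from the video.

```python
def mip_level_bytes(width: int, height: int, bytes_per_texel: int, level: int) -> int:
    """Size in bytes of one mip level of an uncompressed 2D texture."""
    # Each level halves both dimensions, clamped to at least one texel.
    w = max(1, width >> level)
    h = max(1, height >> level)
    return w * h * bytes_per_texel


def miss_rate_pct(hit_rate_pct: float) -> float:
    """Percentage of requests that miss the cache."""
    return 100.0 - hit_rate_pct


# Hypothetical 2048 x 2048 texture with 4 bytes per texel.
for level in range(4):
    print(f"mip {level}: {mip_level_bytes(2048, 2048, 4, level)} bytes")

# L2 read hit rates from L1TEX reported in the video, before and after the fix.
print(f"L2 miss rate before: {miss_rate_pct(49.8):.1f}%")
print(f"L2 miss rate after:  {miss_rate_pct(83.0):.1f}%")
```

The miss rates are a per-request view only. How much total VRAM traffic changes also depends on how many requests are issued, which this sketch does not model.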
