Memory Analysis with NVIDIA Nsight Compute | CUDA Developer Tools

NVIDIA Developer · Beginner ·⚡ Algorithms & Data Structures ·2y ago

Key Takeaways

This video tutorial introduces memory workload analysis for CUDA applications using NVIDIA Nsight Compute, covering the memory hierarchy, cache lines and sectors, and techniques to increase cache hit rates and reduce memory bottlenecks. It demonstrates how to read the memory chart, spot inefficient access patterns, and optimize a kernel for better performance.

Full Transcript

Welcome to this chapter of our CUDA developer tools series on Nsight Compute. Today we will take a look at the memory workload analysis. My name is Maximillian, and I am a developer technology engineer at NVIDIA.

GPU workloads often suffer from being heavily memory bound with ever-increasing compute, especially in content creation, where video frames are streamed through memory with sometimes little processing, like a color adjustment filter. The Memory Workload Analysis section of Nsight Compute helps guide us to achieve the full memory bandwidth of our GPU and to optimize data access patterns to increase the cache hit rate.

If you have not seen the introduction to Nsight Compute featuring the Speed of Light section, you might want to start with Bob's video first, which is linked in the video description. Besides that, I highly recommend watching the linked GTC presentation, which covers the memory hierarchy and memory transactions on an NVIDIA GPU in more detail than I will in this video. It is crucial to understand the way memory is requested and what cache lines and sectors are in the context of CUDA.

We will start with a theoretical walkthrough of the memory analysis chart, where we get an overview of hardware memory locality and memory type. The green elements denote logical units in software, for example texture memory with different ordering, or simple global and local memory that refers to raw memory. All physical units in hardware are colored blue. Arrow links between physical units denote the amount of bytes read or written across this connection. The color of the arrow provides some insight into how far we are from the peak throughput of this interconnect. With the drop-down menu we can also switch between labeling a connection with total bytes or with the percentage of peak throughput.

Upon a memory request inside our kernel, depicted on the left, we check if the requested sector behind an address is already present in L1. If not, we fall through to L2 and check for its existence there. If neither contains it, the sector is fetched from device memory. Our L2 cache is the point of truth for all streaming multiprocessors. What I mean by that is that all memory traffic has to go through L2, including PCIe copies or device memory writes. Only if a sector to be read is already present in L1 is the L2 cache not used. The L2 cache ensures that no SM reads invalid data from device memory that has not been written back, as it is shared across the whole GPU, while L1 memory is local to one streaming multiprocessor.

We can reduce the traffic between L2 and device memory by increasing the L2 hit rate, and we can consequently reduce the traffic between L1 and L2 with a higher L1 hit rate. Of course, traffic can also be reduced by changing the algorithm to simply require less data to be transferred. Those hit rates are important to ensure low latency of memory requests, as fetches from device memory take many cycles.

To understand more about the communication between the different types of memory and about how the hit rate is calculated, it is important to understand at what granularity the different units communicate. Cache lines are 128 bytes long in L1 and L2, but any communication on the GPU is done in sectors, whose size is only 32 bytes. While on newer hardware more than one sector can be communicated at once, a single sector is still the smallest communication possible in the memory system. The concept of cache lines is important because we cannot allocate single sectors in cache, but always full cache lines, into which we then read contiguous sectors.

As a practical example of why cache line allocation is important, we will assume a cache of only two cache lines. If we now read a picture in which each row is over 128 bytes long, we see the following: reading the first entry of the first row and the first of the second row, we allocate a cache line for each. If we now read the third row, we have to invalidate a cache line to read the next. We would not be able to reuse any cache lines. This essentially lowers our possible cache storage to 25%. If we read the same memory in rows rather than columns, we read a full cache line, then the next, and so on. After reading both full cache lines we again have to invalidate, but we were able to hold on to many more sectors, and therefore more values, much longer. Especially with wider prefetches that transfer two sectors at a time, we also improve our latency.

The previous explanation holds for both the L2 and the L1 cache. If we read 32 consecutive bytes in one SM, we would have a hit rate of 97% in L1. This would result in 0% in L2, because we sent only one request to L2, and after that has been executed all data is present in L1 and will be a cache hit on further requests. If, on the other hand, we have a kernel that reads values with a stride of 32 bytes, meaning only the first value within each sector, we get a 0% L1 cache hit rate. L2 will only have to allocate a new cache line for every fourth value when reading with a 32-byte stride, but it has to request data from device memory on each request if we assume no prefetch is happening in this case. The earlier we have a cache hit, the lower the memory latency is, with device memory clearly being the worst case because this memory is not on chip.

We have not talked about shared memory yet, and especially the improvements that it got with Ampere. Shared memory is per-SM memory and is effectively L1 memory, with the difference that this memory is not written back to device memory and is meant as a workspace memory. By requesting shared memory, one reduces the available L1 memory. This way data can be shared across threads in a block running on the same SM without going through L2. This is also an effective measure to manage the L1 cache manually if the desired caching is not achieved automatically. With Ampere, the load-global-store-shared instruction was introduced, which allows reading data directly into shared memory without going through unmanaged L1 memory first. Just a brief outlook: preloading data to shared memory is a commonly used pattern for sliding-window operators like a convolution. All threads in the same CTA, or also SM, are used to preload data to shared memory. After that, the data is accessed by all the other threads, and the weights are persistent in the L1 cache, for example.

This already brings us to the load and store address spaces used in our kernel, shown here as green boxes. The most basic loads are global and local loads, which load raw byte data. For surface and texture loads we can possibly see different indexing schemes, for example for Z-order textures. Furthermore, we see the shared load operation for Ampere and later GPUs that we talked about before. It is important to understand the difference between local and global loads and stores. Any local load and store is usually something we want to get rid of. The origin of these local operations is often register spilling, for which I'll have a link in the video description. This means the kernel is not able to keep its working data in the physical register file and has to write back to memory. This has very strong performance implications. Finding the origin of those local operations can be done using the address space column of the memory chart, or by using the Source page of Nsight Compute and looking for LDL and STL instructions. The usage of the Source page will be covered in a separate tutorial. Looking at it can be very helpful to identify the exact line of code that is responsible for this local access.

All absolute values that we see in the memory workload analysis chart are shown in the table below, which I will not go into detail about right now. Instead, I want to give some example usage of this section. The code for the following example can be found on GitHub and is linked in the video description.

We will profile this very simple program to convert an 8-bit PNG from RGBA to grayscale. What we do in our kernel is load four consecutive values, which are the RGBA values of one pixel, apply a multiplier to each, and add them up to a single grayscale float value. Then we downcast and save it back to our output buffer. For now I collapsed the optimized versions; please explore the code for those on GitHub.

First, let's make sure we have our theoretical values set. We are reading a four-channel uint8 picture of size 3,840 by 2,160 and writing at the same size, but only one channel of uint8. So we expect a load of around 33 megabytes and a store of around 8 megabytes. While Nsight Compute can give us a good measurement of how far we are from the CUDA device hardware limits, it cannot tell us if our processing approach is suboptimal. This is something we should always keep in mind.

I will use Nsight Compute's capability to profile remotely using an SSH connection. We will start by creating a new project and establishing our connection. I usually name my project with a %i in the name. That way Nsight Compute will simply enumerate the profiles I take, without overwriting an old file or me needing to remove an old file. To get the memory chart analysis we have to select either the detailed profile or the full profile. I recommend simply going with the full profile. Furthermore, I also like to import the source in the Other tab, as it allows me to track my source changes across my profiles in case I do not version them correctly. Now we can simply hit the Launch button. On the first launch it will upload all needed binaries, which can actually take some time. I've done that already, so our baseline will pop straight up.

Before we dive into our memory workload analysis, we can already see in the Speed of Light section that our memory speed of light is already at 80%. This usually indicates a very well performing kernel, but our compute speed of light is very low and, to me, seems lower than expected. Looking at the Memory Workload Analysis section, we see that the link to device memory is at very low peak utilization, judging by its color, but the link between L2 and L1 shows much more data flow than we would expect given our pointwise operation. The cache hit ratio for L2 also seems superb. But things are sometimes not as good as they look, which is why it is important to understand why we calculated the previously mentioned theoretical numbers. We want to know if we are really limited by hardware or if our algorithm is suboptimal. Nsight Compute can only show correct speed-of-light measurements if the algorithm used is limited by hardware. It cannot tell us if we are doing suboptimal or superfluous processing.

What is happening here is that we read 256 pixels vertically in one block, and each thread requests data for one pixel. Assuming an uninitialized cache, this leads to allocating a cache line in L2 and L1, then loading the sector of that line in which the pixel data is located from global memory. As we are processing vertically, we request 256 cache lines and 256 sectors, but we only read 256 × 4 bytes from those sectors, although for each of those 256 sectors we have to transmit 32 bytes. Even worse, the same is happening when storing our grayscale data. To store data we also request 256 sectors, store one byte to each, and write the complete sector back. We are actually lucky with our problem not to generate even more device memory requests from L2, because we have so little data that it fits in our L2 cache.

In this case Nsight Compute even provides warnings that indicate that our algorithm is not doing what it should do. They tell us very clearly that we read little of the data in our requested sectors and that we should try using more sectors. This kernel accesses on average only one sector out of the possible four sectors per cache line. This is exactly what we just talked about.

To compare results more easily, we will set the current unaligned report as our baseline. Changing the axis from columns to rows is possible for this processing by simply changing the block dimension in the kernel launch. I predefined some optimization levels, so let's just launch with another command-line argument, which takes care of this.

Now, looking at the aligned reads, we immediately see that we are about eight times faster. The Speed of Light section also indicates higher throughput with less of a gap between compute and memory. In the memory chart, our link between device memory and L2 is very close to peak performance. We load the expected amount of data from L2 to L1, and thinking about our cache hit rate, we also see what we expect: a 75% L1 hit rate, which is due to requesting all bytes of the neighboring pixels in the neighboring threads.

What else can we learn? Given the report, we know that the kernel is heavily memory latency bound, as we have to wait for device memory on each pixel's compute. In such cases it can be beneficial to process multiple values in one thread. Let's check what we gain if we process four values. After profiling, we see we did not gain much, if anything. Nonetheless, let's investigate the memory chart more closely. We see that the requests to global memory did not decrease, nor did the L2 to L1 traffic, which we also did not expect. But our L1 store to L2 increased and is not matching the expected 8 megabytes. Why is that the case? We are processing four pixels per thread now, but after every pixel we write one byte to our image. This needs to be flushed to L2, which artificially increases the throughput from L1 to L2, as with a uint8 we write so little data at a time. Threads that write to the same sector at the same time can actually be grouped together in one single write operation, but the wait period for this is very short.

This brings us to vectorized loads, which is the last iteration on this kernel. We can use one uchar4 to load all values for one pixel, but we would still have to write back a single uint8. Going with even wider operations ends up at reading one uint4, which can then process four pixels on one thread and write back as one uchar4. This is basically exactly what we did in the multiple-value case before. With both of these optimizations we have to make sure that we have enough work available, so that we do not reduce the number of blocks in our grid so much that we are no longer able to saturate the GPU.

After reprofiling and taking another look at the memory chart, we see that our write to L2 is back to the expected 8 megabytes. In addition, we have no pressure anymore on the load/store unit and cut our requests by over 90%. This can also be seen when comparing the compute Speed of Light section of our aligned report and the vectorized load, as we have to do less indexing math on the pointers. Nonetheless, the kernel did not get much faster, as we were already limited by device memory in the previous version. Assuming our initial case of reading in columns is the way we must process our data, this approach can also be very helpful to make each load a little less costly.

In conclusion, we can say that the best use of the Memory Workload Analysis section is to ensure that you load only as much memory as you really need, and that aligning read accesses with sector or even cache line boundaries is very beneficial. For alignment issues there will also be hints in the source section. This will help with identifying memory stalls together with source correlation, which is part of a later tutorial. As a follow-up on this video, I can only recommend the excellent documentation in our Kernel Profiling Guide, which has even more details on the Memory Workload Analysis section than I could ever cover in this video. Thank you very much for listening. I hope you learned a lot about the memory subsystem of an NVIDIA GPU and how to profile using Nsight Compute.
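The theoretical load and store sizes quoted in the transcript can be reproduced with a few lines of arithmetic. The sketch below is not taken from the video's sample code. The image size and the 32-byte sector size come from the transcript, while the "one full sector per access" model for column-wise processing is a simplifying assumption that ignores caching and write coalescing.

```python
# Traffic estimate for the RGBA-to-grayscale example described in the transcript.
WIDTH, HEIGHT = 3840, 2160   # image size quoted in the transcript
SECTOR_BYTES = 32            # sector size quoted in the transcript
RGBA_BYTES = 4               # four uint8 channels read per pixel
GRAY_BYTES = 1               # one uint8 channel written per pixel

pixels = WIDTH * HEIGHT

# Ideal traffic: every transferred byte is actually used by the kernel.
ideal_load = pixels * RGBA_BYTES
ideal_store = pixels * GRAY_BYTES

# ASSUMPTION (simplified model): with column-wise processing every pixel
# touches its own sector, so a full 32-byte sector moves per pixel
# for both the load and the store.
column_load = pixels * SECTOR_BYTES
column_store = pixels * SECTOR_BYTES

load_inflation = column_load / ideal_load
store_inflation = column_store / ideal_store

print(f"ideal load:  {ideal_load / 1e6:.1f} MB")    # 33.2 MB
print(f"ideal store: {ideal_store / 1e6:.1f} MB")   # 8.3 MB
print(f"column-wise load inflation:  {load_inflation:.0f}x")   # 8x
print(f"column-wise store inflation: {store_inflation:.0f}x")  # 32x
```

Under these assumptions the column-wise kernel moves eight times more load data than it uses, which is in line with the roughly eightfold speedup reported for the aligned version. A real profile will differ, since it also reflects caching effects.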

Original Description

This tutorial video introduces memory workload analysis for CUDA applications with NVIDIA Nsight Compute. Memory bottlenecks can limit the performance of your GPU. This is especially true for content creation and other workloads that involve large amounts of data quickly streaming through memory. Use Nsight Compute memory workload analysis to maximize GPU memory bandwidth and optimize data access patterns.

Highlights of this video tutorial include:

Memory analysis chart:
▫️ This chart visualizes hardware memory locality and memory type, including the amount of read or written bytes between physical units.

Overview of caches:
▫️ Memory requests in the kernel follow a hierarchy. L1 is checked first, then L2, and if the sector is not found, it is fetched from device memory.

Optimizing caches:
▫️ Cache line allocation is crucial for optimal performance, ensuring efficient use of cache storage and reducing memory traffic between L1, L2, and device memory.

Live demonstration:
▫️ Walkthrough optimizing a simple CUDA program that converts 8-bit PNGs from RGBA to grayscale. We inspect the impact of aligned reads and vectorized loads on memory efficiency.

Interpreting memory analysis:
▫️ Key tips for how to read memory profiles to address the balance between hardware limitations and algorithmic efficiency.

0:00 - Introduction
0:58 - Memory Chart
3:51 - Cache Line Allocation
4:56 - L1 and L2 Cache
7:15 - Load and Store Address Spaces
8:48 - Sample Code
9:56 - Memory Workload Analysis
12:02 - Reading RGBA Values
13:08 - Aligned Loads
15:54 - Vectorized Loads
17:32 - Conclusion

Important resources:
▫️ Introduction to NVIDIA Nsight Compute: https://www.youtube.com/watch?v=Iuy_RAvguBM
▫️ SOL Analysis with NVIDIA Nsight Compute: https://www.youtube.com/watch?v=uHN5fpfu8As
▫️ Memory Management on Modern GPU Architectures: https://resources.nvidia.com/gtcd-2020/GTC2020cwe21754?lx=3X9y6T
▫️ Sample code: https://github.com/NVIDIA/nsight-training/tree/master/cuda/nsight_c
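The L1 and L2 hit rates discussed in the video (97% in L1 for 32 consecutive bytes, 0% for a 32-byte stride) can be illustrated with a small sector-granular model. This is a sketch, not how the hardware or Nsight Compute computes the metric: it assumes unlimited cache capacity, no prefetching, and one request per byte address. The helper names `hit_rate` and `two_level` are hypothetical.

```python
SECTOR_BYTES = 32  # sector size quoted in the transcript


def hit_rate(addresses, sector_bytes=SECTOR_BYTES):
    """Return (hit_rate, missed_addresses) for a sequence of byte addresses.

    ASSUMPTIONS (not from the video): unlimited capacity, no prefetching,
    and every address is issued as its own request.
    """
    addresses = list(addresses)
    resident = set()   # sectors already fetched into this cache level
    hits = 0
    misses = []
    for addr in addresses:
        sector = addr // sector_bytes
        if sector in resident:
            hits += 1
        else:
            resident.add(sector)
            misses.append(addr)
    rate = hits / len(addresses) if addresses else 0.0
    return rate, misses


def two_level(addresses):
    """Model L1 followed by L2: only L1 misses are forwarded to L2."""
    l1_rate, l1_misses = hit_rate(addresses)
    l2_rate, _ = hit_rate(l1_misses)
    return l1_rate, l2_rate


# 32 consecutive one-byte reads: one miss, then 31 hits in L1.
consecutive = two_level(range(32))
# 32 reads with a 32-byte stride: every read lands in a new sector.
strided = two_level(range(0, 32 * SECTOR_BYTES, SECTOR_BYTES))

print(consecutive)  # (0.96875, 0.0)
print(strided)      # (0.0, 0.0)
```

The consecutive case gives 31 hits out of 32 requests in L1, about 97%, and a single L2 request that misses. The strided case misses on every request at both levels, matching the numbers given in the video under the same no-prefetch assumption.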

This video tutorial teaches how to use NVIDIA Nsight Compute for memory workload analysis and optimization, covering key concepts such as the memory hierarchy, cache lines and sectors, and memory access patterns. By following the steps outlined in the tutorial, developers can improve the performance of their CUDA applications by reducing memory bottlenecks and increasing cache hit rates.

Key Takeaways
  1. Create a new project in NVIDIA Nsight Compute
  2. Establish an SSH connection
  3. Select the full profile (the detailed profile also includes the memory chart) and launch
  4. Enable source import in the Other tab
  5. Track source changes across profiles
  6. Analyze memory workload using NVIDIA Nsight Compute
  7. Identify memory stalls and alignment issues
  8. Optimize kernel development for better performance
💡 Memory workload analysis is crucial for optimizing the performance of CUDA applications, and NVIDIA Nsight Compute provides a powerful tool for identifying memory bottlenecks and improving cache hit rates.
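
The two-cache-line example from the video, where column-wise reads reduce the usable cache storage to 25%, can be sketched with a toy cache model. The 128-byte line and 32-byte sector sizes come from the transcript. The fully associative LRU policy, the 256-byte row pitch, and the function name `line_utilization` are assumptions made for illustration only.

```python
from collections import OrderedDict

LINE_BYTES = 128     # cache line size quoted in the transcript
SECTOR_BYTES = 32    # sector size quoted in the transcript
SECTORS_PER_LINE = LINE_BYTES // SECTOR_BYTES


def line_utilization(addresses, num_lines=2):
    """Fraction of the cache's sectors holding fetched data after the accesses.

    ASSUMPTIONS: fully associative cache, LRU replacement, and sectors
    are fetched only on demand (no prefetching).
    """
    cache = OrderedDict()  # line tag -> set of valid sector indices, oldest first
    for addr in addresses:
        tag = addr // LINE_BYTES
        sector = (addr % LINE_BYTES) // SECTOR_BYTES
        if tag in cache:
            cache.move_to_end(tag)         # mark line as most recently used
        else:
            if len(cache) == num_lines:
                cache.popitem(last=False)  # evict the least recently used line
            cache[tag] = set()             # allocate a full line, all sectors empty
        cache[tag].add(sector)             # fetch only the requested sector
    valid = sum(len(sectors) for sectors in cache.values())
    return valid / (num_lines * SECTORS_PER_LINE)


ROW_PITCH = 256  # hypothetical row length in bytes, longer than one cache line

# Column-wise: first byte of rows 0..3, each in a different cache line.
column_wise = line_utilization(row * ROW_PITCH for row in range(4))
# Row-wise: every byte of the first row, filling two lines completely.
row_wise = line_utilization(range(ROW_PITCH))

print(column_wise)  # 0.25
print(row_wise)     # 1.0
```

In the column-wise case each allocated line holds one fetched sector out of four, so only a quarter of the cache stores data. Reading along the row fills every sector of both lines before anything has to be evicted.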
