Memory Analysis with NVIDIA Nsight Compute | CUDA Developer Tools

NVIDIA Developer · Beginner ·⚡ Algorithms & Data Structures ·2y ago

Key Takeaways

This video tutorial introduces memory workload analysis for CUDA applications using NVIDIA Nsight Compute, covering the memory hierarchy, cache behavior, and optimization techniques that increase cache hit rates and reduce memory bottlenecks. It demonstrates how to use Nsight Compute to analyze memory workloads, spot inefficient access patterns, and optimize kernels for better performance.

Full Transcript

Welcome to this chapter of our CUDA developer tools series on Nsight Compute. Today we will take a look at the memory workload analysis. My name is Maximillian, and I am a developer technology engineer at NVIDIA. GPU workloads often suffer from being heavily memory bound with ever-increasing compute, especially in content creation, where video frames are streamed through memory with sometimes little processing, like a color adjustment filter. The memory workload analysis section of Nsight Compute helps guide us to achieve the full memory bandwidth of our GPU and to optimize data access patterns to increase the cache hit rate.

If you have not seen the introduction to Nsight Compute featuring the speed of light section, you might want to think about starting with Bob's video first, which is linked in the video description. Besides that, I highly recommend watching the linked GTC presentation, which covers the memory hierarchy and memory transactions on an NVIDIA GPU in more detail than I will in this video. It is crucial to understand the way memory is requested and what cache lines and sectors are in the context of CUDA.

We will start with a theoretical walkthrough of the memory analysis chart, where we get an overview of hardware memory locality and memory type. The green elements denote logical units in software, e.g. texture memory with different ordering, or simple global and local memory that refers to raw memory. All physical units in hardware are colored blue. Arrow links between physical units denote the amount of read or written bytes across this connection. The color of the arrow provides some insight into how far we are from the peak throughput of this interconnect. With the drop-down menu we can also switch the labeling of a connection between the total bytes and the peak throughput percentage.

Upon a memory request inside our kernel, depicted on the left, we check if the requested sector behind an address is already present in L1. If not, we fall through to L2 and check for its existence there. If neither cache contains it, the sector is fetched from device memory. Our L2 cache is the point of truth for all streaming multiprocessors. What I mean by that is that all memory traffic has to go through L2, including PCIe copies or device memory writes. Only if a sector to be read is already present in L1 is the L2 cache not used. The L2 cache ensures that no SM reads invalid data from device memory that has not been written back, as it is shared across the whole GPU, while L1 memory is local to one streaming multiprocessor.

We can reduce the traffic between L2 and device memory by increasing the L2 hit rate, and we can accordingly reduce the traffic between L1 and L2 with a higher L1 hit rate. Of course, traffic can also be reduced by changing the algorithm to simply require less data to be transferred. Those hit rates are important to ensure low latency of memory requests, as fetches from device memory take many cycles.

To understand more about the communication between the different types of memory and about how the hit rate is calculated, it is important to understand at what granularity the different units communicate. Cache lines are 128 bytes long in L1 and L2, but any communication on the GPU is done in sectors, whose size is only 32 bytes. While on newer hardware more than one sector can be communicated at once, a single sector is still the smallest communication possible in the memory system. The concept of cache lines is important, as we cannot allocate single sectors in cache but always full cache lines, into which we then read contiguous sectors.

As a practical example of why cache line allocation is important, we will assume a cache of only two cache lines. If we now read a picture in which each row is over 128 bytes long, we see the following. Reading the first entry of the first row and the first entry of the second row, we would allocate a cache line each time. If we now read the third row, we have to invalidate a cache line to read the next. We would not be able to reuse any cache lines. This essentially lowers our possible cache storage to 25%. If we read the same memory in rows rather than columns, we read a full cache line, then the next, and so on. After reading both full cache lines we again have to invalidate, but we were able to hold on to many more sectors, and therefore more values, for much longer. Especially with wider prefetches that transfer two sectors at a time, we also improve our latency.

The previous explanation holds for both the L2 and the L1 cache. If we read 32 consecutive bytes in one SM, we would have a hit rate of 97% in L1. This would result in 0% in L2, because we sent only one request to L2, and after that has been executed all data is present in L1 and will be a cache hit on further requests. If, on the other hand, we have a kernel that reads values with a stride of 32 bytes, meaning only the first value within each sector, we get a 0% L1 cache hit rate. L2 will only have to allocate a new cache line for every fourth value when reading with a 32-byte stride, but it has to request data from device memory on each request, if we assume no prefetch is happening in this case. The earlier we have a cache hit, the lower the memory latency is, with device memory clearly being the worst case because this memory is not on chip.

We have not talked about shared memory yet, and especially the improvements that it got with Ampere. Shared memory is per-SM memory and is effectively L1 memory, with the difference that this memory is not written back to device memory and is meant as a workspace memory. By requesting shared memory, one reduces the available L1 memory. This way data can be shared across threads in a block running on the same SM without going through L2. This is also an effective measure to manage the L1 cache manually if the desired caching is not achieved automatically. With Ampere, the load-global-store-shared instruction was introduced, which allows reading data directly into shared memory without going through unmanaged L1 memory first. Just a brief outlook: preloading data to shared memory is a commonly used pattern for sliding-window operators like a convolution. All threads in the same CTA, or also SM, are used to preload data to shared memory. After that, the data is accessed by all the other threads, and the weights are persistent in L1 cache, for example.

This already brings us to the load and store address spaces used in our kernel, here shown as green boxes. The most basic loads are global and local loads, which load raw byte data. For surface and texture loads we can possibly see different indexing schemes, for example for Z-order textures. Furthermore, we see the shared load operation for Ampere and later GPUs that we talked about before.

It is important to understand the difference between local and global loads and stores. Any local load and store is usually something we want to get rid of. The origin of these local operations is often register spilling, for which I'll have a link in the video description. This means the kernel is not able to keep its working data in the physical register file and has to write it back to memory. This has very strong performance implications. Finding the origin of those local operations can be done using the address space column of the memory chart, or by using the source page of Nsight Compute and looking for LDL and STL instructions. The usage of this source page will be covered in a separate tutorial. Looking at it can be very helpful to identify the exact line of code that is responsible for this local access.

All absolute values that we see in the memory workload analysis chart are shown in the table below, which I will not go into detail about right now, but I do want to give some example usage of this section.

The code for the following example can be found on GitHub and is linked in the video description. We will profile this very simple program to convert an 8-bit PNG from RGBA to grayscale. What we do in our kernel is load four consecutive pixel values, which are our corresponding RGBA values, apply a multiplier to each, and add them up to a single grayscale float value. Then we downcast and save it back to our output buffer. For now I collapsed the optimized versions; please explore the code for those on GitHub.

First, let's make sure we have our theoretical values set. We are reading a four-channel uint8 picture of size 3,840 by 2,160 and writing the same size but with only one uint8 channel, so we expect a load of around 33 megabytes and a store of around 8 megabytes. While Nsight Compute can give us a good measurement of how far we are from the CUDA device hardware limits, it cannot tell us if our processing approach is suboptimal. This is something we should always keep in mind.

I will use Nsight Compute's capability to profile remotely over an SSH connection. We will start by creating a new project and establishing our connection. I usually name my project with a %i in the name; that way Nsight Compute will simply enumerate the profiles I take, without overwriting an old file or me needing to remove an old file. To get the memory chart analysis we have to select either the detailed profile or the full profile. I recommend simply going with the full profile. Furthermore, I also like to import the source in the Other tab, as it allows me to track my source changes across my profiles in case I do not version them correctly. Now we can simply hit the launch button. On the first launch it will upload all needed binaries, which can actually take some time. I've done that already, so our baseline will pop straight up.

Before we dive into our memory workload analysis, we can already see in the speed of light section that our memory speed of light is already at 80%. This usually indicates a very well-performing kernel, but our compute speed of light is very low; to me it seems lower than expected. Looking at the memory workload analysis section, we see that the link to device memory is at very low peak utilization, judging by its color, but the link between L2 and L1 shows much more data flow than we would expect given our pointwise operation. The cache hit ratio for L2 also seems superb. But things are sometimes not as good as they look, which is why it is important to understand why we calculated the previously mentioned theoretical numbers. We want to know if we are really limited by hardware or if our algorithm is suboptimal. Nsight Compute can only show correct speed of light measurements if the algorithm used is limited by hardware. It cannot tell us if we are doing suboptimal or superfluous processing.

What is happening here is that, as we read 256 pixels vertically in one block, each thread requests data for one pixel. Assuming an uninitialized cache, this leads to allocating a cache line in L2 and L1, then loading the sector of that line from global memory in which the pixel data is located. As we are processing vertically, we request 256 cache lines and 256 sectors, but we only read 256 * 4 bytes from those sectors, although for each of those 256 sectors we have to transmit 32 bytes. Even worse, the same is happening for storing our grayscale data. To store data we also request 256 sectors, store one byte to each, and write the complete sector back. We are actually lucky with our problem not to generate even more device memory requests from L2, because we have so little data that it fits in our L2 cache.

In this case Nsight Compute even provides warnings that indicate that our algorithm is not doing what it should do. They tell us very clearly that we read little data of our requested sectors and that we should try using more sectors. This kernel accesses on average only one sector out of the possible four sectors per cache line. This is exactly what we just talked about.

To compare results more easily, we will set the current unaligned report as our baseline. Changing the access from columns to rows is possible for this processing by simply changing the block dimension in the kernel launch. I predefined some optimization levels, so let's just launch with another command-line argument, which takes care of this. Now, looking at the aligned reads, we immediately see that we are about eight times faster. The speed of light section also indicates higher throughput, with less of a gap between compute and memory. In the memory chart, our link between device memory and L2 is very close to peak performance. We load the expected amount of data from L2 to L1, and thinking about our cache hit rate we also see what we expect: a 75% L1 hit rate, which is due to requesting all bytes of the neighboring pixels in the neighboring threads.

What else can we learn? Given the report, we know that the kernel is heavily memory latency bound, as we have to wait for device memory on each pixel's compute. In such cases it can be beneficial to process multiple values in one thread. Let's check what we gain if we process four values. After profiling we see we did not gain much, if anything. Nonetheless, let's investigate the memory chart more closely. We see that the requests to global memory did not decrease, nor did the L2 to L1 traffic, which we also did not expect. But our L1 store to L2 increased and does not match the expected 8 megabytes. Why is that the case? We are processing four pixels per thread now, but after every pixel we write one byte to our image. This needs to be flushed to L2, which artificially increases the throughput from L1 to L2, as with a uint8 we write so little data at a time. Threads that write to the same sector at the same time can actually be grouped together in one single write operation, but the wait period for this is very short.

This brings us to vectorized loads, which is the last iteration on this kernel. We can use one uchar4 to load all values for one pixel, but we would still have to write back a single uint8. Going with even wider operations ends up at reading one uint4, which then lets us process four pixels on one thread and write back as one uchar4. This resembles basically exactly what we have been doing in the multiple-value case before. With both these optimizations we have to make sure that we have enough work available, so that we do not reduce the number of blocks in our grid so much that we are no longer able to saturate the GPU.

After reprofiling and taking another look at the memory chart, we see that our write to L2 is back to the expected 8 megabytes. In addition, we have no pressure anymore on the load/store unit and did cut our requests by over 90%. This can also be seen when comparing the compute speed of light section of our aligned report and the vectorized load, as we have to do less indexing math on the pointers. Nonetheless, the kernel did not get much faster, as we were already limited by device memory in the previous version. Assuming our initial case of reading in columns is the way we must process our data, this approach can also be very helpful to make each load a little less costly.

In conclusion, we can say that the best use of the memory workload section is to ensure that you load only as much memory as you really need, and that aligning read accesses with sector or even cache line boundaries is very beneficial. For alignment issues there will also be hints in the source section. This will help to identify memory stalls together with source correlation; this is part of a later tutorial. As a follow-up to this video, I can only recommend the excellent documentation in our Kernel Profiling Guide, which has even more details on the memory workload analysis section than I could ever cover in this video. Thank you very much for listening, and I hope you learned a lot about the memory subsystem of an NVIDIA GPU and how to profile using Nsight Compute.
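The theoretical numbers used in the walkthrough can be reproduced with a few lines of arithmetic. The sketch below is not taken from the sample code. It is a rough Python model under stated assumptions: the function names are hypothetical, the column-wise case is modelled pessimistically as one full 32-byte sector moved per pixel with no reuse, and the instruction count assumes one thread handling four pixels with either scalar one-byte accesses or a single 16-byte load plus a single 4-byte store.

```python
SECTOR_BYTES = 32            # smallest transfer unit, as stated in the transcript
WIDTH, HEIGHT = 3840, 2160   # image size used in the walkthrough
RGBA_BYTES = 4               # four uint8 channels read per pixel
GRAY_BYTES = 1               # one uint8 written per pixel
PIXELS_PER_THREAD = 4        # assumption: the multi-value / vectorized variants


def ideal_bytes(pixels, bytes_per_pixel):
    """Bytes that must move if every transferred byte is useful."""
    return pixels * bytes_per_pixel


def one_sector_per_pixel_bytes(pixels):
    """Pessimistic model: every pixel access moves a whole sector, no reuse."""
    return pixels * SECTOR_BYTES


def memory_instructions_per_thread(vectorized):
    """Loads plus stores one thread issues for its pixels (hypothetical count)."""
    if vectorized:
        return 1 + 1  # one 16-byte load, one 4-byte store
    loads = PIXELS_PER_THREAD * RGBA_BYTES   # sixteen 1-byte loads
    stores = PIXELS_PER_THREAD * GRAY_BYTES  # four 1-byte stores
    return loads + stores


pixels = WIDTH * HEIGHT
ideal_load = ideal_bytes(pixels, RGBA_BYTES)
ideal_store = ideal_bytes(pixels, GRAY_BYTES)
load_amplification = one_sector_per_pixel_bytes(pixels) / ideal_load
store_amplification = one_sector_per_pixel_bytes(pixels) / ideal_store
scalar = memory_instructions_per_thread(vectorized=False)
vector = memory_instructions_per_thread(vectorized=True)
request_reduction = 1 - vector / scalar

print(f"ideal load:  {ideal_load / 1e6:.1f} MB")
print(f"ideal store: {ideal_store / 1e6:.1f} MB")
print(f"column-wise load amplification:  {load_amplification:.0f}x")
print(f"column-wise store amplification: {store_amplification:.0f}x")
print(f"memory instructions saved by vectorizing: {request_reduction:.0%}")
```

Under these assumptions the ideal traffic comes out at about 33 MB loaded and 8 MB stored, matching the figures quoted above. The eightfold load amplification in the column-wise model is at least in line with the roughly eightfold speedup reported for the aligned version, and the instruction count lands near the quoted reduction in requests. The profiler counts requests per warp on real hardware, so these values are only a plausibility check, not a prediction.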

Original Description

This tutorial video introduces memory workload analysis for CUDA applications with NVIDIA Nsight Compute. Memory bottlenecks can limit the performance of your GPU. This is especially true for content creation and other workloads that involve large amounts of data quickly streaming through memory. Use Nsight Compute memory workload analysis to maximize GPU memory bandwidth and optimize data access patterns.

Highlights of this video tutorial include:

Memory analysis chart:
▫️ This chart visualizes hardware memory locality and memory type, including the amount of read or written bytes between physical units.

Overview of caches:
▫️ Memory requests in the kernel follow a hierarchy. L1 is checked first, then L2, and if the sector is not found, it is fetched from device memory.

Optimizing caches:
▫️ Cache line allocation is crucial for optimal performance, ensuring efficient use of cache storage and reducing memory traffic between L1, L2, and device memory.

Live demonstration:
▫️ Walkthrough optimizing a simple CUDA program that converts 8-bit PNGs from RGBA to grayscale. We inspect the impact of aligned reads and vectorized loads on memory efficiency.

Interpreting memory analysis:
▫️ Key tips for how to read memory profiles to address the balance between hardware limitations and algorithmic efficiency.

0:00 - Introduction
0:58 - Memory Chart
3:51 - Cache Line Allocation
4:56 - L1 and L2 Cache
7:15 - Load and Store Address Spaces
8:48 - Sample Code
9:56 - Memory Workload Analysis
12:02 - Reading RGBA Values
13:08 - Aligned Loads
15:54 - Vectorized Loads
17:32 - Conclusion

Important resources:
▫️ Introduction to NVIDIA Nsight Compute: https://www.youtube.com/watch?v=Iuy_RAvguBM
▫️ SOL Analysis with NVIDIA Nsight Compute: https://www.youtube.com/watch?v=uHN5fpfu8As
▫️ Memory Management on Modern GPU Architectures: https://resources.nvidia.com/gtcd-2020/GTC2020cwe21754?lx=3X9y6T
▫️ Sample code: https://github.com/NVIDIA/nsight-training/tree/master/cuda/nsight_c
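The lookup order described in the highlights (L1 first, then L2, then device memory) can be illustrated with a small toy model. This is a minimal sketch, not how the hardware or Nsight Compute computes its metrics: `SectorCache` and `simulate` are hypothetical names, both caches start empty, have unlimited capacity, never evict, and do no prefetching, and a hit or miss is counted once per distinct 32-byte sector touched by a request. The third access pattern additionally assumes 32 threads that each issue four separate one-byte loads for the R, G, B, and A values of neighboring pixels.

```python
SECTOR_BYTES = 32       # sector size from the transcript
SECTORS_PER_LINE = 4    # 128-byte cache line / 32-byte sector


class SectorCache:
    """Toy cache: unlimited capacity, starts empty, never evicts."""

    def __init__(self):
        self.resident = set()
        self.lookups = 0
        self.hits = 0

    def lookup(self, sector):
        """Return True on a hit; on a miss, record the sector as filled."""
        self.lookups += 1
        if sector in self.resident:
            self.hits += 1
            return True
        self.resident.add(sector)
        return False

    def hit_rate(self):
        return self.hits / self.lookups if self.lookups else 0.0


def simulate(requests):
    """Run requests (each a list of byte addresses) through L1, then L2."""
    l1, l2 = SectorCache(), SectorCache()
    device_fetches = 0
    for addresses in requests:
        # A request touches each distinct sector once, however many bytes it uses.
        for sector in sorted({addr // SECTOR_BYTES for addr in addresses}):
            if l1.lookup(sector):
                continue            # served by L1, L2 never sees it
            if not l2.lookup(sector):
                device_fetches += 1  # missed in both caches
    lines = {sector // SECTORS_PER_LINE for sector in l1.resident}
    return {"l1": l1.hit_rate(), "l2": l2.hit_rate(),
            "device_fetches": device_fetches, "cache_lines": len(lines)}


# 32 consecutive one-byte reads, one per request.
consecutive = simulate([[addr] for addr in range(32)])
# Eight reads with a 32-byte stride: the first byte of each sector only.
strided = simulate([[addr] for addr in range(0, 8 * SECTOR_BYTES, SECTOR_BYTES)])
# 32 threads, four requests (R, G, B, A), thread t reads byte 4*t + channel.
rgba_rows = simulate([[4 * t + ch for t in range(32)] for ch in range(4)])

for name, result in [("consecutive", consecutive), ("strided", strided),
                     ("rgba_rows", rgba_rows)]:
    print(name, result)
```

In this model the consecutive reads give an L1 hit rate of 31/32 (about 97%) with a single L2 request that misses, and the strided reads give 0% in L1 with one device fetch per read but only one new cache line for every four values. The row-wise RGBA pattern gives 75% in L1, which is one possible way to arrive at the figure quoted for the aligned kernel.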

This video tutorial teaches how to use NVIDIA Nsight Compute for memory workload analysis and optimization, covering key concepts such as memory hierarchy, cache management, and kernel development. By following the steps outlined in the tutorial, developers can improve the performance of their CUDA applications by reducing memory bottlenecks and increasing cache hit rates.

Key Takeaways

Create a new project in NVIDIA Nsight Compute
Establish an SSH connection
Launch the profile with the full profile option
Enable source import in the Other tab
Track source changes across profiles
Analyze memory workload using NVIDIA Nsight Compute
Identify memory stalls and alignment issues
Optimize kernel development for better performance

💡 Memory workload analysis is crucial for optimizing the performance of CUDA applications, and NVIDIA Nsight Compute provides a powerful tool for identifying memory bottlenecks and improving cache hit rates.

Chapters (11)

0:00 Introduction

0:58 Memory Chart

3:51 Cache Line Allocation

4:56 L1 and L2 Cache

7:15 Load and Store Address Spaces

8:48 Sample Code

9:56 Memory Workload Analysis

12:02 Reading RGBA Values

13:08 Aligned Loads

15:54 Vectorized Loads

17:32 Conclusion
