How is hardware reshaping LLM design?

Julia Turc · Advanced · 🧠 Large Language Models · 2mo ago
Why can an NVIDIA H100 GPU theoretically generate 62,000 tokens per second, while in practice even the best inference engines struggle to reach 200 tokens per second? The answer lies in the memory wall. In this video, we break down why traditional auto-regressive LLMs are memory-bound. Using the roofline model, we analyze how GPU memory bandwidth, arithmetic intensity, and model architecture determine real-world LLM inference performance.

We'll cover:
• Why an 8B-parameter LLM must stream all of its weights from memory for every output token
• How HBM (High Bandwidth Memory) becomes the bottleneck
• What the roofline model tells us about GPU utilization (see the back-of-envelope sketch after this list)
• Why prefill is compute-bound but decode is memory-bound
• How inference engines like vLLM, SGLang, and TensorRT-LLM optimize batching and KV caching (see the KV-cache sketch below)
• How speculative decoding increases arithmetic intensity (see the speedup sketch below)
• Why diffusion-based LLMs shift the workload from memory-bound to compute-bound
• How new architectures like block diffusion combine the best of both worlds

We'll also discuss recent advances in diffusion LLM inference, including work like FOCUS and block-wise decoding strategies.

This video is sponsored by Inception (inceptionlabs.ai), who are building ultra-fast diffusion-based language models and the infrastructure to serve them efficiently. They recently introduced Mercury 2, a diffusion reasoning model designed for high-throughput inference.

📚 My full reading list (free): https://www.patreon.com/c/JuliaTurc
▶️ Other diffusion videos: https://youtube.com/playlist?list=PL4bm2lr9UVG3SN79Y6WBe4OOlEiO88vie&si=Tg4IgNkZdV9HSaGU
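
To make the 62,000 vs. 200 tokens/s gap concrete, here is a back-of-envelope roofline sketch in Python. The H100 SXM figures (≈989 TFLOP/s dense BF16 compute, ≈3.35 TB/s HBM3 bandwidth) are published peak specs, and the 2-FLOPs-per-parameter-per-token rule for a dense decoder-only transformer is a standard approximation; these are illustrative ceilings, not measurements.

```python
# Back-of-envelope roofline estimate for single-stream (batch-1) decoding.
# Peak specs are published H100 SXM numbers; 2 FLOPs/param/token for a
# dense decoder-only transformer is a rule-of-thumb approximation.

PEAK_FLOPS = 989e12       # H100 SXM dense BF16, FLOP/s
PEAK_BW    = 3.35e12      # H100 SXM HBM3 bandwidth, bytes/s

params       = 8e9                        # 8B-parameter model
weight_bytes = params * 2                 # BF16 weights: ~16 GB per token read
flops_per_token = 2 * params              # ~16 GFLOPs per decoded token

compute_bound_tps = PEAK_FLOPS / flops_per_token   # ~61,800 tokens/s
memory_bound_tps  = PEAK_BW / weight_bytes         # ~209 tokens/s

# Arithmetic intensity of batch-1 decode vs. the GPU's ridge point:
intensity   = flops_per_token / weight_bytes       # ~1 FLOP/byte
ridge_point = PEAK_FLOPS / PEAK_BW                 # ~295 FLOP/byte

print(f"compute-bound ceiling: {compute_bound_tps:,.0f} tokens/s")
print(f"memory-bound ceiling:  {memory_bound_tps:,.0f} tokens/s")
print(f"arithmetic intensity:  {intensity:.1f} FLOP/byte (ridge: {ridge_point:.0f})")
```

Batch-1 decode sits at roughly 1 FLOP per byte, far to the left of the ~295 FLOP/byte ridge point, so the GPU spends almost all of its time waiting on HBM rather than computing.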
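Batching raises arithmetic intensity because one weight read serves every sequence in the batch, but each sequence also carries a KV cache that competes with the weights for HBM. A minimal sketch of that ceiling, assuming a hypothetical Llama-3-8B-like geometry (32 layers, 8 KV heads, head dim 128, FP16 cache); the exact numbers depend on the model and serving engine:

```python
# How the KV cache caps batch size: each sequence stores keys and values
# for every layer and every past token. Geometry below is an assumed
# Llama-3-8B-like configuration; numbers are illustrative.

hbm_bytes    = 80e9               # H100 HBM capacity
weight_bytes = 16e9               # 8B params in BF16

layers, kv_heads, head_dim = 32, 8, 128
bytes_per_elem = 2                # FP16/BF16 cache entries
seq_len = 8192                    # tokens of context per sequence

# Factor of 2 covers keys and values:
kv_bytes_per_seq = 2 * layers * kv_heads * head_dim * bytes_per_elem * seq_len

max_batch = int((hbm_bytes - weight_bytes) // kv_bytes_per_seq)
print(f"KV cache per sequence: {kv_bytes_per_seq / 1e9:.2f} GB")   # ~1.07 GB
print(f"max batch size at {seq_len} tokens: {max_batch}")          # ~59
```

Paged KV caching (as in vLLM) exists precisely to pack this per-sequence memory more tightly so that batches can grow closer to this ceiling.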
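Speculative decoding lets the target model verify k drafted tokens in a single forward pass, so one full weight read can yield several accepted tokens. A small sketch using the standard geometric formula from the speculative decoding literature (Leviathan et al., 2023), where alpha is the per-token acceptance rate of the draft model (an assumed parameter here; in practice you would measure it):

```python
# Expected tokens produced per target-model forward pass with speculative
# decoding (geometric formula from Leviathan et al., 2023). alpha is the
# per-token draft acceptance rate (assumed, not measured).

def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """E[tokens] when k draft tokens are verified in one target pass."""
    if alpha == 1.0:
        return k + 1.0
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)

for alpha in (0.6, 0.8, 0.9):
    for k in (4, 8):
        e = expected_tokens_per_pass(alpha, k)
        print(f"alpha={alpha:.1f}, k={k}: ~{e:.2f} tokens per weight read")
```

Each target pass still streams the full weights once, so arithmetic intensity, and with it decode throughput, scales roughly with this expectation.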
Watch on YouTube ↗

Chapters (12)

0:00 Intro
1:13 High-level GPU architecture
4:14 The memory bottleneck
5:51 The roofline model
8:34 Auto-regressive LLMs are memory-bound
10:40 Batching
12:55 How KV caching caps batching
14:25 Speculative decoding
16:00 Diffusion LLMs are compute-bound
18:53 Reducing wasted diffusion computations
20:34 Block diffusion
21:42 Diffusion inference engines