Inference Engines (Part 1)
GTC Sessions:
https://www.nvidia.com/gtc/session-catalog/sessions/gtc26-s82448/?ncid=ref-inpa-249-prsp-en-us-1-l33 (Deploying AI Agents at Enterprise Scale)
https://www.nvidia.com/gtc/session-catalog/sessions/gtc26-s81558/?ncid=ref-inpa-249-prsp-en-us-1-l33 (Post-Training Nemotron With RL)
NVIDIA 4080 Super Giveaway:
https://docs.google.com/forms/d/1K_70PPbO69ygP32h6PwjDmw8pSeUS97Tk82RVUvHBRY/edit?usp=sharing
Inference is an important but rather underappreciated topic, especially given the potential gains in how fast and efficiently we can run the underlying models. As models grow and architectures become more complex, it's important to understand the key components involved in actually running these models for inference.
How did they change over the years? How have advancements in NVMe, PCIe, and HBM affected them? And how will SGLang, vLLM, NVIDIA Dynamo, and TensorRT be shaped going forward?
#ai #deeplearning #inference #datacenters
Chapters
00:00 Intro
01:18 Model Parallelism
02:26 MP Benefits
02:41 SLO
04:19 MP Limitations
04:44 Inference Engine
05:30 Batching
06:46 KV Cache
07:34 Part 2?
07:54 GTC 2026
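The KV Cache chapter above refers to a standard decoding optimization: during autoregressive generation, each token's attention keys and values are stored once and reused, so earlier tokens are never re-projected at every step. A minimal dependency-free sketch of the idea (all names and shapes here are illustrative, not from the video):

```python
import math

def attend(q, keys, values):
    # Scaled dot-product attention for one query vector
    # over the cached keys/values of the sequence so far.
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(scores)                       # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]   # softmax over cached positions
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(d)]

class KVCache:
    """Toy per-sequence cache: append one key/value per decoded token,
    so the cost of a step grows with sequence length only through the
    attention itself, not through recomputing past projections."""
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, q, k, v):
        self.keys.append(k)
        self.values.append(v)
        return attend(q, self.keys, self.values)

# Usage: decode 3 tokens with dummy 4-dim query/key/value vectors.
cache = KVCache()
for t in range(3):
    vec = [float(t + 1)] * 4
    out = cache.step(vec, vec, vec)

print(len(cache.keys))  # 3 cached positions
print(len(out))         # output has the head dimension, 4
```

Real engines such as vLLM go further (e.g. paging the cache in fixed-size blocks), but the core bookkeeping is the append-and-reuse pattern shown here.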