Tour De Force: LLM Inference Optimization From Simple To Sophisticated - Christin Pohl, Microsoft

Uploaded: 2026-04-20T20:22:24Z
Channel: PyTorch

Making your GPUs go brrr is complex. Efficient LLM inference requires navigating a maze of optimization techniques, each with different trade-offs. This session provides a practical journey through inference optimizations, clearly categorized by implementation effort.

We'll explore techniques across three levels:

- Model choices (start here): Model selection, quantization, smart routing
- Library-level improvements (using PyTorch-based frameworks like vLLM, SGLang, TensorRT-LLM): Continuous batching, KV-cache management, tensor parallelism
- Custom implementations: Speculative decoding with custom draft heads, disaggregated inference, fine-tuning smaller models

The session covers practical trade-offs and key metrics: time to first token, inter-token latency, throughput, and cost per token. Whether deploying your first model or optimizing at scale, this talk delivers actionable insights into which techniques to prioritize for deeper investigation.
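
The four metrics named in the description can be made concrete with a small sketch. This is not taken from the talk: the `RequestTrace` structure, the function names, and the single-GPU hourly price used for cost are all hypothetical, and the definitions follow common usage (inter-token latency as the mean gap between consecutive output tokens, throughput as output tokens per second of wall-clock time).

```python
from dataclasses import dataclass


@dataclass
class RequestTrace:
    """Timing for one streamed request (hypothetical structure, seconds)."""
    sent_at: float             # when the request was sent
    token_times: list[float]   # arrival time of each output token


def time_to_first_token(trace: RequestTrace) -> float:
    """Delay between sending the request and receiving the first token."""
    return trace.token_times[0] - trace.sent_at


def inter_token_latency(trace: RequestTrace) -> float:
    """Mean gap between consecutive output tokens (first token excluded)."""
    gaps = [b - a for a, b in zip(trace.token_times, trace.token_times[1:])]
    return sum(gaps) / len(gaps) if gaps else 0.0


def throughput(traces: list[RequestTrace]) -> float:
    """Output tokens per second over the wall-clock window of all requests."""
    start = min(t.sent_at for t in traces)
    end = max(t.token_times[-1] for t in traces)
    total_tokens = sum(len(t.token_times) for t in traces)
    return total_tokens / (end - start)


def cost_per_token(gpu_cost_per_hour: float, tokens_per_second: float) -> float:
    """Cost of one output token, assuming a flat hourly price (an assumption)."""
    return gpu_cost_per_hour / 3600.0 / tokens_per_second
```

Time to first token and inter-token latency are per-request measures, while throughput and cost per token describe the whole serving system, which is one reason an optimization can improve one group and hurt the other.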
