Tour De Force: LLM Inference Optimization From Simple To Sophisticated - Christin Pohl, Microsoft
Making your GPUs go brrr is complex. Efficient LLM inference requires navigating a maze of optimization techniques, each with its own trade-offs. This session provides a practical journey through inference optimizations, categorized by implementation effort.
We'll explore techniques across three levels:
- Model choices (start here): Model selection, quantization, smart routing
- Library-level improvements (using PyTorch-based frameworks like vLLM, SGLang, TensorRT-LLM): Continuous batching, KV-cache management, tensor parallelism
- Custom implementations: Speculative decoding with custom draft heads, disaggregated inference, fine-tuning smaller models
The session covers practical trade-offs and key metrics: time to first token, inter-token latency, throughput, and cost per token.
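To make these four metrics concrete, here is a minimal sketch of how they might be computed from per-token timestamps. It is not taken from the talk: the `RequestTrace` structure, the function names, and the single-GPU hourly cost model are all illustrative assumptions, and real serving stacks may define these metrics slightly differently (for example, percentile rather than mean latency).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTrace:
    """Hypothetical record of one generation request (all times in seconds)."""
    sent_at: float            # when the request was submitted
    token_times: List[float]  # arrival time of each generated token, in order


def time_to_first_token(trace: RequestTrace) -> float:
    """Delay between submitting the request and receiving the first token."""
    return trace.token_times[0] - trace.sent_at


def inter_token_latency(trace: RequestTrace) -> float:
    """Mean gap between consecutive tokens after the first one."""
    gaps = [b - a for a, b in zip(trace.token_times, trace.token_times[1:])]
    return sum(gaps) / len(gaps) if gaps else 0.0


def throughput(traces: List[RequestTrace]) -> float:
    """Generated tokens per second over the wall-clock window of all requests."""
    total_tokens = sum(len(t.token_times) for t in traces)
    start = min(t.sent_at for t in traces)
    end = max(t.token_times[-1] for t in traces)
    return total_tokens / (end - start)


def cost_per_token(gpu_cost_per_hour: float, tokens_per_second: float) -> float:
    """Assumed cost model: one accelerator billed hourly at a steady throughput."""
    return gpu_cost_per_hour / (tokens_per_second * 3600)


if __name__ == "__main__":
    trace = RequestTrace(sent_at=0.0, token_times=[0.5, 0.6, 0.7, 0.8])
    print("TTFT (s):", time_to_first_token(trace))
    print("ITL (s):", inter_token_latency(trace))
    print("Throughput (tok/s):", throughput([trace]))
    print("Cost per token:", cost_per_token(3.6, 1000.0))
```

Time to first token and inter-token latency are measured per request, while throughput and cost per token describe the whole system, which is why optimizations that improve one group can hurt the other.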
Whether deploying your first model or optimizing at scale, this talk delivers actionable insights into which techniques to prioritize for deeper investigation.