The Future of Efficient LLM Serving: A Deep Dive with Travis Adair | Predibase

Predibase by Rubrik · Intermediate · 🧠 Large Language Models · 7mo ago
🔔 SUBSCRIBE for the latest on LLM fine-tuning, AI scaling, and reinforcement learning hacks! 👉 https://www.youtube.com/@Predibase 🔗 Try Predibase's RFT Platform: https://predibase.com/free-trial 👉 Schedule a live demo: https://pbase.ai/41FZKfy Discover how Predibase's serving stack, featuring Lorax (LoRA Exchange) and TurboLoRA, helps organizations overcome common inference bottlenecks. Travis explains key concepts like: ✅ Multi-LoRA Serving: efficiently serving dozens or even hundreds of fine-tuned adapters on a single GPU. ✅ Speculative Decoding: drastically improving inference speed b…
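The speculative decoding idea mentioned above can be sketched in a few lines. This is a toy illustration of the general technique (not Predibase's TurboLoRA implementation): a cheap draft model proposes several tokens at once, and the expensive target model verifies them, accepting the longest agreeing prefix. The `draft_next` and `target_next` functions below are hypothetical stand-ins for real models.

```python
def target_next(prefix):
    # Stand-in for the large target model's greedy next token.
    return sum(prefix) % 7

def draft_next(prefix):
    # Stand-in for the small draft model: approximates the target,
    # but can diverge (here it ignores the first context token).
    return sum(prefix[1:]) % 7

def speculative_step(prefix, k=4):
    # 1) Draft model proposes k tokens autoregressively (cheap).
    proposal, ctx = [], list(prefix)
    for _ in range(k):
        t = draft_next(ctx)
        proposal.append(t)
        ctx.append(t)
    # 2) Target model verifies the proposals (a single batched
    #    forward pass in a real system).
    accepted, ctx = [], list(prefix)
    for t in proposal:
        correct = target_next(ctx)
        if t == correct:
            accepted.append(t)
            ctx.append(t)
        else:
            # First mismatch: keep the target's token and stop.
            accepted.append(correct)
            break
    else:
        # All k proposals accepted: the verify pass yields one bonus token.
        accepted.append(target_next(ctx))
    return accepted

# Each step emits at least one token that matches pure greedy target decoding,
# and up to k+1 tokens when the draft model agrees with the target.
print(speculative_step([1, 2, 3]))
```

The key property is that the output is identical to decoding with the target model alone; the speedup comes from amortizing the target model's forward passes over multiple draft tokens.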
Watch on YouTube ↗

Chapters (18)

Intro to Let's Talk Tokens
1:25 Travis's Primer: An Overview of the Predibase Serving Stack
3:40 The Motivation Behind Lorax
6:50 The Problem with One-Size-Fits-All Models
8:45 The Speculative Decoding Primer
12:55 How TurboLoRA Works
16:40 Audience Q&A: The Origins of Lorax
19:00 Other Architectural Considerations for Multi-LoRA Serving
21:05 The Best Use Cases for Speculative Decoding
23:30 How to Balance Cost and Latency
25:20 The "Truth" about MOE Architecture and Fine-Tuning
27:15 How to Choose Between Fine-Tuning or Frontier Models
31:40 The Role of the Scheduler
34:00 The Evolving Definition of "Efficiency"
36:50 Prioritizing Optimizations: Quantization & Prefix Caching
41:40 The Importance of Staged Rollouts and Shadow Traffic
43:55 Travis's Hot Take on the Future of Inference
45:30 Wrap Up