The Future of Efficient LLM Serving: A Deep Dive with Travis Adair | Predibase
SUBSCRIBE for the latest on LLM fine-tuning, AI scaling, and reinforcement learning hacks!
https://www.youtube.com/@Predibase
Try Predibase's RFT Platform: https://predibase.com/free-trial
Schedule a live demo: https://pbase.ai/41FZKfy
Discover how Predibase's serving stack, featuring Lorax (LoRA Exchange) and TurboLoRA, helps organizations overcome common inference bottlenecks. Travis explains key concepts like:
✅ Multi-LoRA Serving: efficiently serving dozens or even hundreds of fine-tuned adapters on a single GPU (see the first sketch below).
✅ Speculative Decoding: drastically improving inference speed b… (see the second sketch below).
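To make the first bullet concrete, here is a minimal NumPy sketch of the multi-LoRA batching idea, assuming a single linear layer and hypothetical adapter names; it illustrates the general technique, not the Lorax implementation:

```python
import numpy as np

# Toy multi-LoRA batching sketch (illustrative; not the Lorax API).
# All requests share one matmul through the frozen base weight W; each
# request then adds its own adapter's low-rank update x @ A @ B, which is
# why many adapters can be served from a single copy of the base model.

def multi_lora_forward(x, W, adapters, adapter_ids):
    """x: (batch, d_in); W: (d_in, d_out) shared base weight;
    adapters: name -> (A, B) with shapes (d_in, r) and (r, d_out), r small."""
    y = x @ W  # one batched pass through the shared base model
    for i, name in enumerate(adapter_ids):
        A, B = adapters[name]
        y[i] += x[i] @ A @ B  # per-request low-rank correction
    return y

rng = np.random.default_rng(0)
d_in, d_out, r = 16, 8, 2
W = rng.normal(size=(d_in, d_out))
# Hypothetical adapter names, one fine-tune per task.
adapters = {name: (rng.normal(size=(d_in, r)), rng.normal(size=(r, d_out)))
            for name in ("support-bot", "sql-gen", "summarizer")}
x = rng.normal(size=(3, d_in))  # three requests, each routed to a different adapter
print(multi_lora_forward(x, W, adapters, ["support-bot", "sql-gen", "summarizer"]).shape)
```

A production stack would fuse the per-adapter updates into batched GPU kernels rather than a Python loop, but the memory story is the same: one base model plus many small (A, B) pairs.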
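The second bullet is easiest to see as a toy algorithm. Below is a self-contained Python sketch of speculative decoding that uses deterministic stand-in "models" (plain functions) instead of real LLMs: a cheap draft model proposes k tokens, the expensive target model verifies them, and the longest agreeing prefix is accepted.

```python
# Toy speculative decoding sketch (illustrative stand-ins, not a real LLM API).

def make_toy_model(shift):
    """Stand-in for an LLM: deterministically maps a context to a next token."""
    def next_token(ctx):
        return (sum(ctx) * 31 + shift) % 1000
    return next_token

def speculative_decode(target, draft, prompt, k=4, max_new=16):
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new:
        # 1. Draft phase: k cheap sequential guesses from the small model.
        ctx, guesses = list(tokens), []
        for _ in range(k):
            t = draft(ctx)
            guesses.append(t)
            ctx.append(t)
        # 2. Verify phase: the target scores positions 0..k; on a real model
        #    this is a single batched forward pass, not k+1 separate ones.
        ctx, verified = list(tokens), []
        for i in range(k + 1):
            verified.append(target(ctx))
            if i < k:
                ctx.append(guesses[i])
        # 3. Accept the longest prefix where draft and target agree, then
        #    take one token from the target so decoding always advances.
        n = 0
        while n < k and guesses[n] == verified[n]:
            n += 1
        tokens.extend(guesses[:n])
        tokens.append(verified[n])
    return tokens

# When the draft agrees with the target, each loop emits up to k+1 tokens
# for roughly the cost of one large-model step.
print(speculative_decode(make_toy_model(7), make_toy_model(7), [1, 2, 3]))
```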
Chapters (18)
Intro to Let's Talk Tokens (1:25)
Travis's Primer: An Overview of the Predibase Serving Stack (3:40)
The Motivation Behind Lorax (6:50)
The Problem with One-Size-Fits-All Models (8:45)
The Speculative Decoding Primer (12:55)
How TurboLoRA Works (16:40)
Audience Q&A: The Origins of Lorax (19:00)
Other Architectural Considerations for Multi-LoRA Serving (21:05)
The Best Use Cases for Speculative Decoding (23:30)
How to Balance Cost and Latency (25:20)
The "Truth" about MoE Architecture and Fine-Tuning (27:15)
How to Choose Between Fine-Tuning or Frontier Models (31:40)
The Role of the Scheduler (34:00)
The Evolving Definition of "Efficiency" (36:50)
Prioritizing Optimizations: Quantization & Prefix Caching (41:40)
The Importance of Staged Rollouts and Shadow Traffic (43:55)
Travis's Hot Take on the Future of Inference (45:30)
Wrap Up