The Future of Efficient LLM Serving: A Deep Dive with Travis Adair | Predibase
SUBSCRIBE for the latest on LLM fine-tuning, AI scaling, and reinforcement learning hacks!
https://www.youtube.com/@Predibase
Try Predibase's RFT Platform: https://predibase.com/free-trial
Schedule a live demo: https://pbase.ai/41FZKfy
Discover how Predibase's serving stack, featuring Lorax (LoRA Exchange) and TurboLoRA, helps organizations overcome common inference bottlenecks. Travis explains key concepts like:
✅ Multi-LoRA Serving: efficiently serving dozens or even hundreds of fine-tuned adapters on a single GPU (see the first sketch below).
✅ Speculative Decoding: drastically improving inference speed b… (see the second sketch below).
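To make the first bullet concrete, here is a minimal NumPy sketch of the multi-LoRA batching idea, assuming a single linear layer and hypothetical adapter names; it illustrates the general technique, not the Lorax implementation:

```python
import numpy as np

# Toy multi-LoRA batching sketch (illustrative; not the Lorax API).
# All requests share one matmul through the frozen base weight W; each
# request then adds its own adapter's low-rank update x @ A @ B, which is
# why many adapters can be served from a single copy of the base model.

def multi_lora_forward(x, W, adapters, adapter_ids):
    """x: (batch, d_in); W: (d_in, d_out) shared base weight;
    adapters: name -> (A, B) with shapes (d_in, r) and (r, d_out), r small."""
    y = x @ W  # one batched pass through the shared base model
    for i, name in enumerate(adapter_ids):
        A, B = adapters[name]
        y[i] += x[i] @ A @ B  # per-request low-rank correction
    return y

rng = np.random.default_rng(0)
d_in, d_out, r = 16, 8, 2
W = rng.normal(size=(d_in, d_out))
# Hypothetical adapter names, one fine-tune per task.
adapters = {name: (rng.normal(size=(d_in, r)), rng.normal(size=(r, d_out)))
            for name in ("support-bot", "sql-gen", "summarizer")}
x = rng.normal(size=(3, d_in))  # three requests, each routed to a different adapter
print(multi_lora_forward(x, W, adapters, ["support-bot", "sql-gen", "summarizer"]).shape)
```

A production stack would fuse the per-adapter updates into batched GPU kernels rather than a Python loop, but the memory story is the same: one base model plus many small (A, B) pairs.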
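The second bullet is easiest to see as a toy algorithm. Below is a self-contained Python sketch of speculative decoding that uses deterministic stand-in "models" (plain functions) instead of real LLMs: a cheap draft model proposes k tokens, the expensive target model verifies them, and the longest agreeing prefix is accepted.

```python
# Toy speculative decoding sketch (illustrative stand-ins, not a real LLM API).

def make_toy_model(shift):
    """Stand-in for an LLM: deterministically maps a context to a next token."""
    def next_token(ctx):
        return (sum(ctx) * 31 + shift) % 1000
    return next_token

def speculative_decode(target, draft, prompt, k=4, max_new=16):
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new:
        # 1. Draft phase: k cheap sequential guesses from the small model.
        ctx, guesses = list(tokens), []
        for _ in range(k):
            t = draft(ctx)
            guesses.append(t)
            ctx.append(t)
        # 2. Verify phase: the target scores positions 0..k; on a real model
        #    this is a single batched forward pass, not k+1 separate ones.
        ctx, verified = list(tokens), []
        for i in range(k + 1):
            verified.append(target(ctx))
            if i < k:
                ctx.append(guesses[i])
        # 3. Accept the longest prefix where draft and target agree, then
        #    take one token from the target so decoding always advances.
        n = 0
        while n < k and guesses[n] == verified[n]:
            n += 1
        tokens.extend(guesses[:n])
        tokens.append(verified[n])
    return tokens

# When the draft agrees with the target, each loop emits up to k+1 tokens
# for roughly the cost of one large-model step.
print(speculative_decode(make_toy_model(7), make_toy_model(7), [1, 2, 3]))
```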
Chapters (18)
Intro to Let's Talk Tokens (1:25)
Travis's Primer: An Overview of the Predibase Serving Stack (3:40)
The Motivation Behind Lorax (6:50)
The Problem with One-Size-Fits-All Models (8:45)
The Speculative Decoding Primer (12:55)
How TurboLoRA Works (16:40)
Audience Q&A: The Origins of Lorax (19:00)
Other Architectural Considerations for Multi-LoRA Serving (21:05)
The Best Use Cases for Speculative Decoding (23:30)
How to Balance Cost and Latency (25:20)
The "Truth" about MoE Architecture and Fine-Tuning (27:15)
How to Choose Between Fine-Tuning or Frontier Models (31:40)
The Role of the Scheduler (34:00)
The Evolving Definition of "Efficiency" (36:50)
Prioritizing Optimizations: Quantization & Prefix Caching (41:40)
The Importance of Staged Rollouts and Shadow Traffic (43:55)
Travis's Hot Take on the Future of Inference (45:30)
Wrap Up