Learn vLLM: Improving throughput with --max-num-batched-tokens on DeepSeek R1 8B running on a single L4

Samos123 · Beginner · 🧠 Large Language Models · 1y ago
Learn about the --max-num-batched-tokens argument as we deploy DeepSeek R1 8B using vLLM on a single L4 GPU. We run a benchmark with and without the argument to see how much of a performance gain we can get. KubeAI was used to deploy on Kubernetes: https://github.com/substratusai/kubeai This is a follow-up to a previous video where we learned how to get DeepSeek R1 8B running: https://youtu.be/-l-YhlD4geU
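For reference, a minimal sketch of what passing this flag to vLLM might look like on the command line. The model ID and the other flag values here are assumptions for illustration (the distilled 8B R1 checkpoint, a modest context length, and a raised token-batch budget), not the exact settings used in the video:

```shell
# Hedged sketch, not the exact command from the video.
# --max-num-batched-tokens caps how many tokens vLLM will schedule
# into a single forward pass across all in-flight requests; raising
# it can improve throughput at the cost of per-token latency and
# extra GPU memory for activations.
vllm serve deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
  --max-model-len 8192 \
  --max-num-batched-tokens 8192 \
  --gpu-memory-utilization 0.95
```

When deploying through KubeAI instead of the raw CLI, the same flag would typically be passed via the Model resource's args list, so the benchmark comparison amounts to deploying once with and once without that entry.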
Watch on YouTube ↗