Learn vLLM: Improving throughput with --max-num-batched-tokens on DeepSeek R1 8B running on a single L4

Samos123 · Beginner · 🧠 Large Language Models · 1y ago
Learn about the --max-num-batched-tokens argument as we deploy DeepSeek R1 8B using vLLM on a single L4 GPU. We run a benchmark with and without the argument to see how much of a performance gain we can get. KubeAI was used to deploy on Kubernetes: https://github.com/substratusai/kubeai This is a follow-up to a previous video where we learned how to get DeepSeek R1 8B running: https://youtu.be/-l-YhlD4geU
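For reference, a minimal sketch of what passing this flag to vLLM might look like on the command line. The model ID and the other flag values here are assumptions for illustration (the distilled 8B R1 checkpoint, a modest context length, and a raised token-batch budget), not the exact settings used in the video:

```shell
# Hedged sketch, not the exact command from the video.
# --max-num-batched-tokens caps how many tokens vLLM will schedule
# into a single forward pass across all in-flight requests; raising
# it can improve throughput at the cost of per-token latency and
# extra GPU memory for activations.
vllm serve deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
  --max-model-len 8192 \
  --max-num-batched-tokens 8192 \
  --gpu-memory-utilization 0.95
```

When deploying through KubeAI instead of the raw CLI, the same flag would typically be passed via the Model resource's args list, so the benchmark comparison amounts to deploying once with and once without that entry.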
Watch on YouTube ↗