Learn vLLM: Improving throughput with --max-num-batched-tokens on DeepSeek R1 8B running on a single L4
Learn about the --max-num-batched-tokens argument as we deploy DeepSeek R1 8B using vLLM on a single L4 GPU. We run a benchmark with and without the argument to see how much of a performance gain we can get.
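For reference, a minimal sketch of what tuning this knob looks like in vLLM's offline Python API, where it maps to the max_num_batched_tokens engine argument. The model ID, the value of 8192, and the toy benchmark loop are assumptions for illustration, not the exact setup from the video:

```python
import time
from vllm import LLM, SamplingParams

# Assumed model ID for "DeepSeek R1 8B"; adjust to match your deployment.
# max_num_batched_tokens caps how many tokens the scheduler packs into
# one forward pass; larger values can improve throughput at the cost of
# per-request latency and memory headroom (8192 is just an example value).
llm = LLM(
    model="deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    max_num_batched_tokens=8192,
)

# Toy throughput check: many identical prompts, fixed output length.
prompts = ["Explain KV caching in one sentence."] * 64
params = SamplingParams(max_tokens=128)

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated / elapsed:.1f} output tokens/s")
```

Running this once with max_num_batched_tokens set and once without (vLLM picks a default) gives a rough before/after comparison in the spirit of the benchmark shown in the video.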
We used KubeAI to deploy on Kubernetes: https://github.com/substratusai/kubeai
This is a follow-up to the previous video, where we learned how to get DeepSeek R1 8B running: https://youtu.be/-l-YhlD4geU