Deploying Mixtral on GKE with just 2 x 24 GB L4 GPUs
Lingo, open source ML Proxy and autoscaler for K8s: https://github.com/substratusai/lingo
Blog post with copy pasteable instructions: https://www.substratus.ai/blog/deploying-mixtral-gptq-on-gke-l4-gpus
Learn how to deploy Mixtral on GKE using just 2 x 24 GB L4 GPUs. We do this by using GPTQ quantization, which loads Mixtral in 4-bit mode.
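As a rough sketch of the memory math behind the chapter below: at fp16, Mixtral 8x7B's weights alone are far larger than two 24 GB L4s, while 4-bit GPTQ brings them down to roughly 23 GB. The ~46.7B total parameter count is an assumption taken from the published Mixtral 8x7B figures, and this ignores KV cache and activation overhead.

```python
# Back-of-the-envelope GPU memory estimate for Mixtral 8x7B weights.
# PARAMS (~46.7B total) is an assumed figure for Mixtral 8x7B; real
# serving also needs headroom for KV cache and activations.
PARAMS = 46.7e9

def weight_memory_gb(bits_per_param: float) -> float:
    """GB needed for the weights alone at the given precision."""
    return PARAMS * bits_per_param / 8 / 1e9

fp16 = weight_memory_gb(16)   # ~93 GB: does not fit on 2 x 24 GB L4s
gptq4 = weight_memory_gb(4)   # ~23 GB: fits across 2 x 24 GB L4s
print(f"fp16: {fp16:.1f} GB, 4-bit GPTQ: {gptq4:.1f} GB")
```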
0:00 - Introduction
0:12 - Calculating GPU memory required for Mixtral with GPTQ
1:40 - High-level overview of the steps to deploy Mixtral on GKE
2:20 - Create GKE cluster with L4 GPU nodepool
3:35 - Download the Mixtral model weights to PVC using K8s job
5:45 - Deploy Mixtral using the Helm vLLM chart
9:19 - Validate Mixtral is up and running, send a prompt
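The weight-download step above (3:35) can be sketched as a one-shot Kubernetes Job that writes the model into a PVC. This is a minimal illustration, not the exact manifest from the video: the PVC name, container image, and GPTQ model repo here are assumptions; the linked blog post has the copy-pasteable version.

```yaml
# Sketch: one-shot Job that downloads the quantized Mixtral weights
# into a PVC. Names, image, and model repo are illustrative assumptions.
apiVersion: batch/v1
kind: Job
metadata:
  name: download-mixtral
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: download
        image: python:3.11-slim
        command: ["sh", "-c"]
        args:
        - pip install huggingface_hub &&
          huggingface-cli download TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ
          --local-dir /model
        volumeMounts:
        - name: model
          mountPath: /model
      volumes:
      - name: model
        persistentVolumeClaim:
          claimName: mixtral-model   # assumed pre-created PVC
```

Running the download as a Job keeps the large pull out of the serving pod, so the vLLM deployment can mount the same PVC read-only and start without re-downloading the weights.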
DeepCamp AI