Deploying Mixtral on GKE with just 2 x 24 GB L4 GPUs

Samos123 · Beginner · 🧠 Large Language Models · 2y ago
Lingo, an open source ML proxy and autoscaler for Kubernetes: https://github.com/substratusai/lingo

Blog post with copy-pasteable instructions: https://www.substratus.ai/blog/deploying-mixtral-gptq-on-gke-l4-gpus

Learn how to deploy Mixtral on GKE using just 2 x 24 GB L4 GPUs. We do this with GPTQ, which loads Mixtral in 4-bit mode.
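The memory math from the first chapter can be sketched as a quick back-of-the-envelope calculation. Mixtral 8x7B's ~46.7B total parameter count is published by Mistral; the rest (overhead for KV cache and activations) is an assumption, not a measurement:

```python
# Rough estimate of GPU memory needed for Mixtral-8x7B quantized with GPTQ.
# The ~46.7B parameter count is Mixtral's published total across all experts;
# the headroom reasoning is a back-of-the-envelope assumption.

def gptq_weight_memory_gib(n_params: float, bits: int = 4) -> float:
    """Memory for the model weights alone, in GiB, at the given quantization width."""
    bytes_total = n_params * bits / 8
    return bytes_total / 1024**3

MIXTRAL_PARAMS = 46.7e9  # total parameters, all 8 experts combined

weights_gib = gptq_weight_memory_gib(MIXTRAL_PARAMS, bits=4)
print(f"4-bit weights: ~{weights_gib:.1f} GiB")  # ~21.7 GiB

# Two 24 GB L4 GPUs give 48 GB total, leaving room for KV cache and activations.
total_l4_gb = 2 * 24
print(f"Headroom on 2 x L4: ~{total_l4_gb - weights_gib:.1f} GB")
```

At 16-bit precision the same weights would need ~87 GiB, which is why quantization is what makes the 2 x L4 setup possible.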

Chapters (7)

0:00 Introduction
0:12 Calculating GPU memory required for Mixtral with GPTQ
1:40 High-level overview of the steps to deploy Mixtral on GKE
2:20 Create GKE cluster with L4 GPU nodepool
3:35 Download the Mixtral model weights to PVC using K8s job
5:45 Deploy Mixtral using the Helm vLLM chart
9:19 Validate Mixtral is up and running by sending a prompt
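The deployment steps in the chapters above can be sketched as shell commands. This is a hypothetical outline, not a transcript of the video: the cluster name, manifest filenames, chart reference, and service name are placeholders, and the exact flags used on screen may differ (the blog post linked above has the copy-pasteable versions).

```shell
# 1. Create a GKE cluster with an L4 GPU node pool.
#    g2-standard-24 is a GCP machine type that carries 2 x NVIDIA L4.
gcloud container clusters create mixtral-demo --region us-central1
gcloud container node-pools create l4-pool \
  --cluster mixtral-demo --region us-central1 \
  --machine-type g2-standard-24 \
  --accelerator type=nvidia-l4,count=2 \
  --num-nodes 1

# 2. Download the GPTQ model weights onto a PersistentVolumeClaim with a
#    one-off Kubernetes Job (manifest name is a placeholder).
kubectl apply -f mixtral-download-job.yaml

# 3. Deploy Mixtral with a vLLM Helm chart pointed at that PVC
#    (chart reference and values are placeholders).
helm install mixtral vllm-chart/ --set model.pvc=mixtral-weights

# 4. Validate by sending a prompt to vLLM's OpenAI-compatible endpoint.
kubectl port-forward svc/mixtral 8080:80 &
curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "mixtral", "prompt": "Hello", "max_tokens": 16}'
```

Keeping the weights on a PVC means pods can restart or scale without re-downloading the ~20+ GB checkpoint each time.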