Deploying Mixtral on GKE with just 2 x 24 GB L4 GPUs

Samos123 · Beginner · 🧠 Large Language Models · 2y ago
Lingo, an open source ML proxy and autoscaler for Kubernetes: https://github.com/substratusai/lingo

Blog post with copy-pasteable instructions: https://www.substratus.ai/blog/deploying-mixtral-gptq-on-gke-l4-gpus

Learn how to deploy Mixtral on GKE using just 2 x 24 GB L4 GPUs. We do this with GPTQ, which loads Mixtral in 4-bit mode.
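The memory math from the first chapter can be sketched as a quick back-of-the-envelope calculation. Mixtral 8x7B's ~46.7B total parameter count is published by Mistral; the rest (overhead for KV cache and activations) is an assumption, not a measurement:

```python
# Rough estimate of GPU memory needed for Mixtral-8x7B quantized with GPTQ.
# The ~46.7B parameter count is Mixtral's published total across all experts;
# the headroom reasoning is a back-of-the-envelope assumption.

def gptq_weight_memory_gib(n_params: float, bits: int = 4) -> float:
    """Memory for the model weights alone, in GiB, at the given quantization width."""
    bytes_total = n_params * bits / 8
    return bytes_total / 1024**3

MIXTRAL_PARAMS = 46.7e9  # total parameters, all 8 experts combined

weights_gib = gptq_weight_memory_gib(MIXTRAL_PARAMS, bits=4)
print(f"4-bit weights: ~{weights_gib:.1f} GiB")  # ~21.7 GiB

# Two 24 GB L4 GPUs give 48 GB total, leaving room for KV cache and activations.
total_l4_gb = 2 * 24
print(f"Headroom on 2 x L4: ~{total_l4_gb - weights_gib:.1f} GB")
```

At 16-bit precision the same weights would need ~87 GiB, which is why quantization is what makes the 2 x L4 setup possible.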

Chapters (7)

0:00 Introduction
0:12 Calculating GPU memory required for Mixtral with GPTQ
1:40 High-level overview of the steps to deploy Mixtral on GKE
2:20 Create GKE cluster with L4 GPU nodepool
3:35 Download the Mixtral model weights to PVC using K8s job
5:45 Deploy Mixtral using the Helm vLLM chart
9:19 Validate Mixtral is up and running by sending a prompt
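The deployment steps in the chapters above can be sketched as shell commands. This is a hypothetical outline, not a transcript of the video: the cluster name, manifest filenames, chart reference, and service name are placeholders, and the exact flags used on screen may differ (the blog post linked above has the copy-pasteable versions).

```shell
# 1. Create a GKE cluster with an L4 GPU node pool.
#    g2-standard-24 is a GCP machine type that carries 2 x NVIDIA L4.
gcloud container clusters create mixtral-demo --region us-central1
gcloud container node-pools create l4-pool \
  --cluster mixtral-demo --region us-central1 \
  --machine-type g2-standard-24 \
  --accelerator type=nvidia-l4,count=2 \
  --num-nodes 1

# 2. Download the GPTQ model weights onto a PersistentVolumeClaim with a
#    one-off Kubernetes Job (manifest name is a placeholder).
kubectl apply -f mixtral-download-job.yaml

# 3. Deploy Mixtral with a vLLM Helm chart pointed at that PVC
#    (chart reference and values are placeholders).
helm install mixtral vllm-chart/ --set model.pvc=mixtral-weights

# 4. Validate by sending a prompt to vLLM's OpenAI-compatible endpoint.
kubectl port-forward svc/mixtral 8080:80 &
curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "mixtral", "prompt": "Hello", "max_tokens": 16}'
```

Keeping the weights on a PVC means pods can restart or scale without re-downloading the ~20+ GB checkpoint each time.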