Deploy open models with TGI on Cloud Run

Google Cloud Tech · Beginner · ☁️ DevOps & Cloud · 1y ago

Skills: LLM Foundations 80% · LLM Engineering 70%

Key Takeaways

The video demonstrates how to deploy the Gemma 2 model on Cloud Run using Hugging Face Text Generation Inference (TGI), relying on Cloud Run's GPU support for fast token generation and serverless scaling. The tutorial covers setting environment variables, creating a Cloud NAT router, deploying the service, and sending requests to the inference API.

Full Transcript

[Music] Hi, I'm Wietse from Google Cloud. And I'm Alvaro from Hugging Face. We'll show you how to take the Gemma 2 model weights and serve them on Cloud Run using Hugging Face TGI. Cloud Run recently added GPU support, which will make the inference go fast. Oh, that's right, and it's truly pay-for-use, because Cloud Run scales down to zero when there are no incoming requests.

So Alvaro, you wrote a tutorial on this, right? Yeah, that's right, about serving Gemma 2 quantized on Cloud Run. That's awesome, can you walk me through it? Yeah, for sure. First of all, you will need to start by setting a few environment variables. Wait, wait, hold on. I'll first go to the Google Cloud web console and navigate to Cloud Run. Okay, right there. And I'll start Cloud Shell so I have a terminal to work with. That's perfect. Now tell me what to do exactly. So now you can copy the environment variables, but you will need to substitute the project ID with your actual project ID. Okay, let's do that. Typing is hard. Okay, yes, I've got it.

Now what's the next bit? I'm already authenticated, I did enable Cloud Run, and now you want me to create a Cloud NAT router first. So let's do that. But why? In this case we are sending external traffic, to the Hugging Face Hub specifically, and we are creating the router in order to speed it up. Okay, good.

And then there is a very long gcloud command to deploy the service. That's right. Let's start that; it'll take a while. So while that's running, maybe you can walk me through the settings. I know Cloud Run, so I know what this does: gcloud beta run deploy, give it the service name and an image. So what container image are we deploying? In this case we are going to be using the Hugging Face Deep Learning Container for Text Generation Inference, which is uploaded within Google Cloud. Ah, okay. I made another video on the Hugging Face Deep Learning Containers; check out the link in the description.

Now this second argument specifies the arguments to the TGI container. It sets a model, so what's the hugging-quants thing exactly? In this case we are using a quantized version of Gemma 2, which is quantized from bfloat16 to INT4 using AWQ, and hugging-quants is basically the organization that we maintain where we upload these quantized models. Okay, and you reduce the precision of those model weights because you want to have faster performance, right? Exactly.

Okay, and you also set TGI's max concurrent requests to 64. So why 64? In order to determine this value we ran the text generation benchmark, which is a tool we have within TGI that finds the best trade-off between throughput and latency, and in this case we decided that 64 was the best value. Okay, so set it higher and you get better GPU utilization; set it lower and you get faster token speed. Yeah, right. Okay, good.

I also need to set an environment variable. What's HF_HUB_ENABLE_HF_TRANSFER? We have an internal tool called hf_transfer that speeds up the download process from the Hugging Face Hub. In this case we are setting it to 1 in order to speed it up. Okay, and this needs to be fast, right? Because every container instance that starts needs to download this model file, and it can be around 6 GB. Yeah, that's right.

Okay, so this I know. Port: set it to 8080, that's where TGI listens. Allocate 8 vCPUs and 32 GB of RAM. I allocate a GPU; you get one GPU per container instance. It's an NVIDIA L4 with 24 GB of VRAM, but you have many instances per service because of autoscaling. You cap the maximum number of instances at three and set concurrency to 64 again, because if Cloud Run's concurrency matches TGI's maximum concurrency, you won't get any queuing on the instances. So that's great. I set a region, us-central1 in this example, but it can be any region where Cloud Run GPU is supported. We set it to be a private endpoint, so you need a Cloud IAM identity token to invoke it, and that's good, because you don't want to have a public inference API that doesn't have authentication. And then finally, VPC egress all-traffic: this is how to send all the traffic from the instance to the VPC and then out through the Cloud NAT router, because you want to get good throughput to the Hugging Face Hub. That's right.

Okay, well, let's check back on that deploy. It took about 10 minutes. So how do you suggest I send a request to it now? In Cloud Run you will have multiple options, but in this case why don't you just use the Cloud Run developer proxy to expose the TGI server to localhost and then just send requests to it? Oh, that's an awesome idea, let's do that. So I'll find the command that starts the local developer proxy and paste that into my Cloud Shell instance. Okay, so this starts the proxy on localhost:8080, so now I can open a second tab in my Cloud Shell and send an inference request. This command sends a prompt to the inference API, and there you go, I now know what deep learning is. That's awesome. Thanks, Gemma!

Now keep in mind that when you're scaling from zero, the first request is a bit slower because it has to start the container instance and download the model, but after that it will be faster, right? Yeah, exactly, because the server starts listening on port 8080 in this case once the model has been loaded and after the warmup.

Okay, and in your tutorial you also show how to send requests from a Python app, right? Yeah, exactly. Since we are serving Text Generation Inference, we have the huggingface_hub Python SDK, which has an API that's compatible with Text Generation Inference, meaning that you can also use that client to send requests to TGI programmatically via Python. Awesome.

Okay, to wrap up: we deployed Gemma 2, an open large language model, to Cloud Run with serverless GPUs using Hugging Face TGI. You can find the link to the tutorial that we used in the description. Thanks for watching! Thank you. [Music]
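
As a rough illustration of the request step discussed in the transcript, the sketch below builds an HTTP request for TGI's `/generate` endpoint using only the Python standard library. It is not the client shown in the tutorial: the speakers mention the huggingface_hub SDK, and plain `urllib` is swapped in here only to keep the example dependency-free. The base URL assumes the Cloud Run developer proxy is forwarding the service to localhost on port 8080, and the function names and the `max_new_tokens` default are hypothetical.

```python
import json
import urllib.request

# Assumption: the Cloud Run developer proxy exposes the TGI service here.
BASE_URL = "http://localhost:8080"


def build_generate_request(prompt: str, max_new_tokens: int = 128,
                           base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a POST request for TGI's /generate endpoint."""
    payload = {
        "inputs": prompt,
        "parameters": {"max_new_tokens": max_new_tokens},
    }
    return urllib.request.Request(
        url=f"{base_url}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt and return the generated text.

    Requires the proxy to be running. The first call after a scale-to-zero
    period can be slow, so the timeout is deliberately generous.
    """
    request = build_generate_request(prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["generated_text"]


# Inspect the request locally; nothing is sent over the network here.
example = build_generate_request("What is deep learning?")
print(example.get_method(), example.full_url)
print(example.data.decode("utf-8"))
```

Once the proxy is running in another terminal tab, calling `generate("What is deep learning?")` would send the prompt. Note that `/generate` passes the text to the model as-is, so an instruction-tuned model may respond better through TGI's chat-style endpoint or the huggingface_hub client mentioned by the speakers.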

Original Description

Tutorial: How to deploy Gemma 2 on Cloud Run with TGI → https://goo.gle/3Yoztjh
Get started with Cloud Run GPU → https://goo.gle/4ec7mJS
Docs: Text Generation Inference → https://goo.gle/4e7qusz

Start serving text generation inference with fast token speed and serve requests for a fraction of the cost of traditional methods. Watch along and learn how to deploy the Gemma 2 model to Cloud Run using Hugging Face TGI with Wietse Venema (Google) and Alvaro Bartolome (Hugging Face).

More resources:
Gemma 2 (9b) on the Hugging Face Hub → https://goo.gle/3C1vX6R
Hugging Face Deep Learning Containers for Google Cloud → https://goo.gle/3BPaYUM

Watch more Google Cloud: Building with Hugging Face → https://goo.gle/BuildWithHuggingFace
Subscribe to Google Cloud Tech → https://goo.gle/GoogleCloudTech

Speakers: Wietse Venema, Alvaro Bartolome
Products Mentioned: Gemma, Hugging Face, Cloud Run

#GoogleCloud #HuggingFace



This video teaches how to deploy LLMs on Cloud Run using Hugging Face TGI, leveraging serverless GPUs for fast inference. The tutorial covers setting up the environment, deploying the service, and sending requests to the inference API.

Key Takeaways

Set environment variables
Create a Cloud NAT router
Deploy the service using gcloud command
Send requests to the inference API using Cloud Run developer proxy
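
The steps above can be sketched as a small Python script that assembles the corresponding gcloud commands and prints them for review, without running anything. This is an illustration under stated assumptions, not the tutorial's exact commands: the project ID, service name, router and NAT names, and the container image URI are placeholders, the model ID is an assumed repository under the hugging-quants organization, and the `--no-cpu-throttling`, `--network`, and `--subnet` flags are not mentioned in the video. The values for port, CPU, memory, GPU type, instance cap, and concurrency follow the transcript.

```python
import shlex

# Step 1: environment-style settings (placeholders, substitute your own).
PROJECT_ID = "your-project-id"
LOCATION = "us-central1"                            # region used in the video
SERVICE_NAME = "gemma2-tgi"                         # hypothetical name
ROUTER_NAME = "nat-router"                          # hypothetical name
NAT_NAME = "nat-config"                             # hypothetical name
CONTAINER_URI = "REPLACE_WITH_TGI_DLC_IMAGE_URI"    # placeholder, see the tutorial
MODEL_ID = "hugging-quants/gemma-2-9b-it-AWQ-INT4"  # assumed model repository


def router_commands() -> list[list[str]]:
    """Step 2: create a Cloud Router and a Cloud NAT configuration on it."""
    return [
        ["gcloud", "compute", "routers", "create", ROUTER_NAME,
         "--network=default", f"--region={LOCATION}"],
        ["gcloud", "compute", "routers", "nats", "create", NAT_NAME,
         f"--router={ROUTER_NAME}", f"--region={LOCATION}",
         "--nat-all-subnet-ip-ranges", "--auto-allocate-nat-external-ips"],
    ]


def deploy_command(max_concurrency: int = 64) -> list[str]:
    """Step 3: deploy TGI with one NVIDIA L4 GPU per instance."""
    tgi_args = ",".join([
        f"--model-id={MODEL_ID}",
        "--quantize=awq",
        # Kept equal to Cloud Run's concurrency so requests do not queue.
        f"--max-concurrent-requests={max_concurrency}",
    ])
    return [
        "gcloud", "beta", "run", "deploy", SERVICE_NAME,
        f"--image={CONTAINER_URI}",
        f"--args={tgi_args}",
        "--set-env-vars=HF_HUB_ENABLE_HF_TRANSFER=1",
        "--port=8080",
        "--cpu=8",
        "--memory=32Gi",
        "--no-cpu-throttling",          # assumption: not mentioned in the video
        "--gpu=1",
        "--gpu-type=nvidia-l4",
        "--max-instances=3",
        f"--concurrency={max_concurrency}",
        f"--region={LOCATION}",
        "--no-allow-unauthenticated",   # private endpoint, needs an identity token
        "--network=default",            # assumption: default VPC network
        "--subnet=default",             # assumption: default subnet
        "--vpc-egress=all-traffic",     # route outbound traffic via the VPC and NAT
        f"--project={PROJECT_ID}",
    ]


def proxy_command() -> list[str]:
    """Step 4: expose the private service on localhost:8080 for testing."""
    return ["gcloud", "run", "services", "proxy", SERVICE_NAME,
            f"--region={LOCATION}", "--port=8080"]


# Print the commands so they can be reviewed before being run by hand.
for command in [*router_commands(), deploy_command(), proxy_command()]:
    print(shlex.join(command))
```

Building the commands as lists and printing them with `shlex.join` keeps the quoting explicit and leaves the decision to run them with the reader. Passing one `max_concurrency` value to both settings mirrors the point made in the video that Cloud Run's concurrency should match TGI's maximum concurrent requests.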

💡 Serverless GPUs on Cloud Run speed up token generation for LLM inference, and because the service scales down to zero when there are no incoming requests, you pay only for what you use.
