Deploy open models with TGI on Cloud Run

Google Cloud Tech · Beginner ·☁️ DevOps & Cloud ·1y ago

Key Takeaways

The video demonstrates how to deploy the Gemma 2 model on Cloud Run using Hugging Face TGI (Text Generation Inference), leveraging Cloud Run's GPU support for fast token speed and serverless scaling. The tutorial covers setting environment variables, creating a Cloud NAT router, deploying the service, and sending requests to the inference API.

Full Transcript

[Music]

Hi, I'm Wietse from Google Cloud.

And I'm Alvaro from Hugging Face.

We'll show you how to take the Gemma 2 model weights and serve them on Cloud Run using Hugging Face TGI. Cloud Run recently added GPUs, which will make the inference go fast. Oh, that's right, and it's truly pay-for-use because Cloud Run scales down to zero when there are no incoming requests.

So Alvaro, you wrote a tutorial on this, right?

Yeah, that's right, about serving Gemma 2 quantized on Cloud Run.

That's awesome. Can you walk me through it?

Yeah, for sure. First of all, you will need to start by setting a few environment variables.

Wait, wait, hold on. I'll first go to the Google Cloud web console and navigate to Cloud Run. Okay, right there. And I'll start Cloud Shell so I have a terminal to work with.

That's perfect.

Now tell me what to do exactly.

Now you can copy the environment variables, but you will need to substitute the project ID with your actual project ID.

Okay, let's do that. Typing is hard. Okay, yes, I've got it. Now what's the next bit? I'm already authenticated, I did enable Cloud Run, and now you first want me to create a Cloud NAT router. So let's do that. But why?

In this case we are sending external traffic, to the Hugging Face Hub specifically, and in order to speed it up we are creating the router.

Okay, good. And then there is a very long gcloud command to deploy the service.

That's right.

Let's start that. It'll take a while, so while that's running, maybe you can walk me through the settings. I know Cloud Run, I know what this does, right: gcloud beta run deploy, give it the service name and an image. So what container image are we deploying?

In this case we are going to be using the Hugging Face Deep Learning Container for Text Generation Inference, which is uploaded within Google Cloud.

Ah, okay. I made another video on the Hugging Face Deep Learning Containers; check out the link in the description. Now this second argument specifies the arguments to the TGI container. It sets a model. So what's the hugging-quants thing exactly?

In this case we are using a quantized version of Gemma 2, which is quantized from bfloat16 to INT4 using AWQ, and hugging-quants is basically the organization that we maintain where we upload these quantized models.

Okay, and you reduce the precision of those model weights because you want to have faster performance, right?

Exactly.

Okay, and you also set max concurrent requests of TGI to 64. So why 64?

In order to determine this value we ran the text generation benchmark, which is a tool we have within TGI that finds the best trade-off between throughput and latency, and in this case we decided that 64 was the best value.

Okay, so set it higher and you get better GPU utilization; set it lower and you get faster token speed.

Yeah, right.

Okay, good. I also need to set an environment variable. What's HF_HUB_ENABLE_HF_TRANSFER?

We have an internal tool called hf_transfer that speeds up the download process from the Hugging Face Hub. In this case we are setting it to 1 in order to speed it up.

Okay, and this needs to be fast, right, because every container instance that starts needs to download this model file, and it can be around 6 GB.

Yeah, that's right.

Okay. So this I know: port, set it to 8080, that's where TGI listens. Allocate 8 vCPUs and 32 GB of RAM. I allocate a GPU; you get one GPU per container instance. It's an NVIDIA L4 with 24 GB of VRAM, but you have many instances per service because of autoscaling. You cap the maximum number of instances to three and set concurrency to 64 again, because if Cloud Run's concurrency matches TGI's maximum concurrency, you won't get any queuing on the instances, so that's great. I set a region, us-central1 in this example, but it can be any region where Cloud Run GPU is supported. We set it to be a private endpoint, so you need Cloud IAM authentication, an identity token, to invoke the thing, and that's good, because you don't want to have a public inference API that doesn't have authentication. And then finally, VPC egress all-traffic: this is how to send all the traffic from the instance to the VPC and then out through the Cloud NAT router, because you want to get good throughput to the Hugging Face Hub.

That's right.

Okay, well, let's check back on that deploy. It took about 10 minutes. So how do you suggest I send a request to it now?

In Cloud Run you have multiple options, but in this case why don't you just use the Cloud Run developer proxy to expose the TGI server to localhost and then just send requests?

Oh, that's an awesome idea, let's do that. I'll find the command that starts the local developer proxy and paste that into my Cloud Shell instance. Okay, so this starts the proxy on localhost:8080, so now I can open a second tab in my Cloud Shell and send an inference request. This command sends a prompt to the inference API, and there you go: I now know what deep learning is. That's awesome, thanks, Gemma. Now keep in mind that when you're scaling from zero, the first request is a bit slower because it has to start the container instance and download the model, but after that it will be faster, right?

Yeah, exactly, because the server starts listening on port 8080 in this case once the model has been loaded and after the warmup.

Okay, and in your tutorial you also show how to send requests from a Python app, right?

Yeah, exactly. Since we are serving Text Generation Inference, we have the huggingface_hub Python SDK, which has an API that's compatible with Text Generation Inference, meaning that you can use that client to send requests to TGI programmatically via Python.

Awesome. Okay, to wrap up: we deployed Gemma 2, an open large language model, to Cloud Run with serverless GPUs using Hugging Face TGI. You can find the link to the tutorial that we used in the description. Thanks for watching.

Thank you.

[Music]
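
The inference request sent through the developer proxy can be sketched in Python. The video does not show the exact command, so this is a minimal sketch under assumptions: it targets TGI's `/generate` route with only the standard library, it assumes the proxy is listening on localhost:8080 as described in the transcript, and the function names `build_generate_request` and `generate` are hypothetical helpers, not part of any SDK.

```python
import json
import urllib.request

# Assumption: the Cloud Run developer proxy from the video is listening here.
BASE_URL = "http://localhost:8080"


def build_generate_request(prompt, max_new_tokens=128, base_url=BASE_URL):
    """Build (but do not send) a POST request for TGI's /generate route."""
    payload = {
        "inputs": prompt,
        "parameters": {"max_new_tokens": max_new_tokens},
    }
    return urllib.request.Request(
        url=f"{base_url}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, timeout=300, **kwargs):
    """Send the prompt and return the generated text from the JSON response."""
    request = build_generate_request(prompt, **kwargs)
    # A generous timeout helps when the service is scaling from zero and the
    # first request has to wait for the container start and model download.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["generated_text"]
```

With the proxy running, calling `generate("What is deep learning?")` would return the model's answer as a string. For application code, the huggingface_hub client mentioned in the transcript is likely the more convenient choice.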

Original Description

Tutorial: How to deploy Gemma 2 on Cloud Run with TGI → https://goo.gle/3Yoztjh
Get started with Cloud Run GPU → https://goo.gle/4ec7mJS
Docs: Text Generation Inference → https://goo.gle/4e7qusz

Start serving text generation inference with fast token speed and serve requests for a fraction of the cost of traditional methods. Watch along and learn how to deploy the Gemma 2 model to Cloud Run using Hugging Face TGI with Wietse Venema (Google) and Alvaro Bartolome (Hugging Face).

More resources:
Gemma 2 (9b) on the Hugging Face Hub → https://goo.gle/3C1vX6R
Hugging Face Deep Learning Containers for Google Cloud → https://goo.gle/3BPaYUM

Watch more Google Cloud: Building with Hugging Face → https://goo.gle/BuildWithHuggingFace
Subscribe to Google Cloud Tech → https://goo.gle/GoogleCloudTech

Speakers: Wietse Venema, Alvaro Bartolome
Products Mentioned: Gemma, Hugging Face, Cloud Run
#GoogleCloud #HuggingFace

This video teaches how to deploy LLMs on Cloud Run using Hugging Face TGI, leveraging serverless GPUs for fast inference. The tutorial covers setting up the environment, deploying the service, and sending requests to the inference API.

Key Takeaways
  1. Set environment variables
  2. Create a Cloud NAT router
  3. Deploy the service using the gcloud command
  4. Send requests to the inference API using the Cloud Run developer proxy
💡 Using serverless GPUs on Cloud Run can significantly increase token speed and reduce costs for LLM inference.
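
The deploy step can be sketched as a small Python helper that assembles the gcloud arguments named in the video. This is a sketch under assumptions, not the tutorial's exact command: it includes only the settings mentioned in the transcript (the linked tutorial's full command may carry additional flags, for example for networking), and the service name, image URI, and model name below are placeholders to be replaced with the values from the tutorial. `build_deploy_command` is a hypothetical helper.

```python
import shlex


def build_deploy_command(service_name, image, model_id, region="us-central1"):
    """Return the argv list for a deploy using the settings from the video."""
    # Arguments passed through to the TGI container itself.
    tgi_args = f"--model-id={model_id},--max-concurrent-requests=64"
    return [
        "gcloud", "beta", "run", "deploy", service_name,
        f"--image={image}",
        f"--args={tgi_args}",
        "--set-env-vars=HF_HUB_ENABLE_HF_TRANSFER=1",  # faster model download
        "--port=8080",                 # port TGI listens on
        "--cpu=8",
        "--memory=32Gi",
        "--gpu=1",                     # one GPU per container instance
        "--gpu-type=nvidia-l4",
        "--max-instances=3",
        "--concurrency=64",            # matches TGI's max concurrent requests
        f"--region={region}",
        "--no-allow-unauthenticated",  # private endpoint, IAM required
        "--vpc-egress=all-traffic",    # route egress through the VPC
    ]


command = build_deploy_command(
    service_name="tgi-gemma-2",             # hypothetical service name
    image="IMAGE_URI_FROM_TUTORIAL",        # placeholder, see the tutorial
    model_id="hugging-quants/MODEL_NAME",   # placeholder model name
)
print(shlex.join(command))
```

Keeping the Cloud Run concurrency equal to TGI's maximum concurrent requests is the design choice explained in the video: with a cap of three instances and 64 requests each, the service handles at most 3 × 64 = 192 requests concurrently.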
