Deploy open models with TGI on Cloud Run
Key Takeaways
The video demonstrates how to deploy the Gemma 2 model on Cloud Run using Hugging Face Text Generation Inference (TGI), using Cloud Run's GPU support for fast token generation and serverless scaling, including scale to zero. The tutorial covers setting environment variables, creating a Cloud NAT router, deploying the service, and sending requests to the inference API.
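The deployment settings named in the video can be collected into a single command line. The following is a minimal sketch in Python that only assembles the argument list; it is not the tutorial's exact command. The function name `build_deploy_command` and its parameters are hypothetical, the service name, image URI, and model ID are placeholders to be taken from the linked tutorial, and flags the video does not discuss (for example the VPC network and subnet settings) are left out, so the result may need additions before it deploys successfully.

```python
def build_deploy_command(service, image, model_id, region="us-central1"):
    """Assemble a gcloud argv from the settings mentioned in the video.

    service, image and model_id are placeholders supplied by the caller;
    this helper is illustrative and does not run gcloud itself.
    """
    # Arguments passed through to the TGI container.
    tgi_args = [
        f"--model-id={model_id}",
        "--max-concurrent-requests=64",  # value chosen via the TGI benchmark
    ]
    return [
        "gcloud", "beta", "run", "deploy", service,
        f"--image={image}",
        f"--args={','.join(tgi_args)}",
        "--set-env-vars=HF_HUB_ENABLE_HF_TRANSFER=1",  # faster model download
        "--port=8080",                 # port TGI listens on
        "--cpu=8",
        "--memory=32Gi",
        "--gpu=1",                     # one GPU per container instance
        "--gpu-type=nvidia-l4",
        "--max-instances=3",
        "--concurrency=64",            # matches TGI's max concurrent requests
        f"--region={region}",
        "--no-allow-unauthenticated",  # private endpoint
        "--vpc-egress=all-traffic",    # route outbound traffic via the VPC
    ]
```

Joining the returned list with `shlex.join` gives a string that can be reviewed before running it in Cloud Shell.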
Full Transcript
[Music]
Wietse: Hi, I'm Wietse from Google Cloud.
Alvaro: And I'm Alvaro from Hugging Face.
Wietse: We'll show you how to take the Gemma 2 model weights and serve them on Cloud Run using Hugging Face TGI.
Alvaro: Cloud Run recently included GPUs, which will make the inference go fast.
Wietse: Oh, that's right, and it's truly pay-per-use, because Cloud Run scales down to zero when there are no incoming requests. So Alvaro, you wrote a tutorial on this, right?
Alvaro: Yeah, that's right, about serving Gemma 2 quantized on Cloud Run.
Wietse: That's awesome. Can you walk me through it?
Alvaro: Yeah, for sure. So first of all, you will need to start by setting a few environment variables.
Wietse: Wait, wait, wait, hold on. I'll first go to the Google Cloud web console and then navigate to Cloud Run. Okay, right there. And I'll start Cloud Shell so I have a terminal to work with. That's perfect. Now tell me what to do exactly.
Alvaro: So now you can already copy the environment variables, but you will need to substitute the project ID with your actual project ID.
Wietse: Okay, let's do that. Typing is hard. Okay, yes, I've got it. Now, what's the next bit? I'm already authenticated, I did enable Cloud Run, and now first you want me to create a Cloud NAT router. So let's do that. But why?
Alvaro: So in this case we are sending external traffic, to the Hugging Face Hub specifically, and in order to speed it up we are creating the router.
Wietse: Okay, good. And then there is a very long gcloud command to deploy the service.
Alvaro: That's right.
Wietse: Let's start that. It'll take a while, so while that's running, maybe you can walk me through the settings. I know Cloud Run, I know what this does: gcloud beta run deploy, give it the service name and an image. So what container image are we deploying?
Alvaro: In this case we are going to be using the Hugging Face Deep Learning Container for Text Generation Inference, which is uploaded within Google Cloud.
Wietse: Ah, okay. I made another video on the Hugging Face Deep Learning Containers; check out the link in the description. Now, this second argument specifies the arguments to the TGI container. It sets a model. So what's the hugging-quants thing exactly?
Alvaro: So in this case we are using a quantized version of Gemma 2, which is quantized from bfloat16 to INT4 using AWQ, and hugging-quants is basically the organization that we maintain, where we upload these quantized models.
Wietse: Okay. And you reduce the precision of those model weights because you want to have faster performance, right?
Alvaro: Exactly.
Wietse: Okay, and you also set the max concurrent requests of TGI to 64. So why 64?
Alvaro: So in order to determine this value, we basically ran the text generation benchmark, which is a tool we have within TGI that basically finds the best trade-off between throughput and latency, and in this case we decided that 64 was the best value for that.
Wietse: Okay, so set it higher and you get better GPU utilization; set it lower and you get faster token speed.
Alvaro: Yeah, right.
Wietse: Okay, good. So I also need to set an environment variable. What's HF_HUB_ENABLE_HF_TRANSFER?
Alvaro: So we have an internal tool called hf_transfer that basically speeds up the download process from the Hugging Face Hub. In this case we are setting it to 1 in order to speed it up.
Wietse: Okay, and this needs to be fast, right, because every container instance that starts needs to download this model file, and it can be around 6 GB.
Alvaro: Yeah, that's right.
Wietse: Okay, so this I know. Port: set it to 8080, that's where TGI listens. Allocate 8 vCPUs and 32 gigabytes of RAM. I allocate a GPU; you get one GPU per container instance. It's an NVIDIA L4 with 24 GB of VRAM. But you have many instances per service because of autoscaling. You cap the maximum number of instances to three and set concurrency to 64 again, because if Cloud Run's concurrency matches TGI's maximum concurrency, you won't get any queuing on the instances, so that's great. I set a region, us-central1 in this example, but it can be any region where Cloud Run GPU is supported. We set it to be a private endpoint, so you need a Cloud IAM identity token to invoke the thing, and that's good, because you don't want to have a public inference API that doesn't have authentication. And then finally, VPC egress all-traffic: this is how to send all the traffic from the instance to the VPC and then out through the Cloud NAT router, because you want to get good throughput to the Hugging Face Hub.
Alvaro: That's right.
Wietse: Okay, well, let's check back on that deploy. It took about 10 minutes. So how do you suggest I send a request to it now?
Alvaro: So in Cloud Run you will have multiple options, but in this case why don't you just use the Cloud Run developer proxy to expose the TGI server to localhost and then just send requests.
Wietse: Oh, that's an awesome idea, let's do that. So I'll find the command that starts the local developer proxy and paste that into my Cloud Shell instance. Then I'll start a second tab. Okay, so this starts the proxy on localhost:8080, so now I can open a second tab in my Cloud Shell and send an inference request. So this command sends a prompt to the inference API, and there you go: I now know what deep learning is. That's awesome. Thanks, Gemma. Now keep in mind that when you're scaling from zero, the first request is a bit slower because it has to start the container instance and download the model, but after that it will be faster, right?
Alvaro: Yeah, exactly, because the server starts listening on port 8080 in this case once the model has been loaded and after the warm-up.
Wietse: Okay. And in your tutorial you also show how to send requests from a Python app, right?
Alvaro: Yeah, exactly. So since we are serving Text Generation Inference, we have the huggingface_hub Python SDK, which basically has an API that's compatible with Text Generation Inference, meaning that you can use that client also to send requests to TGI programmatically via Python.
Wietse: Awesome. Okay, to wrap up: we deployed Gemma 2, an open large language model, to Cloud Run with serverless GPUs using Hugging Face TGI. You can find the link to the tutorial that we used in the description. Thanks for watching.
Alvaro: Thank you.
[Music]
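The request step from the transcript (send a prompt to TGI through the developer proxy on localhost port 8080) can be sketched in Python as follows. This sketch uses only the standard library and TGI's `/generate` endpoint rather than the `huggingface_hub` client mentioned in the video; the function names, the default `max_new_tokens`, and the timeout are illustrative assumptions, not values from the tutorial.

```python
import json
import urllib.request


def build_generate_request(prompt, base_url="http://localhost:8080", max_new_tokens=128):
    """Build (but do not send) a POST request for TGI's /generate endpoint."""
    payload = {
        "inputs": prompt,
        "parameters": {"max_new_tokens": max_new_tokens},
    }
    return urllib.request.Request(
        url=f"{base_url}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, **kwargs):
    """Send the prompt and return the generated text.

    Assumes the Cloud Run developer proxy is already running locally.
    The long timeout allows for a cold start, when the instance has to
    start and download the model before it can answer.
    """
    request = build_generate_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=600) as response:
        return json.loads(response.read())["generated_text"]
```

With the proxy running in another Cloud Shell tab, calling `generate("What is deep learning?")` would send the same kind of prompt as in the video. Because the proxy handles authentication, no identity token is attached here; a direct call to the private service URL would need one.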
Original Description
Tutorial: How to deploy Gemma 2 on Cloud Run with TGI → https://goo.gle/3Yoztjh
Get started with Cloud Run GPU → https://goo.gle/4ec7mJS
Docs: Text Generation Inference → https://goo.gle/4e7qusz
Start serving text generation inference with fast token speed and serve requests for a fraction of the cost of traditional methods. Watch along and learn how to deploy the Gemma 2 model to Cloud Run using Hugging Face TGI with Wietse Venema (Google) and Alvaro Bartolome (Hugging Face).
More resources:
Gemma 2 (9b) on the Hugging Face Hub → https://goo.gle/3C1vX6R
Hugging Face Deep Learning Containers for Google Cloud → https://goo.gle/3BPaYUM
Watch more Google Cloud: Building with Hugging Face → https://goo.gle/BuildWithHuggingFace
Subscribe to Google Cloud Tech → https://goo.gle/GoogleCloudTech
Speakers: Wietse Venema, Alvaro Bartolome
Products Mentioned: Gemma, Hugging Face, Cloud Run
#GoogleCloud #HuggingFace