FLAN-T5-XXL on NVIDIA A100 GPU w/ HF Inference Endpoints, let's explore 11b models!
Key Takeaways
The video demonstrates the deployment of the 11 billion parameter FLAN-T5-XXL model on an NVIDIA A100 GPU using Hugging Face Inference Endpoints, and explores the features and pricing of this cloud-based inference solution. It also showcases the ease of use and flexibility of Hugging Face Inference Endpoints for large-scale model deployment.
Full Transcript
Hello community, we have a new update on Flan-T5, and this time we have a look at the XXL version. And yes, of course, we are going to run it on an NVIDIA A100. But you know, the main question is, and a lot of my viewers ask me: hey, what if I have a model that is too big to be deployed on the classical free Google Colab notebook, where we normally have an NVIDIA T4 with about 16 gigabytes of GPU memory? I can tell you, if we go for example to the Google Flan-T5-XXL model, we need a GPU with about 80 gigabytes of memory, and we will use mixed precision or some sharding to fit the whole model on a single GPU.

Here I am on Hugging Face, I put in Flan-T5-XXL, and here we have our Google Flan-T5-XXL. You can spin it up here, no problem. Going further down you have the languages it has been trained on, and of course the Flan-T5 checkpoints and the original Flan-T5 checkpoints. If you want to read about FP16 and INT8, everything is here. But what I wanted to show you is this, maybe I increase it a little bit: the different models. Last time I showed you the three billion parameter model, the T5-XL, and now we go with the 11 billion parameter model, the T5-XXL, in particular the Flan-T5-XXL. You have here the different benchmarks, 57 tasks, 23 tasks, so a compendium of different benchmarks, and you see how the models compare, Flan-T5 and of course PaLM. Going from three billion parameters to 11 billion parameters, you do have some increase in performance, but the real mover, if you look here, is the 540 billion parameter model like PaLM, or some derivative of it, where you really go up in the performance indices. So, just to give you an order of magnitude of where we are operating: we are here on a T5-XXL model, before we switch to PaLM. For Flan-T5-XXL, I
noticed there is a new one by Phil Schmid, so I had a look at this fork. You can deploy Flan-T5-XXL with one click, "we are using a quantized version", okay, "create a new endpoint". I have no idea what a Hugging Face endpoint is, so if you want, we find out together how we can run this model on Hugging Face endpoints. I got interested in this picture: you have a model repository from the Hugging Face Hub, that's beautiful, then you choose a cloud provider, AWS, Azure or Google Cloud, and you have your regional hub, and then you choose. I said okay, I'll have a look at this. Again, languages, checkpoints, everything is there for you. But I said, one click, so let's have a look together.

Hugging Face Inference Endpoints: "Welcome to Inference Endpoints. You can easily deploy your models on dedicated, fully managed infrastructure. Keep your costs low with a secure, compliant and flexible production solution." "Add your credit card." Yippee, add my credit card, this is what I want to hear. So we have a look at the documentation and maybe get an idea of what we are talking about. Endpoints, two pricing plans, but what are endpoints? Let's go here: a secure production solution to easily deploy any Transformers, Sentence-Transformers and Diffusers model from the Hugging Face Hub on auto-scaling infrastructure managed by Hugging Face. Now this is nice. You know what this means? It means that whatever model we have here on the Hugging Face Hub, Sentence Transformers for example, we can use this repository and run it on some dedicated infrastructure. This is very nice. It supports all Transformers, Sentence-Transformers and Diffusers tasks, this is really nice. So, Stable Diffusion v1-5, updated four days ago, okay. By the way, I wanted to show you: the Flan-T5 large model has been
updated, and the Flan-T5 XL model was updated seven days ago. You see here Flan-T5 large from Google, 270,000 downloads, updated one day ago. But what I wanted to show you is that this one is actually brand new, if I can find it: Phil Schmid, here we go, philschmid/flan-t5-xxl-sharded-fp16, floating point 16, updated one day ago. This is what we are going to have a look at. Just eight downloads, so it is really brand new. And now we are going to explore the endpoints. I know I'm a little bit confusing, but never mind.

So what can we do? We have AWS, we have Azure and we have Google Cloud, beautiful. We have a user, then we create an endpoint. We have the Hugging Face repository with our Sentence Transformers, our Vision Transformers or, I don't know, diffusion models. We build a container, we initialize the compute infrastructure, we deploy our endpoint either on AWS, Azure or Google, and we have an up and running endpoint in the cloud, which would be great. Now this sounds interesting. Compliance: yes, everything is encrypted. Supported tasks, what can we do? Oh, that's nice, have a look at this: text-to-image, text classification with Transformers, zero-shot classification out of the box, question answering, summarization, translation, text-to-text generation, feature extraction, ranking, image classification, object detection. So, nice, everything more or less up and running.

Pricing, let's look at the pricing. We have CPU instances, here AWS and here Azure, and you see that with 16 gigabytes of RAM the hourly rate is about half a US dollar on AWS, and likewise a fraction of a dollar on Azure. Beautiful. GPUs: which GPUs do we have in the cloud? We have only AWS, this is interesting. Here we have our typical T4 with only 14 gigabytes at an hourly rate of $0.60, and then, right, this is nice, 44, okay, but we need a minimum of 80.
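As a rough sanity check on the memory figures mentioned so far, the sketch below estimates how much GPU memory the weights of an 11 billion parameter model occupy at different precisions. It is a back-of-the-envelope estimate under stated assumptions: it counts weights only (no activations, no framework overhead), uses 1 GB = 10^9 bytes, and the function names are illustrative, not part of any library.

```python
# Weights-only memory estimate for a model with a given parameter count.
# Assumptions: no activations or framework overhead, 1 GB = 1e9 bytes.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Return the approximate size of the weights in gigabytes."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


def fits(num_params: float, dtype: str, gpu_memory_gb: float) -> bool:
    """True if the weights alone fit into the given GPU memory."""
    return weight_memory_gb(num_params, dtype) <= gpu_memory_gb


if __name__ == "__main__":
    params = 11e9  # Flan-T5-XXL, roughly 11 billion parameters
    for dtype in BYTES_PER_PARAM:
        size = weight_memory_gb(params, dtype)
        print(f"{dtype}: {size:.0f} GB, fits on a 16 GB T4: {fits(params, dtype, 16)}")
```

Under these assumptions the half-precision weights alone exceed a 16 GB T4, which is consistent with the video's point that mixed precision, sharding or 8-bit quantization is needed on smaller GPUs. Real deployments need extra headroom beyond the weights.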
So we will go with the AWS x-large, one NVIDIA A100 with 80 gigabytes of memory, and it will cost us, what do they say, an hourly rate of $6.50. Pricing example, how is it calculated? Ah, replicas, okay. So here you can calculate the hourly cost and the monthly cost, if you want to run it for a company. Advanced endpoints, auto-scaling, okay, if you need auto-scaling from one to three machines, beautiful. Let's have a look at this: create your first endpoint. Here we go: enter the Hugging Face repository ID and your desired endpoint name. So if you insert your personal credit card, not mine, take your own credit card, thank you, you will end up, I suppose, where was I, here. You have an endpoint and then you define the model repository. Our model repository is of course, copy, philschmid/flan-t5-xxl-sharded-fp16, this is it. We input this here, and you can give the endpoint a name: "most beautiful T5-XXL endpoint ever". Beautiful.

What else? Let's make it a little bit bigger. We select a cloud provider and a region. Only AWS will be available as a cloud provider, with the US East and eu-west-1 regions. So on AWS we have, more or less, North Virginia, Ireland, Frankfurt, Oregon, Hong Kong. Beautiful. You define your security level: public, private, protected. Create your endpoint. By default your endpoint is created with a medium CPU. The cost estimate assumes the endpoint will be up for an entire month and does not take auto-scaling into account. Wow, does this mean we have to pay for an entire month? Or is this just an option, and I can go with hourly rates if I just want to have a job for 10 or 12 hours? Okay, so "create endpoint" tells me what it costs per month. Wow, if I have six dollars per hour for my GPU alone. So you wait for the endpoint to build, like we know from AWS and the others, one to five minutes. And then you test your endpoint in the overview
with the inference widget. And we have to get an address. Where is our address? Ah, here: send requests to endpoints. So to be able to send something to this endpoint that is up and running in the cloud, we need some address. Here: copy the URL, beautiful. Requests, yes, here we have the inputs and your authorization. Beautiful. Inference API, SageMaker, inference toolkit, okay. Update your endpoint: why would you? Okay, if I see I want to upgrade my machine, I have here CPU small, medium, large. Auto-scaling: you can update your auto-scaling configuration. Advanced setup, here we go: AWS instance, CPU medium or a small GPU, and we would go with the A100, GPU x-large. Task, framework, revision, image, yes. Create a private endpoint, create a custom inference handler, okay, I see what they mean.

Here I would like to switch to another presentation I found by Phil Schmid. I just looked it up on the internet and found a very nice article at www.philschmid.de/deploy-t5-11b. Look, this is a beautiful thing, there even is code: deploy an 11 billion parameter model for inference for less than five hundred dollars. Let's have a look at this. At first, open the link, we have here, oh, is this it, more or less? No, this is T5 sharded. Ah yes, here we have it, there you have the code. But the article, this is a very nice presentation, have a look: the tutorial covers how to prepare the model repository and the custom handler with additional dependencies, deploy the custom handler as an inference endpoint, and send an HTTP request using Python. This is what I want to show you. He explains what a Hugging Face Inference Endpoint is, you know this already. On a single NVIDIA T4 this will not work normally, the model is too big to deploy on an NVIDIA T4. To be able to fit the T5 11 billion model onto a single GPU, where is it, yes, exactly here, look: mixed precision
and sharding. You know this, I showed you. And here he introduces something I haven't shown you: LLM.int8(), a new quantization technique for INT8 matrix multiplication, which cuts the memory needed for inference by half. Wow, okay, great. It has a repository, yes, here. And if you want to convert the weights yourself to deploy Flan-T5-XXL, you need at least 80 gigabytes of memory. Yes, we know.

Look at this, this is not so complicated: torch, transformers, huggingface_hub, beautiful. Load as float16, you know this. This is our model, as I showed you 100 times before, the model with a language modeling head, from_pretrained, we have here the T5 11 billion model, torch dtype float16, like I showed you in my last video when we implemented the XL version on the free Colab notebook, low CPU memory usage true, okay, interesting. Shard the model and push to hub. Beautiful. After we have our sharded FP16 model, yes, but we have this already, we are just going to use his. You agree, I agree, we both agree. Beautiful.

So after we have our sharded weights, we can prepare the additional dependencies. We need LLM.int8(), and for that, yes, bitsandbytes, you know this, I showed you. We need accelerate, we need bitsandbytes. Then the documentation and the endpoint handler. Beautiful, look, this is everything that I have already shown you before. We have to pre-process, we have our tokenizer, this is so simple, with our input IDs. Then we have model.generate, this is our output function, with some parameters, or else without additional parameters. And then we have the post-process stage where we say, okay, decode, and then we return the generated text with skip special tokens. Exactly like I showed you in my other XL video. Beautiful. And then we deploy the custom handler as an inference endpoint. So here we go: we are going to use his model repository, and his model repository is, no, it is not this one, it is, come on, Flan-T5-XXL, yes, this one. We're
going to use this one: philschmid/flan-t5-xxl-sharded-fp16. This is what we are going to use, it was updated yesterday. So where was I? A little bit confused, but we manage this, don't you worry. This is our model repository. For the endpoint name we can use whatever you prefer. Open the advanced settings: he selects GPU small, we go with GPU x-large, because we changed the repository. If you are trying to deploy the model on CPU alone, the creation will fail. Yes, well, okay, I have an idea why. The Inference Endpoints service will check during the creation of your endpoint whether there is a handler.py available, and will use it for serving requests no matter which task you select. The deployment will take 20 to 40 minutes, okay, this is for the small model, so expect an hour to three hours. We can test it using the inference widget: test your endpoint. And here it is exactly: you have here the text-to-text generation task, this is "compute", and this is the answer generated by the system. Beautiful.

So what do we need to do? We now have something up and running in the cloud, and we just have to send an HTTP request, and we do this in Python. The easiest thing, you know, our old friend: pip install requests. My goodness, this is from when I was back in school. Import json, import requests, our endpoint URL, the inputs, "translate English to German", okay, if you want this task. You can define the maximum length, beautiful. You have here the authorization, and then you simply send your request. Here we go: r.post with the endpoint URL, the headers and the JSON payload, and the response gives you the generated text. Show me the generated text. How easy is this?

So this is more or less exactly what is running here on our Google Flan-T5-XXL page, the hosted inference API: you put in the text, click compute, and you get the response. Timed out. Maybe we should start it, model is loading. So if you have a concrete task, I would recommend you have a look here at the full Google Flan-T5-XXL and see what comes out. "Can be loaded on the Inference API on-demand." Oh gee, that's going to take some time.
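The HTTP request described in this part of the transcript could look roughly like the sketch below. It assumes the endpoint accepts a JSON body with an `inputs` field and an optional `parameters` field, and that it returns a list containing a `generated_text` entry. The endpoint URL, the token and the example prompt are hypothetical placeholders, and the helper names `build_request` and `query` are made up for this illustration. `requests` is imported lazily so that building the payload works without a network connection or the package installed.

```python
import json

# Hypothetical placeholders: replace with your own endpoint URL and access token.
ENDPOINT_URL = "https://your-endpoint.example.com"
HF_TOKEN = "hf_your_token_here"


def build_request(prompt: str, token: str, max_length=None):
    """Build the headers and JSON payload for a text-to-text generation request."""
    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }
    payload = {"inputs": prompt}
    if max_length is not None:
        # Generation parameters are passed through to the handler (assumed format).
        payload["parameters"] = {"max_length": max_length}
    return headers, payload


def query(endpoint_url: str, prompt: str, token: str, max_length=None, timeout=60):
    """Send the request and return the decoded JSON response."""
    import requests  # third-party: pip install requests

    headers, payload = build_request(prompt, token, max_length)
    response = requests.post(endpoint_url, headers=headers, json=payload, timeout=timeout)
    response.raise_for_status()
    return response.json()


if __name__ == "__main__":
    _, payload = build_request(
        "translate English to German: How old are you?", HF_TOKEN, max_length=64
    )
    print(json.dumps(payload, indent=2))
    # With a running endpoint and a valid token (not executed here):
    # result = query(ENDPOINT_URL, payload["inputs"], HF_TOKEN, max_length=64)
    # print(result[0]["generated_text"])
```

The timeout is worth setting explicitly, since a large model that is still loading can take a long time to answer, as the hosted widget in the video shows.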
I'm recording this in the evening, so heavy internet traffic, of course. So where were we? And that's it. Conclusion: this is how you deploy the Google, or now the Phil Schmid, Flan-T5-XXL model on, what's it called, Hugging Face, no, the other one, my goodness, where is it, here: Hugging Face Inference Endpoints, a fully managed infrastructure, let's say, currently with AWS and a little bit of Azure, currently without any Google integration, but I hope they will also integrate Google.

So there you have it: "successfully deployed the 11 billion parameter T5 model to Hugging Face Inference Endpoints for less than 500 dollars". Less than five hundred dollars, okay. He uses, as we saw, a T4, GPU small, NVIDIA Tesla T4, with auto-scaling up to two, so he means, I suppose, the monthly rate. Another line: "we deployed one of the biggest", yes, sign up and create something. Beautiful, interesting. So, something completely new. I discovered this together with you, I just had a glance over it before I started this video, but I thought, hey, why not have a look at this? So if we assume that we take a T4 that costs $0.60 per hour, I would say this is the monthly cost to deploy the T5 11 billion parameter model on a T4.

We have here our Google Flan-T5-XXL, now you know how, timed out again, you know how to run it on an A100 with Hugging Face Inference Endpoints. We saw that we have this flan-t5-xxl-sharded-fp16, which is great, and he tells us that we can get away here if we use a GPU medium, an NVIDIA A10G. And we know the pricing: the NVIDIA A10G, with 24 gigabytes, wow, I think the NVIDIA 4090 also has 24 gigabytes, if I'm right, I don't know. So we have an hourly rate of $1.30.
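The monthly cost reasoning in the transcript is simple arithmetic: hourly rate times replicas times hours. The sketch below uses the hourly rates as read from the video ($0.60 for the T4, $6.50 for the A100, and what appears to be $1.30 for the A10G). The figure of 730 hours per month is an assumption (an average month), and actual billing rules should be checked against the provider's pricing page.

```python
# Hourly rates as quoted in the video; they may no longer be current.
HOURLY_RATE_USD = {
    "GPU small (T4)": 0.60,
    "GPU medium (A10G)": 1.30,
    "GPU x-large (A100)": 6.50,
}

HOURS_PER_MONTH = 730  # assumption: average month, endpoint running continuously


def endpoint_cost(hourly_rate: float, replicas: int = 1, hours: float = HOURS_PER_MONTH) -> float:
    """Cost in USD of running `replicas` instances for `hours` hours."""
    return hourly_rate * replicas * hours


if __name__ == "__main__":
    for name, rate in HOURLY_RATE_USD.items():
        print(f"{name}: {endpoint_cost(rate):.2f} USD per month, "
              f"{endpoint_cost(rate, hours=12):.2f} USD for a 12-hour job")
```

With these inputs a single T4 running for a full month comes to about 438 USD, which matches the "less than 500 dollars" in the article title, while the A100 is roughly ten times as expensive.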
Thank you for exploring this together with me. It was a little bit confusing, I know, but I try to have fun whenever I learn something new. "Machine learning at your service": it is so nice that you have all Transformers models, all Diffusers models and all Sentence-Transformers models, so if you really have a huge task, you do not have to go to AWS or Azure or Google Cloud directly, there is a nice Hugging Face interface for you. I don't know if it's really competitive. If some viewers have an overview, have used both, and also have experience with Hugging Face Inference Endpoints, please leave your suggestions and your experience with this cloud model for us to learn from. With this I say thank you for watching. It was a very spontaneous video recording, I hope you enjoyed it the same way I did, and I'll see you in my next video.
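The custom handler walked through in the transcript (tokenize the input, call `model.generate` with or without extra parameters, decode with `skip_special_tokens`) might look roughly like the sketch below. It follows the `EndpointHandler` class convention of a `handler.py` file, but it is an illustration under assumptions, not the tutorial's exact code: the optional `model` and `tokenizer` arguments are added here only so the logic can be tried with stand-in objects, and 8-bit loading is expressed through `BitsAndBytesConfig`, which requires the `bitsandbytes` and `accelerate` packages and a GPU.

```python
from typing import Any, Dict, List


class EndpointHandler:
    def __init__(self, path: str = "", model=None, tokenizer=None):
        # `model` and `tokenizer` are hypothetical extras for local testing.
        if model is None or tokenizer is None:
            # Heavy imports happen only when the real model is loaded.
            from transformers import (
                AutoModelForSeq2SeqLM,
                AutoTokenizer,
                BitsAndBytesConfig,
            )

            tokenizer = AutoTokenizer.from_pretrained(path)
            model = AutoModelForSeq2SeqLM.from_pretrained(
                path,
                device_map="auto",  # let accelerate place the sharded weights
                quantization_config=BitsAndBytesConfig(load_in_8bit=True),
            )
        self.model = model
        self.tokenizer = tokenizer

    def __call__(self, data: Dict[str, Any]) -> List[Dict[str, str]]:
        # Pre-process: read the prompt and optional generation parameters.
        inputs = data.get("inputs", "")
        parameters = data.get("parameters") or {}
        input_ids = self.tokenizer(inputs, return_tensors="pt").input_ids

        # Generate, passing any extra parameters (for example max_length).
        outputs = self.model.generate(input_ids, **parameters)

        # Post-process: decode the first sequence and drop special tokens.
        text = self.tokenizer.decode(outputs[0], skip_special_tokens=True)
        return [{"generated_text": text}]
```

Keeping the pre-process, generate and post-process steps in one small `__call__` mirrors the three stages named in the video and makes it easy to see where request parameters enter the generation call.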
Original Description
Easy cloud inference! Today I discover a new Flan-T5-XXL model repository on Hugging Face, which can run (optimized) on an NVIDIA A10G. Or run Google's Flan-T5-XXL on an A100 GPU. PLUS: first-time discovery of Hugging Face's Inference Endpoints! What are Inference Endpoints by HF? A fully managed cloud compute infrastructure (e.g. AWS, Azure, later Google) where I can take any Transformers or Sentence-Transformers model repository from Hugging Face and run it directly on cloud compute! A new milestone for easy inference of 11B parameter LLMs.
Great solution by @HuggingFace
Classic repo by Google (NVIDIA A100 or 8x A100 w/ 640 GB):
https://huggingface.co/google/flan-t5-xxl
New repo by Phil Schmid (NVIDIA A10G):
https://huggingface.co/philschmid/flan-t5-xxl-sharded-fp16
Tutorial by Phil Schmid (recommended!):
https://www.philschmid.de/deploy-t5-11b
Inference Endpoints (start):
https://ui.endpoints.huggingface.co/welcome
#ai
#naturallanguageprocessing
#generativeai
Playlist
Uploads from Discover AI · Discover AI · 23 of 60