Unleashing the Power of BLOOM 176B with AWS ml.p4de.24xlarge, DJL & DeepSpeed: The Ultimate Boost!

Discover AI · Beginner ·🧠 Large Language Models ·3y ago

Key Takeaways

The video demonstrates how to run inference with the full BLOOM 176B model on an AWS ml.p4de.24xlarge instance using DJL Serving and DeepSpeed. It covers model parallelism, in particular tensor parallelism across eight GPUs, for low-latency LLM inference.
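To make the idea of tensor parallelism concrete, here is a minimal pure-Python sketch. It is not DeepSpeed's implementation: it only illustrates the principle that a weight matrix can be split column-wise into shards, each shard multiplied independently (as a separate GPU would do), and the partial outputs concatenated. The helper names `matmul`, `split_columns`, and `tensor_parallel_matmul` are hypothetical and chosen for illustration.

```python
from typing import List

Matrix = List[List[float]]


def matmul(x: Matrix, w: Matrix) -> Matrix:
    """Plain matrix product x @ w."""
    cols = list(zip(*w))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in x]


def split_columns(w: Matrix, shards: int) -> List[Matrix]:
    """Split w column-wise into equally wide shards (one per simulated GPU)."""
    n_cols = len(w[0])
    if n_cols % shards != 0:
        raise ValueError("column count must be divisible by the shard count")
    width = n_cols // shards
    return [[row[i * width:(i + 1) * width] for row in w] for i in range(shards)]


def tensor_parallel_matmul(x: Matrix, w: Matrix, shards: int) -> Matrix:
    """Compute x @ w shard by shard, then concatenate the partial outputs."""
    # Each partial product only needs its own slice of the weights.
    partials = [matmul(x, shard) for shard in split_columns(w, shards)]
    # Concatenating along the column axis plays the role of an all-gather.
    return [sum((p[r] for p in partials), []) for r in range(len(x))]
```

With a tensor parallel degree of 8, each simulated device holds one eighth of the weight columns, which is the reason a model too large for one GPU can still be served from a single multi-GPU instance.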

Full Transcript

Hello community! BLOOM LLMs: the full-fledged version with 176 billion parameters. I showed you that there are a lot of different LLMs, from some baby models and some medium-sized models to models like PaLM that have half a trillion parameters. I have a video where I showed you the Flan-T5 XL model, on which we can run an inference task on a free Colab notebook; that is a three-billion-parameter model. In one of my last videos I showed you the upgrade to the Flan-T5 XXL model. This is already an 11-billion-parameter model, and we used Hugging Face Inference Endpoints, which themselves use an NVIDIA A100 GPU on AWS, and I showed you the code for the inference task of Flan-T5 XXL.

Now you see that the jump to the next model with half a trillion parameters is huge. So a lot of you, my viewers, asked: hey, what about the full-fledged 176B BLOOM model? Because on my channel I only have videos about BLOOM on a free Colab notebook, and this of course is only a baby model with just a few billion parameters, not the full-fledged 176-billion-parameter model. And BLOOM needs more power, simply more power.

So we have to go to AWS. We select the biggest machine learning instance, the p4de with 1.1 terabytes of memory, and we apply SageMaker deep learning containers and Deep Java Library (DJL) Serving. And there's a very special technique we're going to apply, called model parallelism. We will not apply the pipeline parallelism from Hugging Face with its Accelerate library; we go for tensor parallelism, the DeepSpeed model here at AWS, which has the benefit of low latency with multiple GPUs working simultaneously.

If you want to learn more about DeepSpeed Inference, this is the best literature I could find on the topic: "Enabling Efficient Inference of Transformer Models at Unprecedented Scale" by Microsoft. Here is the arXiv preprint; have a look at this documentation. Now, if you look for the Deep Java
Library serving and DeepSpeed model-parallel inference, the best literature I could find was the original AWS Machine Learning Blog post where, in September 2022, they described in detail how you deploy large models on Amazon SageMaker. Anyway, what I would like you to take away is that the Deep Java Library is an open-source, high-level, engine-agnostic framework for deep learning, and we will use it as a model serving solution. It is a high-performance, universal model serving solution that is programming-language agnostic and works with PyTorch, TensorFlow, Apache MXNet, whatever there is. Now, if you want a deep dive, there's the Amazon SageMaker Developer Guide with about 4,000 pages, and at page 2026 I would recommend the chapter on real-time inference to really learn more about this topic.

So in summary we can say: large language models can be difficult to host, especially for low-latency inference tasks, simply because of their sheer size. This is valid for models with more than 100 billion parameters; they can be too big to fit into the memory of a single GPU, of a single accelerator. So you have more or less three solutions.

The first, simple and slow approach is to use the CPU memory and stream the model parameters sequentially to your accelerator, and you know what's going to happen: you're going to create a bottleneck between your CPU and your GPU.

The second solution is a little bit more clever: you compress the model, you squeeze it so that it can fit on a single GPU. This relies on rather complex techniques such as quantization, pruning, and distillation, and I will show you later on that we have something like LLM.int8() that we can use. But this requires time and expertise, and in some cases it can reduce the accuracy and the generalization of your language model.

Solution number three is the professional solution. We are working on AWS clusters, so we use model parallelism. Here we have two ways to go, as I showed you: either you go like Hugging Face with their Accelerate
engine, or we go with DeepSpeed on AWS. I will show you that the latter will deliver better low-latency inference without impacting the accuracy of the model. I think this is it from the theoretical standpoint; we now have a basic understanding. Hey, let's just jump right into the code.

So here we go: serve large language models (LLMs) on AWS SageMaker with a DeepSpeed container. We run an inference task with the full 176-billion-parameter BLOOM model, and we use AWS SageMaker with the latest container, so we use DeepSpeed and Deep Java Library. Deep Java Library provides the serving framework, while DeepSpeed is the key sharding library we leverage to enable hosting of large language models.

So here we go, four easy steps. First, as always, and you are familiar with this, we have a large pre-trained NLP model on Hugging Face, and we download the model from Hugging Face. Then we deploy this model on Amazon SageMaker across multiple GPUs; we will use 8 GPUs, but on a single SageMaker machine learning instance. Then, as I told you, we use Deep Java Library Serving and in particular the tensor parallelism from DeepSpeed, so that we achieve low latency in our generative AI task, and this is text generation like ChatGPT.

So here we go. First of course, if you are not familiar with AWS: there is a software development kit (SDK) for Python on AWS called Boto3, which allows Python developers to write software that makes use of services like Amazon S3 or EC2 or whatever. And, you're not going to believe it, the AWS CLI is a command-line interface to Amazon Web Services. This is all there is.

Now, as I told you, we have a model on Hugging Face. Let's have a look at the model; here it is, maybe a little bit bigger: microsoft/bloom-deepspeed-inference-int8. This is exactly what we need. And now, integer 8: this means we have more or less an 8-bit matrix
multiplication. So this is a custom int8 version of the original BLOOM weights, made to be fast to use with the DeepSpeed Inference engine, which we will use on AWS, and we want to implement tensor parallelism for low latency in our inference task. In this repo, the tensors are split into eight shards, of course, because we have eight target GPUs, A100 GPUs. So yes, yes, yes, beautiful; this is exactly the model that we're looking for.

Now I know what you might say: hey, 8-bit matrix multiplication, are you joking? Look at this paper; there's always a scientific paper that tells you something, and this one is on Papers with Code, I love it. So: "LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale", August 2022. With their method, a 175-billion-parameter 16- or 32-bit checkpoint can be loaded, converted to int8, and used immediately without performance degradation. This is amazing if you think about it. It is made possible by understanding and working around properties of highly systematic emergent features in Transformer language models that dominate attention and Transformer predictive performance. So they developed a two-part quantization procedure, this LLM.int8(): first, vector-wise quantization... yes, yes, yes, you can read all of this yourself. Never mind, it just tells us we can use it.

So here we go: we now have on Hugging Face microsoft/bloom-deepspeed-inference-int8, exactly as I showed you. This is exactly the model that we upload now to S3. Done, beautiful.

Now SageMaker is a little bit complicated, because, and here is the sentence, SageMaker needs the model to be in tarball format. In addition, we create a model with the inference code to shorten the endpoint creation time. This we have already done in some other videos: I showed you how we create a model with the inference code inside. So we kick off a multi-threaded approach to download the model weights in
the container, using exactly what I told you about before, the AWS CLI. So, the tarball: if you're not familiar with it, we have more or less three files. The first one is model.py. Our model Python file is the key file, which handles any requests for serving the model. It is also responsible for loading the model from S3 after the endpoint on AWS has been spun up. The model is loaded into the /tmp space on the container, because SageMaker maps /tmp to the Amazon EBS volume; yes, yes, yes, you don't have to care about this. The requirements.txt file simply lists the libraries that need to be installed when the container starts up. And our serving.properties, this is so easy: normally we only have one line there, and it is something about DeepSpeed; I'm going to show you in a second.

So we import sagemaker, boto3, time, json, everything. We create the variables and initialize them to create the endpoint, and we leverage Boto3 for this; that's standard, we don't have to talk about it.

Now, what you have to be specifically careful about is which large model inference container image with DJL Serving you use. My goodness. So, to make it easy: large model inference is simply called LMI, deep learning containers are called DLCs, and those are Docker images on Amazon ECR. And this is the most important sentence: these containers include all the necessary components, the libraries, and the drivers you need to host those large models on Amazon SageMaker or EC2 infrastructure. So this is it, and these containers are available for you; you just have to find the right one.

So let's jump in here, let's make it a little bit bigger, and you see here the large model inference containers, and we already have what we need: the Deep Java Library Serving version 0.20, the latest version, with DeepSpeed, Hugging Face Transformers, and of course Hugging Face Accelerate. We will not use Accelerate, we will use DeepSpeed, but
anyway, this works with all Transformers that are available on Hugging Face, and this is the reason why it is so beautiful to have these containers. So we have here the job type, which of course is inference, we have here eight GPUs, the Python version is 3.8, and this is exactly what we need. So we go back and say: beautiful, now we know what we are doing. You're joking, I know. And here we have our image identifier; this is exactly what I just showed you: DeepSpeed 0.7.5, cu116, Amazon, yes, beautiful. If you want to know more about the image URI, here is sagemaker.image_uris.retrieve, everything you need to know if you want something else.

So, as I showed you, the first topic is to create the tarball and upload it to our S3 location. Beautiful. Now this is a bit of a strange notation, but hey, we just have the %%writefile magic command here, so we simply write the code of this Jupyter notebook cell into this Python file. You remember our directory is called code_bloom176, and the file name is model.py. And you know why we need the model file: because the tarball needs a model file and a requirements file, and you know which other file we will use in a second. So, beautiful.

So here we go. As you know, from Hugging Face Transformers we import AutoConfig, we import AutoModelForCausalLM, and we import AutoTokenizer, a very general tokenizer, yes, yes, yes. And I just wanted to show you that we have more or less two functions here: a function that gets the model, and a function that handles requests to the model.

Now, in the function to get the model, you're not going to believe it: this is what you know, this is what we have done in dozens of videos when we worked with LLMs or sentence transformers or pure Transformers or whatever. We have a tokenizer, AutoTokenizer.from_pretrained, and this is our model directory where
we have our pre-trained model from Hugging Face; you know this. Then, when we have the tokenizer, we say the model is AutoModelForCausalLM.from_config, and here we have an AutoConfig from our pre-trained model directory, and we have float16. So you see, this is absolutely what you are familiar with. Then, of course, we have DeepSpeed, and here we initialize DeepSpeed inference: we pass the model, the base directory, and our checkpoints, but this is all stuff you know already, and the function simply returns the model and the tokenizer. No problem at all.

Now the second function that we define, to handle requests. The handler is easy; look at the commands here, you are familiar with this. We have our input tokens, so we apply our tokenizer to our data, we want PyTorch tensors returned, and we have padding activated. Then we just say: our output is of course model.generate on our input tokens. This is the standard procedure that you know and love. After we have defined this output, we call the tokenizer's batch_decode with skip_special_tokens set to True, like I showed you in the video before my last video. So now we have our model and we have our output. This is it; this is all familiar: input tokens and the output.

So, as I told you, we have two other files. The requirements file of course lists boto3 and the AWS CLI, and for the serving properties I told you that we just need to define the engine that we use, DeepSpeed; more or less this is it. Yes, yes, yes. Create the model file and upload it: so here we have it, create the model file and upload it to S3 with the SageMaker session's upload_data. Yes, finally done.

Now, there is something about security; we don't care about these two cells. So, to create an active endpoint, we have three steps; that's all we need to do. First, create the model using the
image container and the tarball. Then create the endpoint configuration with some parameters, and we have more or less three parameters. One is the instance type, and I told you it's the P4 instance, beautiful. Then you have more or less two timeouts: the model data download timeout in seconds, which you set to 2400, and the container startup health check timeout in seconds, which you also set to 2400. With these three parameters you are done with your endpoint config. And then you create, oops, you create the endpoint using the endpoint config that we just defined.

Now there is a tensor parallel degree parameter. We are working with 8 GPUs, so guess which value we're going to insert here. If you want to know more about this, there is some excellent literature, and I already showed you some other research papers concerning the inference that you can run with DeepSpeed.

Next point: create the SageMaker model. "Now we create a SageMaker model"; sometimes I'm amazed at what I write, it is really amazing. We use the Amazon Elastic Container Registry image provided by [unclear], yes, yes, yes. In the setup we configure a tensor parallel degree of eight, I told you, what a coincidence. And here we go: the primary container is our image identifier, here are our bucket and our prefix from S3, and we define the tensor parallel degree for 8 GPUs. That's it. Volume size in gigabytes, yes, yes, yes, forget about it.

And then, and this is almost the last step: create a SageMaker endpoint. Now isn't this nice: we could use any instance with multiple GPUs for testing, yes, I know, but we decided to go for ml.p4de.24xlarge. Then, as I told you, we have two timeouts to define, the model data download timeout in seconds and the container startup health check timeout in seconds, and this is it. With these three parameters we have the config ready, and you're not going to believe what happens now: we
create the SageMaker endpoint. We say SageMaker client, create_endpoint, and this is it. I know: simple, easy, beautiful. Then wait for the endpoint to be created; this can take a couple of minutes or longer, and I would like to stress the meaning of "longer". Yes, you can look at the snippets while it is creating, yes, yes, yes.

And now we're finally ready to run inference. Here we have, for example, three further parameters that you know: the first is temperature, the second is the number of new tokens, and the third is about beam search, the number of beams that you want, or greedy decoding. Temperature you are familiar with; the number of tokens is clear, and more tokens will of course increase the prediction time. What a coincidence. And look, it is absolutely the same as I showed you in my Flan-T5 videos, Flan-T5 XL and Flan-T5 XXL.

So we have our client, with which we invoke the endpoint. Let's look at the parameters first. We have no_repeat_ngram_size, also a parameter you know; the number of beams, 5 or 12, and I recommend 16 if you really want to be creative; temperature, okay, 0.8; max new tokens, set it to, I don't know, 256 or 512, why be shy here; minimum length only five, which is really conservative. And then of course you have your prompt. After this prompt, the system will go on and, depending on the number of new tokens that you want, continue the text input that you provided. In this example we have "Amazon.com, AWS is the best", and then you get an answer from the machine. Well, what a coincidence.

So this is it. You see, there's really nothing to it; there's just a little bit of AWS stuff happening around it, because we really want to optimize the execution. If you want to follow along: this notebook is of course an official AWS notebook, and all the credits go to AWS for providing it. I just modified it; I just put in some
explanation for us, but otherwise all credits go to AWS. You can download this notebook here, as you see: GitHub, aws, amazon-sagemaker-examples, inference, real-time, BLOOM 176B, and here the Deep Java Library DeepSpeed deploy Python notebook. This is the link; go there and download it. Half of the explanation is gone, but that is only needed when you see it the first time. Now you know what you're going to execute, and you are familiar with what you see.

Conclusion: we demonstrated how to use the SageMaker large model inference containers to host BLOOM 176B. Yes, there's also, I don't know if you know this little company, it's called Meta, Meta AI; they also have some models, here the 30-billion-parameter OPT model. I'm no fan of this (smile), so there's a reason why I show you the BLOOM model with 176 billion parameters. But anyway, we used the model-parallel technique with multiple GPUs on a single SageMaker ML instance.

Everything else: yes, you see here you have quotas, if you want, and something about timeouts. And then of course, as always on AWS, we delete the endpoint. In case the endpoint failed, we still want to delete the model, and we can also delete the model checkpoint from S3. So everything is clean, tidy, beautiful, and no further costs are incurred.

This is it. As you see, this is more or less a simple notebook. It has very much the same content: if you look at the model, we have our tokenizer, we have our download from Hugging Face, so I think you should feel rather familiar here. It is absolutely the same code sequence that we use elsewhere; it is just a little bit more complicated because we are running this on a highly distributed AWS instance.

So this is it. If you like this video, if you upvote this video, if you want, next time we can connect to AWS and I'll show you this in real time. Until then, thank you for watching, thank you for listening, and I'll see you in my next video.
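The endpoint configuration and the inference request described in the transcript can be sketched as plain Python dictionaries. This is a hedged illustration, not the notebook's code: the instance type, the two 2400-second timeouts, and the generation parameter values come from the transcript, while the function names, the variant name, the `no_repeat_ngram_size` value, and the payload envelope keys `input` and `gen_kwargs` are assumptions made for this example. The field names inside `ProductionVariants` are those accepted by the SageMaker `create_endpoint_config` API.

```python
import json
from typing import Any, Dict


def build_endpoint_config(config_name: str, model_name: str,
                          instance_type: str = "ml.p4de.24xlarge",
                          timeout_s: int = 2400) -> Dict[str, Any]:
    """Build keyword arguments for SageMaker's create_endpoint_config call."""
    return {
        "EndpointConfigName": config_name,
        "ProductionVariants": [{
            "VariantName": "variant1",  # hypothetical name
            "ModelName": model_name,
            "InstanceType": instance_type,
            "InitialInstanceCount": 1,
            # Large checkpoints need generous download and startup windows.
            "ModelDataDownloadTimeoutInSeconds": timeout_s,
            "ContainerStartupHealthCheckTimeoutInSeconds": timeout_s,
        }],
    }


def build_generation_payload(prompt: str) -> str:
    """Serialize a text-generation request (envelope keys are assumed)."""
    return json.dumps({
        "input": [prompt],
        "gen_kwargs": {
            "temperature": 0.8,
            "max_new_tokens": 256,
            "min_length": 5,
            "num_beams": 5,
            "no_repeat_ngram_size": 2,  # assumed value, not from the video
        },
    })


# Usage with real AWS credentials (not executed here):
#   import boto3
#   cfg = build_endpoint_config("bloom-config", "bloom-model")
#   boto3.client("sagemaker").create_endpoint_config(**cfg)
#   boto3.client("sagemaker-runtime").invoke_endpoint(
#       EndpointName="bloom-endpoint",
#       Body=build_generation_payload("Amazon.com, AWS is the best"),
#       ContentType="application/json")
```

Keeping the request construction separate from the boto3 calls makes it possible to inspect the configuration before paying for an instance of this size.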

Original Description

More Power! How and where to run inference of an LLM with 176 billion parameters? Well, what about the most expensive ML instance on AWS? The most performant implementation for LLMs (utilizing the latest, and most expensive, cloud infrastructure)? Some implementation ideas ... Regarding LLM inference code implementation: which LMI DLCs on Amazon ECR to apply? Should we use model parallelism, and if yes, pipeline (like HF's Accelerate) or tensor (like DeepSpeed)? Do we have language-agnostic model serving, and if yes, how to apply Deep Java Library Serving with PyTorch? Interested in spending some hundreds of US$ on Amazon SageMaker for maybe (if successful ...) a single hour of inference, but with extremely low inference latency? This is the latest in tech? Hey Amazon: More Power! The next trillion-parameter models are visible on the horizon. In short: an ipynb for you to experience the full BLOOM 176B model inference task on the most expensive cloud infrastructure. - Not for beginners - - This video is just for nerds with deep pockets - - As a beginner, do not pay a cloud provider for their top-tier infrastructure; always start with a reasonably cheap instance and observe system behaviour and cost accumulation - Thanks to AWS for providing this Jupyter Notebook to show us how to run inference of huge LLMs on their latest infrastructure. I just wonder why they are so charming to show us how to spend our hard-earned money on the latest AWS cloud tech??? Any ideas? Not sponsored by anybody (unfortunately).
00:00 BLOOM 176B vs Flan-T5-XXL
01:36 More Power!
03:45 3 Options to run LLMs on GPU
05:42 ipynb SageMaker DeepSpeed Container
#bloom #ai #naturallanguageprocessing #generativeai #chatgpt

This video teaches how to unleash the power of BLOOM 176B with AWS ml.p4de.24xlarge, DJL, and DeepSpeed, covering model parallelism, tensor parallelism, and low latency inference. It provides a comprehensive guide on implementing these techniques for LLM inference.

Key Takeaways
  1. Run an inference task with BLOOM 176B on an AWS ml.p4de.24xlarge instance
  2. Deploy the model on Amazon SageMaker across multiple GPUs
  3. Use Deep Java Library (DJL) Serving and DeepSpeed tensor parallelism to optimize for low latency
  4. Use an int8 checkpoint of BLOOM; LLM.int8() shows that a 16- or 32-bit checkpoint can be converted to 8-bit
  5. Create a SageMaker model from an Amazon Elastic Container Registry (ECR) image
  6. Configure the SageMaker endpoint with the ml.p4de.24xlarge instance type and the model download and container startup health check timeouts
💡 The video highlights the importance of model parallelism, specifically tensor parallelism, for achieving low-latency inference with large language models like BLOOM 176B.
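A quick back-of-the-envelope calculation shows why both the int8 checkpoint and the eight-way sharding matter. The sketch below counts weight memory only and ignores activations, the key-value cache, and framework overhead, so real usage is higher. The function names are hypothetical, and the 80 GB figure assumes the A100 variant fitted to the ml.p4de.24xlarge instance.

```python
def weight_memory_gb(n_params: int, bytes_per_param: int) -> float:
    """Memory needed for the weights alone, in gigabytes (10^9 bytes)."""
    return n_params * bytes_per_param / 1e9


def per_gpu_gb(n_params: int, bytes_per_param: int,
               tensor_parallel_degree: int) -> float:
    """Weight memory per GPU when the weights are sharded evenly."""
    return weight_memory_gb(n_params, bytes_per_param) / tensor_parallel_degree


BLOOM_PARAMS = 176_000_000_000
A100_MEMORY_GB = 80  # assumption: 80 GB A100 GPUs on ml.p4de.24xlarge

fp16_total = weight_memory_gb(BLOOM_PARAMS, 2)  # 2 bytes per fp16 weight
int8_total = weight_memory_gb(BLOOM_PARAMS, 1)  # 1 byte per int8 weight
int8_per_gpu = per_gpu_gb(BLOOM_PARAMS, 1, 8)   # sharded across 8 GPUs
```

Under these assumptions, neither the fp16 nor the int8 weights fit on a single 80 GB GPU, while one eighth of the int8 weights fits with room to spare for activations.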

Chapters (4)

0:00 BLOOM 176B vs Flan-T5-XXL
1:36 More Power!
3:45 3 Options to run LLMs on GPU
5:42 ipynb SageMaker DeepSpeed Container