Model Quantization for efficient deployment with Amazon SageMaker AI | Amazon Web Services

Amazon Web Services · Advanced ·🧠 Large Language Models ·7mo ago

Key Takeaways

The video discusses model quantization techniques, including AWQ and GPTQ, for efficient deployment of large language models with Amazon SageMaker AI, and demonstrates how to deploy quantized models using SageMaker and the DJL inference serving container.

Full Transcript

Hey everyone, welcome to Amazon SageMaker AI science corner videos, your go-to location for deep data science and large language model content. My name is Pranav Morti and I'm a senior GenAI data scientist. I work in the Amazon SageMaker AI team. I love building intelligent and autonomous systems. In this video, we'll dive into efficient deployment techniques using Amazon SageMaker AI, with a special focus on quantized approaches to deploying large language models. To follow along, make sure you have access to GPU-based Studio notebook instances. If not, that's okay. You can still run these in SageMaker AI as jobs, so you'd still have access to the code and to all the processes. In this particular video, we'll cover the two most popular quantization methods and walk through how to deploy a custom-tuned quantized model on SageMaker AI in just a few steps.
To take a step back, broadly there are two main approaches to large language model quantization. The first is offline quantization. Methods like AWQ, GPTQ, and GGUF, some of the most popular, are used to quantize the model weights ahead of time. This means that you prepare and save a quantized model before deployment, reducing the model size and speeding up inference right from the start. The second is online or runtime quantization. Libraries like bitsandbytes apply quantization dynamically at inference time. This allows you to use standard model weights and benefit from quantization without needing a separate pre-processing step. Both approaches are widely used to make large language models more efficient for real-world deployments, especially when running in resource-constrained environments. For now, let's focus on two techniques and dive deep into them. The first one is the activation-aware weight quantization technique, or AWQ, and the second is the GPT quantization technique, or GPTQ.
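To make "quantizing the weights ahead of time" concrete, here is a minimal sketch of plain round-to-nearest quantization of a handful of weights to 4-bit integer codes and back. It is not AWQ or GPTQ themselves, which choose scales more carefully; the function names and the sample weights are illustrative assumptions.

```python
def quantize_group(weights, n_bits=4):
    """Map float weights to integer codes in [0, 2**n_bits - 1] (round to nearest)."""
    q_max = (1 << n_bits) - 1                      # 15 for 4-bit codes
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / q_max or 1.0         # fall back to 1.0 for a constant group
    zero_point = round(-w_min / scale)             # integer code that represents 0.0
    codes = [min(q_max, max(0, round(w / scale) + zero_point)) for w in weights]
    return codes, scale, zero_point


def dequantize_group(codes, scale, zero_point):
    """Recover approximate float weights from the stored integer codes."""
    return [(c - zero_point) * scale for c in codes]


# Hypothetical weights, chosen only for illustration.
weights = [-0.30, -0.10, 0.00, 0.05, 0.20, 0.45]
codes, scale, zero_point = quantize_group(weights)
restored = dequantize_group(codes, scale, zero_point)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(codes, scale, zero_point, max_error)
```

Only the small integer codes plus one scale and one zero point per group need to be stored, which is where the size reduction comes from.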
So let's take a quick look at AWQ, activation-aware weight quantization. What is it, and what purpose does it serve? AWQ is a quantization technique that focuses on minimizing the impact of quantization errors by considering activation statistics during the quantization process. It optimizes how model weights are quantized based on how activations behave in each layer, leading to better accuracy retention, especially for large language models. AWQ is typically applied offline, before deployment, producing a quantized model ready for efficient inference. This method enables faster inference and smaller model sizes with minimal quality loss. AWQ is especially popular for compressing models for edge or GPU-constrained environments.
The second method we want to focus on is GPTQ, or GPT quantization. GPTQ is an efficient post-training quantization method designed for large language models, especially the GPT family. It quantizes model weights after training, focusing on minimizing output errors layer by layer for high accuracy. The process is performed offline, creating a static quantized model for deployment. Similar to AWQ, GPTQ is widely used for deploying large models on hardware with limited memory, as it significantly reduces model size and speeds up inference. It strikes a strong balance between model compression and maintaining generation quality.
I'd also like to give an honorable mention to bitsandbytes from an inference perspective. So let's take a quick look at what bitsandbytes is. bitsandbytes is a library that enables runtime quantization, applying quantization to models dynamically during inference. It supports a range of quantization types, such as 8-bit and 4-bit, without needing to pre-quantize model weights. This allows users to quickly experiment with quantization and deploy standard models with a reduced memory footprint and lower compute requirements.
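As a rough sketch of the runtime approach, the snippet below loads standard weights through Hugging Face transformers with a `BitsAndBytesConfig`. It is not shown in the video. The helper names and the choice of the `nf4` data type are assumptions, and the heavy imports are deferred so that the settings helper can be checked without a GPU stack installed.

```python
def bnb_settings(bits: int) -> dict:
    """Keyword arguments for transformers.BitsAndBytesConfig (8-bit or 4-bit loading)."""
    if bits == 8:
        return {"load_in_8bit": True}
    if bits == 4:
        # "nf4" is one of the 4-bit data types bitsandbytes offers (assumed choice).
        return {"load_in_4bit": True, "bnb_4bit_quant_type": "nf4"}
    raise ValueError("this sketch only covers 8-bit and 4-bit loading")


def load_quantized_at_runtime(model_id: str, bits: int = 4):
    """Load unquantized weights and let bitsandbytes quantize them while loading."""
    # Deferred imports: transformers and bitsandbytes are only needed when this runs.
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    config = BitsAndBytesConfig(**bnb_settings(bits))
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, quantization_config=config, device_map="auto"
    )
    return model, tokenizer
```

Calling `load_quantized_at_runtime` with a model ID or local path needs a GPU, and no separate quantized checkpoint is written to disk.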
bitsandbytes is easy to integrate with popular frameworks like Hugging Face Transformers. It's ideal for rapid prototyping or situations where pre-quantizing the model isn't practical. In the next few minutes, we'll learn how to quantize a model with the AWQ or GPTQ method.
First, let's go over the AWQ quantization script. So, top to bottom, what we have is a bunch of imports. There is a requirement for AWQ, which is the AutoAWQ package, which you can install with pip. It'll take a quick minute, but let's go through all of the different parameters that are available. I'd also like to quickly highlight how easy it is to quantize a model with Hugging Face, where you just call model.quantize. Hugging Face makes it super easy for you to load the model, provide the tokenizer, and out you get the quantized model.
Now let's understand the different parameters that we have available. The first is the zero point, which enables zero-point quantization, improving the range and accuracy of quantized weights. The second parameter to keep in mind is the Q group size. It effectively sets the number of weights grouped together for quantization. 128 is the default, and for most use cases 128 works really well. Larger groups can improve efficiency but may slightly impact accuracy, so that's something to keep in mind as you quantize these models. The next parameter to keep in mind is the w bit, specifically the number of bits used to represent each weight. Most AWQ models are 4-bit compressions, so we recommend keeping it at 4-bit, but if you need a smaller footprint you can go to three or two. Keep in mind that this reduces accuracy. The last one is the version. You get GEMM and GEMV. It essentially selects a quantization algorithm, one of which is optimized for batch, the other for speed. So in short, the script is fairly simple. You've got all of these parameters available at runtime.
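The parameters above can be sketched as an AutoAWQ quantization config. This is a minimal sketch, not the script from the video: the function names are hypothetical, and the `awq` and `transformers` imports are deferred so the config helper can be inspected on a machine without a GPU.

```python
def build_awq_config(zero_point=True, q_group_size=128, w_bit=4, version="GEMM"):
    """Return the quantization settings discussed above as an AutoAWQ config dict."""
    if version not in ("GEMM", "GEMV"):
        raise ValueError("version must be 'GEMM' or 'GEMV'")
    return {
        "zero_point": zero_point,      # enable zero-point quantization
        "q_group_size": q_group_size,  # number of weights that share one scale
        "w_bit": w_bit,                # bits used to represent each weight
        "version": version,            # GEMM or GEMV
    }


def quantize_with_awq(model_path, output_dir, quant_config=None):
    """Load a model, quantize it layer by layer, and save model plus tokenizer."""
    # Deferred imports: only needed when quantization actually runs.
    from awq import AutoAWQForCausalLM
    from transformers import AutoTokenizer

    quant_config = quant_config or build_awq_config()
    model = AutoAWQForCausalLM.from_pretrained(model_path)
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
    model.quantize(tokenizer, quant_config=quant_config)
    model.save_quantized(output_dir)
    tokenizer.save_pretrained(output_dir)
```

Keeping the settings in one small function makes it easy to compare runs that differ only in group size or version.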
Now let's take some default parameters and quickly quantize a model. I already have a model that was fine-tuned upstream available to me. This is a base model, a Llama 3.2 3 billion model, that I fine-tuned using the Spectrum fine-tuning technique for about 10 epochs. Now I would like to convert this model into an AWQ quantized model so I can deploy it for inference. So this is what my script looks like. I'm calling the AWQ model quantization.py. I supply the model path, which can be the local path. I provide the model name that I'd like to deploy it as. I would like zero-point enablement, 128, 4, and GEMM. And then finally the output directory where I want this to be stored. I'm going to remove this path and just add it as V2. And let me go ahead and paste this into the command line.
While this runs, I already have a quantized model that's ready to go. As you can see, all of the layers are being quantized. AWQ is going layer by layer, quantizing and reducing the footprint of the model. While it does that, let's take a quick peek at the base model, Llama 3.2 3 billion, and the AWQ quantized model. From the outset you can quickly see that on the top is the model that's available on Hugging Face. That model is almost 5 and a half gigs in size. However, the same model post AWQ quantization is less than half. It's only 2.1 GB in size. So it's a pretty significant reduction in footprint.
Now, how do you deploy this model? That's the important point. The answer is that SageMaker makes it very easy. Let's take a quick look at that after we run through the GPTQ quantization method as well. We took a look at how AWQ quantization works. And by the way, the quantization is about to complete, but let's jump into the GPTQ quantization. It's similar thematically.
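The AWQ command-line call described above might be wired up roughly as follows. The flag names, the placeholder paths, and the model name are all hypothetical, since the transcript does not show the script's actual arguments.

```python
import argparse


def build_parser():
    """Command-line interface for a hypothetical AWQ quantization script."""
    parser = argparse.ArgumentParser(description="Quantize a model with AWQ")
    parser.add_argument("--model-path", required=True, help="local path to the model")
    parser.add_argument("--quant-name", required=True, help="name for the quantized model")
    parser.add_argument("--zero-point", action="store_true", help="enable zero-point quantization")
    parser.add_argument("--q-group-size", type=int, default=128)
    parser.add_argument("--w-bit", type=int, default=4)
    parser.add_argument("--version", choices=["GEMM", "GEMV"], default="GEMM")
    parser.add_argument("--output-dir", required=True)
    return parser


# Placeholder values that mirror the settings named in the walkthrough.
args = build_parser().parse_args([
    "--model-path", "./my-finetuned-model",
    "--quant-name", "my-model-awq",
    "--zero-point",
    "--q-group-size", "128",
    "--w-bit", "4",
    "--version", "GEMM",
    "--output-dir", "./quantized",
])
print(vars(args))
```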
The script quantizes a pre-trained language model using the GPTQ method, making it suitable for efficient inference. We know this already, but what are the key parameters that go into GPTQ quantization? Similar to AWQ, we set the bits. This can be 4, 3, or 2, similar to AWQ, and it controls the model compression and efficiency. The second is the group size. Similar to AWQ, this is a balance of performance and accuracy. We have the calibration size, which specifies how many calibration samples are used to estimate quantization parameters, essentially improving the output quality. Lastly, there is a parameter, use v2, where you can use the second version of quantization, which has shown some improvements, but this is completely optional.
At a high level, what the script does is load the pre-trained model and the tokenizer and prepare a calibration data set. Here we're using 1,024 samples from the C4 data set to accurately measure how quantization will affect the model output. You can use any data set, even your custom data set, to have the model evaluate and then compute the weights before saving the model in GPTQ quantized format. Then the script configures GPTQ quantization with the chosen settings, runs the quantization process using the calibration data to minimize accuracy loss, and finally saves the quantized model and tokenizer for deployment.
The way we're going to run this is very similar to AWQ. Here we're going to reference the GPTQ model quantization.py. Once again I'm using a fine-tuned model, which was fine-tuned using the Spectrum fine-tuning technique for about 10 epochs, and I'm going to save the quantized model name with the suffix GPTQ. I'm going to select four bits with a group size of 128 and a calibration of 1,024 samples, and then finally the output directory. Effectively I'll just use the root here, and it's as simple as running this. Now one more thing to keep in mind is you can go larger in terms of bits.
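The GPTQ steps above can be sketched with the open-source GPTQModel library. The transcript does not say which library the script uses, so that choice is an assumption, as are the function names and the specific C4 shard. The optional "use v2" setting is left out.

```python
def gptq_settings(bits=4, group_size=128, calibration_size=1024):
    """Collect and sanity-check the GPTQ settings named in the walkthrough."""
    if bits not in (2, 3, 4, 8):
        raise ValueError("bits should be one of 2, 3, 4, or 8")
    if calibration_size <= 0:
        raise ValueError("calibration_size must be positive")
    return {"bits": bits, "group_size": group_size, "calibration_size": calibration_size}


def quantize_with_gptq(model_path, output_dir, **overrides):
    """Calibrate on C4 samples, quantize, then save the model and tokenizer."""
    # Deferred imports: these packages are only needed when quantization runs.
    from datasets import load_dataset
    from gptqmodel import GPTQModel, QuantizeConfig
    from transformers import AutoTokenizer

    settings = gptq_settings(**overrides)
    # One shard of the public C4 data set; any text data set could be used instead.
    calibration = load_dataset(
        "allenai/c4",
        data_files="en/c4-train.00001-of-01024.json.gz",
        split="train",
    ).select(range(settings["calibration_size"]))["text"]

    config = QuantizeConfig(bits=settings["bits"], group_size=settings["group_size"])
    model = GPTQModel.load(model_path, config)
    model.quantize(calibration, batch_size=1)
    model.save(output_dir)
    AutoTokenizer.from_pretrained(model_path).save_pretrained(output_dir)
```

Passing `bits=8` to `quantize_with_gptq` would trade a larger model for slightly higher precision.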
You can go from four to eight, which means that you get slightly higher precision, but your model size will be larger. So we now have our quantized model ready to go. Now the question you might be asking yourself, or that I have been asking myself, is: okay, I now have a model, but how do I deploy it and scale it out to many users who can leverage my fine-tuned model in a scalable manner? My easy answer to that is you can deploy any custom model on SageMaker AI hosting, or SageMaker AI endpoints, and we'll quickly walk through how that's done.
So we have a notebook here. Most of you know this. We import a bunch of required modules at the top. We find out what region we need to host the model in. We instantiate the SageMaker session and we provide the role, which is assumed at runtime. I'm going to use the DJL inference serving container, which is very easy. I just need to supply a few configuration components. The DJL inference serving container does the hard work for me of picking up the model, instantiating it, deploying it, and getting it ready for inference. The way you deploy models with the DJL inference serving container is you specify the ECR URI, which is managed by AWS. If you know the URI, you can directly plug it in. I know the version that I need. I need the DJL inference container 0.31 with a certain CUDA version that I know works for sure. But if you're unsure, you can also use the SageMaker image URI retrieve function to retrieve it for a given framework and the version you need.
Okay, I'm going to name this model something unique that I know. I'm going to call it custom llama reasoning R1 distilled, with some datetime. The second thing I need to do is upload the model to an S3 bucket, because this is a private model that I own. I can host it on Hugging Face, which means that it's widely and publicly available.
Or I can host it in S3 and allow my DJL inference serving container to tap into S3 and pull the model for deployment, which is a much more secure way to do it in this context. So we upload the files to S3. This is the path that we're using. Now I'm going to copy the same path into the environment configuration that is used by the DJL container for deployment. So I have the model ID. The model ID can be one of two things. It can be the S3 path, and the DJL container automatically recognizes what type of model it is, or it can be a Hugging Face path, which means it pulls the model from the Hugging Face Hub. In addition to that, you can specify parameters like the max model length, how much of the GPU resources it needs to use, whether you're enabling streaming, what the rolling batch type is, and so on. There is a whole host of configuration elements that you can set to get the maximum out of the hosted model endpoint, and then you simply deploy.
One thing you need to keep in mind is what type of instance you'd like to choose in order to deploy these models. Given that we quantized these models and they are just 2 gigs in size, you may technically be able to host them even on a 16 GB GPU. But I'm using a g5.2xlarge, which contains 24 GB of GPU memory. That is plenty for an AWQ model, which means that I can increase the number of concurrent requests, and that single instance can serve many users within my ecosystem. Once that model is deployed, it'll give me a new model endpoint. That may take between 6 and 12 minutes, but I already have a model that's deployed and ready to go. So I'm going to go back to my SageMaker Studio and navigate down to endpoints. Here I have the custom llama that I predeployed just a few minutes ago. So I'm going to copy the model name here.
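The deployment flow described above can be sketched as follows, assuming version 2 of the SageMaker Python SDK. The function names, default values, bucket path, and the particular set of DJL `OPTION_*` settings are illustrative assumptions, not the notebook from the video. The container image URI is passed in by the caller rather than guessed here.

```python
from datetime import datetime


def unique_name(base: str) -> str:
    """Append a timestamp so the model and endpoint name is unique."""
    return f"{base}-{datetime.now():%Y-%m-%d-%H-%M-%S}"


def djl_env(model_s3_uri: str, max_model_len: int = 4096, quantize: str = "awq") -> dict:
    """Environment configuration read by the DJL inference serving container."""
    return {
        "HF_MODEL_ID": model_s3_uri,             # S3 path or Hugging Face Hub ID
        "OPTION_ROLLING_BATCH": "vllm",          # rolling batch type (assumed choice)
        "OPTION_MAX_MODEL_LEN": str(max_model_len),
        "OPTION_QUANTIZE": quantize,             # tells the engine the weights are AWQ
        "OPTION_GPU_MEMORY_UTILIZATION": "0.9",  # share of GPU memory to use
    }


def deploy(model_s3_uri, role_arn, image_uri, base_name="custom-llama"):
    """Create the model and a real-time endpoint on one ml.g5.2xlarge instance."""
    # Deferred imports: the SageMaker SDK and AWS credentials are only needed here.
    import sagemaker
    from sagemaker.model import Model

    name = unique_name(base_name)
    model = Model(
        image_uri=image_uri,
        role=role_arn,
        env=djl_env(model_s3_uri),
        name=name,
        sagemaker_session=sagemaker.Session(),
    )
    model.deploy(
        initial_instance_count=1,
        instance_type="ml.g5.2xlarge",
        endpoint_name=name,
        container_startup_health_check_timeout=900,  # allow time to pull the model
    )
    return name
```

Every environment value is a string, since the container reads them as environment variables.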
I go back to my notebook, attach myself to that custom model endpoint name, and instantiate a new predictor. You don't have to do anything else other than declare a new SageMaker predictor, and now you're good to supply your text. The one thing that you may choose to do is format the input before sending it to DJL endpoints. In this case, that's exactly what I'm doing. At runtime, you can configure parameters like temperature, top P, top K, max tokens, and so on. Now, I'm just asking a simple question: who are you, and what's 2 + 2? You can see that the model actually thinks it through, and you can see it generated a bunch of tokens in a very, very limited period of time. So that's it. This is how you deploy a model for inference on SageMaker AI, whether it's a quantized or a base model. It's very easy, and we hope you follow along with us. Thank you.
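Attaching to an existing endpoint and sending a prompt might look roughly like this, again assuming version 2 of the SageMaker Python SDK. The payload shape follows the `inputs` and `parameters` request schema that DJL serving accepts. The function names and default sampling values are assumptions.

```python
def build_payload(prompt, temperature=0.6, top_p=0.9, top_k=50, max_new_tokens=256):
    """Format the request body, including the runtime generation parameters."""
    return {
        "inputs": prompt,
        "parameters": {
            "temperature": temperature,
            "top_p": top_p,
            "top_k": top_k,
            "max_new_tokens": max_new_tokens,
        },
    }


def ask(endpoint_name: str, prompt: str):
    """Attach a predictor to an already deployed endpoint and invoke it."""
    # Deferred imports: the SageMaker SDK and AWS credentials are only needed here.
    from sagemaker.deserializers import JSONDeserializer
    from sagemaker.predictor import Predictor
    from sagemaker.serializers import JSONSerializer

    predictor = Predictor(
        endpoint_name=endpoint_name,
        serializer=JSONSerializer(),
        deserializer=JSONDeserializer(),
    )
    return predictor.predict(build_payload(prompt))


payload = build_payload("Who are you? And what's 2 + 2?")
print(payload)
```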

Original Description

Learn about efficient deployment techniques using Amazon SageMaker AI, focusing on various model quantization approaches to deploying models for inference. This video discusses various approaches to quantization and their benefits. Model quantization is a technique used to reduce the computational and memory requirements of large language models by reducing the precision of the model's parameters and computations, enabling faster, more efficient deployment with minimal accuracy loss. To learn more, visit https://go.aws/4hUrxiX

This video teaches how to use model quantization techniques, such as AWQ and GPTQ, to efficiently deploy large language models with Amazon SageMaker AI, and demonstrates deployment using SageMaker and the DJL inference serving container. By the end of this video, viewers will be able to deploy quantized models and optimize model size and inference speed.

Key Takeaways
  1. Call the AWQ model quantization script
  2. Supply the model path and the model name to deploy
  3. Enable zero point quantization and set the Q group size to 128
  4. Set the number of bits used to represent each weight to 4
  5. Select the quantization algorithm to use
  6. Run the GPTQ model quantization.py script
  7. Configure GPTQ quantization with chosen settings
  8. Run the quantization process using calibration data
  9. Save the quantized model and tokenizer for deployment
  10. Deploy the model on SageMaker AI hosting or SageMaker AI endpoints
💡 Model quantization techniques, such as AWQ and GPTQ, can significantly reduce model size and improve inference speed, making it possible to deploy large language models on hardware with limited memory.
