Model Quantization for efficient deployment with Amazon SageMaker AI | Amazon Web Services
Key Takeaways
The video discusses model quantization techniques, including AWQ and GPTQ, for efficient deployment of large language models with Amazon SageMaker AI, and demonstrates how to deploy quantized models using SageMaker and the DJL inference serving container.
Full Transcript
Hey everyone, welcome to Amazon SageMaker AI Science Corner videos, your go-to location for deep data science and large language model content. My name is Pranav Morti and I'm a senior GenAI data scientist. I work in the Amazon SageMaker AI team. I love building intelligent and autonomous systems. In this video, we'll dive into efficient deployment techniques using Amazon SageMaker AI, with a special focus on quantized approaches to deploying large language models. To follow along, make sure you have access to GPU-based Studio notebook instances. If not, that's okay. You can still run these in SageMaker AI as jobs, so you'd still have access to the code and to all the processes. In this particular video, we'll cover the two most popular types of quantization methods and walk through how to deploy a custom-tuned quantized model on SageMaker AI in just a few steps.
To take a step back, broadly there are two main approaches to large language model quantization. The first is offline quantization. Tools like AWQ, GPTQ, and GGUF, some of the most popular methods, are used to quantize the model weights ahead of time. This means that you prepare and save a quantized model before deployment, reducing the model size and speeding up inference right from the start. The second method is online, or runtime, quantization. Libraries like bitsandbytes apply quantization dynamically at inference time. This allows you to use standard model weights and benefit from quantization without needing a separate pre-processing step. Both approaches are widely used to make large language models more efficient for real-world deployments, especially when running in resource-constrained environments. For now, let's focus on two optimization techniques and dive deep into them: the activation-aware weight quantization technique, or AWQ, and the GPT quantization technique, or GPTQ.
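As a rough illustration of why lower-precision weights shrink a model, the sketch below estimates the storage needed for the weights alone at a given bit width. The 3-billion parameter count and the function name are assumptions chosen for illustration; real quantized checkpoints also store scales, zero points, and some unquantized layers, so actual files come out larger than this lower bound.

```python
def weight_size_gb(num_params: int, bits_per_weight: int) -> float:
    """Approximate storage for the weights alone, in gigabytes (10**9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    params = 3_000_000_000  # assumed parameter count, for illustration only
    for bits in (16, 8, 4):
        print(f"{bits:>2}-bit weights: about {weight_size_gb(params, bits):.1f} GB")
```

Halving the bit width halves this estimate, which is the basic reason a 4-bit checkpoint fits on much smaller GPUs than its 16-bit original.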
So let's take a quick look at AWQ, activation-aware weight quantization. What is it, and what purpose does it serve? AWQ is a quantization technique that focuses on minimizing the impact of quantization errors by considering activation statistics during the quantization process. It optimizes how model weights are quantized based on how activations behave in each layer, leading to better accuracy retention, especially for large language models. AWQ is typically applied offline, before deployment, producing a quantized model ready for efficient inference. This method enables faster inference and smaller model sizes with minimal quality loss. AWQ is especially popular for compressing models for edge or GPU-constrained environments.
The second method we want to focus on is GPTQ, or GPT quantization. GPTQ is an efficient post-training quantization method designed for large language models, especially the GPT family. It quantizes model weights after training, focusing on minimizing output errors layer by layer for high accuracy. The process is performed offline, creating a static quantized model for deployment. Similar to AWQ, GPTQ is widely used for deploying large models on hardware with limited memory, as it significantly reduces model size and speeds up inference. It strikes a strong balance between model compression and maintaining generation quality.
I'd also like to give an honorable mention to bitsandbytes from an inference perspective. So let's just take a quick look at what bitsandbytes is. bitsandbytes is a library that enables runtime quantization, applying quantization to models dynamically during inference. It supports a range of quantization types, such as 8-bit and 4-bit, without needing to pre-quantize model weights. This allows users to quickly experiment with quantization and deploy standard models with a reduced memory footprint and lower compute requirements.
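To make the runtime approach concrete, here is a minimal sketch of loading a standard checkpoint in 4-bit through the Hugging Face Transformers integration with bitsandbytes. The video does not show this code; the NF4 quantization type, the bfloat16 compute dtype, and the function name are assumptions, and running it needs a CUDA GPU with `torch`, `transformers`, and `bitsandbytes` installed.

```python
# Settings assumed for illustration; 8-bit loading would use load_in_8bit instead.
BNB_4BIT_SETTINGS = {
    "load_in_4bit": True,
    "bnb_4bit_quant_type": "nf4",
}


def load_in_4bit(model_id: str):
    """Load unquantized weights and quantize them to 4-bit while loading."""
    # Imported lazily so this file can be read or imported without the GPU stack.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    bnb_config = BitsAndBytesConfig(
        bnb_4bit_compute_dtype=torch.bfloat16,  # assumed compute dtype
        **BNB_4BIT_SETTINGS,
    )
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=bnb_config,
        device_map="auto",
    )
    return model, tokenizer
```

No pre-quantized artifact is written to disk here, which is what distinguishes this path from the offline AWQ and GPTQ workflows.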
bitsandbytes is easy to integrate with popular frameworks like Hugging Face Transformers. It's ideal for rapid prototyping or situations where pre-quantizing the model isn't practical.
In the next few minutes, we'll learn how to quantize a model with the AWQ or GPTQ method. Let's go over the AWQ quantization script first. Top to bottom, what we have is a bunch of imports. There is a requirement for AWQ, which is the AutoAWQ package, which you can install from pip. We'll take a quick minute, but let's just go through all of the different parameters that are available. And I'd like to quickly highlight how easy it is to quantize a model with Hugging Face, where you just call model.quantize. Hugging Face makes it super easy for you to just load the model in, provide the tokenizer, and out you get the quantized model.
Now let's understand the different parameters that we have available. The first is the zero point, which enables zero-point quantization, improving the range and accuracy of quantized weights. The second parameter to keep in mind is the Q group size. It effectively sets the number of weights grouped together for quantization. 128 is the default, and for most use cases 128 works really well. Larger groups can improve efficiency but may slightly impact accuracy, so that's something to keep in mind as you quantize these models. The next parameter to keep in mind is the W bit, specifically the number of bits used to represent the weights. Most AWQ models are 4-bit compressions, so we recommend keeping it at 4-bit, but if you need a smaller footprint you can go to three or two. Keep in mind that this reduces accuracy. The last one is the version. You get GEMM and GEMV. It essentially selects a quantization algorithm, one of which is optimized for batching and the other for speed. So in short, the script is fairly simple. You've got all of these parameters available at runtime.
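The parameters described above map onto the AutoAWQ package roughly as in the sketch below. This is not the script from the video, which is not reproduced in the transcript; the function names are hypothetical, and running the quantization step needs a GPU with `autoawq` and `transformers` installed.

```python
def build_quant_config(zero_point=True, q_group_size=128, w_bit=4, version="GEMM"):
    """Collect the four AWQ settings discussed in the video into one dict."""
    return {
        "zero_point": zero_point,      # enable zero-point quantization
        "q_group_size": q_group_size,  # weights quantized together per group
        "w_bit": w_bit,                # bits per weight
        "version": version,            # "GEMM" or "GEMV"
    }


def quantize_with_awq(model_path: str, output_dir: str, quant_config=None):
    """Quantize a local or Hub model with AWQ and save it with its tokenizer."""
    # Imported lazily so the config helper works without the GPU stack installed.
    from awq import AutoAWQForCausalLM
    from transformers import AutoTokenizer

    quant_config = quant_config or build_quant_config()
    model = AutoAWQForCausalLM.from_pretrained(model_path)
    tokenizer = AutoTokenizer.from_pretrained(model_path)

    # Runs calibration and quantizes the model layer by layer.
    model.quantize(tokenizer, quant_config=quant_config)

    model.save_quantized(output_dir)
    tokenizer.save_pretrained(output_dir)
    return output_dir
```

Keeping the settings in a plain dict makes it easy to log exactly which configuration produced a given quantized artifact.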
Now let's just take some default parameters and quickly quantize a model. I already have a model that was fine-tuned upstream available to me. This is a Llama 3.2 3 billion base model that I fine-tuned using the Spectrum fine-tuning technique for about 10 epochs. And now I would like to convert this model into an AWQ quantized model so I can deploy it for inference. So this is what my script looks like. I'm calling the AWQ model quantization.py script. I supply the model path, which can be a local path. I provide the model name that I'd like to deploy it as. And I would like zero point enabled, 128, 4, and GEMM. And then finally the output directory where I want this to be stored. I'm going to change this path and just add V2. And let me go ahead and paste this into the command line. While this runs, I already have a quantized model that's ready to go. As you can see, all of the layers are being quantized. AWQ is going layer by layer, quantizing and reducing the footprint of the model.
While it does that, let's take a quick peek at the base model, Llama 3.2 3 billion, and the AWQ quantized model. Right away you can see that on the top is the model that's available on Hugging Face. The model is almost 5 and a half gigs in size. However, the same model post AWQ quantization is less than half of that. It's only 2.1 GB in size. So it's a pretty significant reduction in footprint. Now, how do you deploy this model? That's the important point. The answer is that SageMaker makes it very easy. Let's take a quick look at that after we run through the GPTQ quantization method as well. We took a look at how AWQ quantization works. And by the way, the quantization is about to complete, but let's just jump into the GPTQ quantization. It's similar thematically.
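The command-line call described above could be wired up along these lines. The transcript does not show the script's actual flags, so every flag name here is hypothetical; only the set of values (model path, quantized model name, zero point, group size 128, 4 bits, GEMM, output directory) comes from the video.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Argument parser for a hypothetical AWQ quantization script."""
    parser = argparse.ArgumentParser(description="Quantize a model with AWQ (sketch).")
    parser.add_argument("--model-path", required=True,
                        help="Local directory or Hub ID of the fine-tuned model")
    parser.add_argument("--quant-name", required=True,
                        help="Name to give the quantized model")
    parser.add_argument("--zero-point", action="store_true",
                        help="Enable zero-point quantization")
    parser.add_argument("--q-group-size", type=int, default=128,
                        help="Number of weights quantized together")
    parser.add_argument("--w-bit", type=int, default=4,
                        help="Bits per weight (the video mentions 4, 3, or 2)")
    parser.add_argument("--version", default="GEMM", choices=["GEMM", "GEMV"],
                        help="AWQ version to use")
    parser.add_argument("--output-dir", required=True,
                        help="Where to store the quantized model")
    return parser


if __name__ == "__main__":
    # Example values only; the paths and the name are placeholders.
    args = build_parser().parse_args([
        "--model-path", "./my-finetuned-model",
        "--quant-name", "my-model-awq",
        "--zero-point",
        "--output-dir", "./quantized",
    ])
    print(vars(args))
```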
The script quantizes a pre-trained language model using the GPTQ method, making it suitable for efficient inference. We know this already, but what are the key parameters that go into GPTQ quantization? Similar to AWQ, we set the bits. This can be 4, 3, or 2, similar to AWQ, which controls the model compression and efficiency. The second is the group size, similar to AWQ. This is a balance of performance and accuracy. We have the calibration size, which specifies how many calibration samples are used to estimate quantization parameters, essentially improving the output quality. Lastly, there is a parameter called use v2, where you can use the second version of quantization, which has shown some improvements, but this is completely optional.
At a high level, what the script does is load the pre-trained model and the tokenizer and prepare a calibration data set. Here we're using 1,024 samples from the C4 data set to accurately measure how quantization will affect the model output. You can use any data set, even your custom data set, to have the model evaluate and then compute the weights before sending the model out in GPTQ quantized format. Then the script configures GPTQ quantization with the chosen settings, runs the quantization process using the calibration data to minimize accuracy loss, and finally saves the quantized model and tokenizer for deployment.
The way we're going to run this is very similar to AWQ. Here we're going to reference the GPTQ model quantization.py script. Once again I'm using a fine-tuned model, which was fine-tuned using the Spectrum fine-tuning technique for about 10 epochs. I'm going to save the quantized model name with the suffix GPTQ, and I'm going to select four bits with a group size of 128 and a calibration of 1,024 samples. And then finally the output directory. Effectively I'll just use the root here, and it's as simple as running this. Now one more thing to keep in mind is you can go larger in terms of bits.
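The GPTQ flow just described (load model and tokenizer, prepare calibration text, quantize, save) can be sketched with the `GPTQConfig` class in Hugging Face Transformers. This is a swapped-in illustration, not the script from the video, which may use a different GPTQ library; the function names are hypothetical, and running the quantization needs a GPU plus a GPTQ backend installed alongside `transformers` and `optimum`.

```python
def pick_calibration_texts(texts, num_samples=1024):
    """Keep the first `num_samples` non-empty texts for calibration."""
    cleaned = [t for t in texts if t and t.strip()]
    return cleaned[:num_samples]


def quantize_with_gptq(model_path, output_dir, calibration_texts,
                       bits=4, group_size=128):
    """Quantize a model with GPTQ using caller-supplied calibration texts."""
    # Imported lazily so the helper above works without the GPU stack installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

    tokenizer = AutoTokenizer.from_pretrained(model_path)
    gptq_config = GPTQConfig(
        bits=bits,
        group_size=group_size,
        dataset=pick_calibration_texts(calibration_texts),  # list of strings
        tokenizer=tokenizer,
    )

    # Quantization runs while the model loads with this config attached.
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        device_map="auto",
        quantization_config=gptq_config,
    )

    model.save_pretrained(output_dir)
    tokenizer.save_pretrained(output_dir)
    return output_dir
```

The calibration texts can come from C4, as in the video, or from a custom data set that better matches the traffic the model will serve.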
You can go from four to eight, which means that you get slightly higher precision, but your model size will be larger. So we now have our quantized model ready to go. Now the question you might be asking yourself, or that I have been asking myself, is: okay, I now have a model, but how do I deploy it and scale it out to many users who can leverage my fine-tuned model in a scalable manner? My easy answer is that you can deploy any custom model on SageMaker AI hosting, or SageMaker AI endpoints, and we'll quickly walk through how that's done.
So we have a notebook here. Most of you know this. We import a bunch of required modules at the top. We find out what region we need to host the model in. We instantiate the SageMaker session and we provide the role, which is assumed at runtime. I'm going to use the DJL inference serving container, which is very easy. I just need to supply a few configuration components, and the DJL inference serving container does the hard work for me: picking up the model, instantiating it, deploying it, and getting it ready for inference. The way you deploy models with the DJL inference serving container is you specify the ECR URI, which is managed by AWS. If you know the URI, you can just directly plug it in. I know the version that I need. I need the DJL inference container 0.31 with a certain CUDA version that I know works for sure. But if you're unsure, you can also use the SageMaker image URI retrieve function to retrieve the URI for a given framework and the version you need. Okay, I'm going to name this model something unique that I know. So I'm going to call it custom llama reasoning R1 distilled, with some datetime. Now the second thing I need to do is upload the model to an S3 bucket, because this is a private model that I own. I can host it on Hugging Face, which means that it's widely and publicly available.
Or I can host it in S3 and just allow my DJL inference serving container to tap into S3 and pull the model for deployment, which is a much more secure way to do it in this context. So we upload the files to S3. This is the path that we're using. Now I'm going to copy the same path into the environment configuration that is used by the DJL container for deployment. So I have the model ID. The model ID can be one of two things. It can be the S3 path, and the DJL container automatically recognizes what type of model it is, or it can be a Hugging Face path, which means it pulls the model from the Hugging Face Hub. In addition to that, you can specify parameters like the max model length, how much GPU resource it needs to use, whether you're enabling streaming, what the rolling batch type is, and so on. There is a whole host of configuration elements that you can set to get the maximum out of the hosted model endpoint, and then you simply deploy.
Now one thing you need to keep in mind is what type of instance you'd like to choose in order to deploy these models. Given that we quantized these models and they are just 2 gigs in size, you may technically be able to host them even on a 16 GB GPU. But I'm using a g5.2xlarge, which has 24 GB of GPU memory. That is plenty for an AWQ model, which means that I can increase the number of concurrent requests, and that single instance can serve many users within my ecosystem. Once that model is deployed, it'll give me a new model endpoint. That may take between 6 and 12 minutes, but I already have a model that's deployed and ready to go. So I'm going to go back to my SageMaker Studio, and then I'm going to navigate down to endpoints. And here I have the custom llama that I predeployed just a few minutes ago. So I'm going to copy the model name here.
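The deployment steps walked through above (upload the quantized model to S3, point the DJL container at that path through environment variables, then deploy) might look like the sketch below with the SageMaker Python SDK. The environment variable names follow the DJL Large Model Inference container's conventions as best I know them and should be checked against the container version in use; the function names, the `vllm` rolling batch choice, the 4096 max length, and the S3 key prefix are all assumptions. The container image URI is left as a parameter because the transcript does not state it.

```python
def build_lmi_env(model_s3_uri: str, max_model_len: int = 4096, quantize: str = "awq"):
    """Environment for the DJL serving container (variable names are assumptions)."""
    return {
        "HF_MODEL_ID": model_s3_uri,          # S3 path or Hugging Face Hub ID
        "OPTION_ROLLING_BATCH": "vllm",       # assumed rolling batch type
        "OPTION_QUANTIZE": quantize,          # "awq" or "gptq"
        "OPTION_MAX_MODEL_LEN": str(max_model_len),
        "TENSOR_PARALLEL_DEGREE": "1",        # single-GPU instance
    }


def deploy_quantized_model(local_model_dir: str, image_uri: str, endpoint_name: str,
                           instance_type: str = "ml.g5.2xlarge"):
    """Upload a quantized model directory to S3 and deploy it as an endpoint."""
    # Imported lazily so the env helper works without the SageMaker SDK installed.
    import sagemaker
    from sagemaker.model import Model

    session = sagemaker.Session()
    role = sagemaker.get_execution_role()  # role assumed at runtime

    # Uploading a directory returns the S3 prefix that holds its files.
    model_s3_uri = session.upload_data(
        path=local_model_dir,
        bucket=session.default_bucket(),
        key_prefix=f"models/{endpoint_name}",  # hypothetical prefix
    )

    model = Model(
        image_uri=image_uri,
        role=role,
        env=build_lmi_env(model_s3_uri),
        name=endpoint_name,
        sagemaker_session=session,
    )
    model.deploy(
        initial_instance_count=1,
        instance_type=instance_type,
        endpoint_name=endpoint_name,
        container_startup_health_check_timeout=900,  # allow time to pull the model
    )
    return endpoint_name
```

Keeping the model in S3 rather than on a public hub means access is governed by the execution role's permissions.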
I go back to my notebook, attach myself to that custom model endpoint name, and then instantiate a new predictor. You don't have to do anything else other than declare a new SageMaker predictor, and now you're good to supply your text. The one thing that you may choose to do is format the input before sending it to DJL endpoints. In this case, that's exactly what I'm doing. At runtime, you can also configure parameters like temperature, top P, top K, max tokens, and so on. Now, I'm just asking simple questions: who are you, and what's 2 + 2? You can see that the model actually thinks it through, and you can see it generated a bunch of tokens in a very, very limited period of time. So that's it. This is how you deploy a model for inference on SageMaker AI, whether it's a quantized or a base model. It's very easy, and we hope you follow along with us. Thank you.
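Attaching to an existing endpoint and sending a prompt, as described above, can be sketched as follows. The payload shape (`inputs` plus a `parameters` dict) is the request format I would expect from the DJL serving container and is an assumption here, as are the function names and the default sampling values; the endpoint name must be the one shown in SageMaker Studio.

```python
def build_payload(prompt: str, temperature: float = 0.6, top_p: float = 0.9,
                  max_new_tokens: int = 256):
    """Build a request body with runtime generation parameters (assumed schema)."""
    return {
        "inputs": prompt,
        "parameters": {
            "temperature": temperature,
            "top_p": top_p,
            "max_new_tokens": max_new_tokens,
        },
    }


def ask(endpoint_name: str, prompt: str):
    """Attach a predictor to an already deployed endpoint and send one prompt."""
    # Imported lazily so the payload helper works without the SageMaker SDK.
    import sagemaker
    from sagemaker.deserializers import JSONDeserializer
    from sagemaker.predictor import Predictor
    from sagemaker.serializers import JSONSerializer

    predictor = Predictor(
        endpoint_name=endpoint_name,
        sagemaker_session=sagemaker.Session(),
        serializer=JSONSerializer(),
        deserializer=JSONDeserializer(),
    )
    return predictor.predict(build_payload(prompt))
```

A call such as `ask("<your-endpoint-name>", "Who are you, and what's 2 + 2?")` mirrors the question asked in the video; any chat-template formatting of the prompt would be applied before building the payload.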
Original Description
Learn about efficient deployment techniques using Amazon SageMaker AI focusing on various model quantization approaches to deploying models for inference. This video discusses various approaches to quantization and their benefits. Model quantization is a technique used to reduce the computational and memory requirements of large language models, by reducing the precision of the model's parameters and computations, enabling faster, more efficient deployment with minimal accuracy loss.
To learn more, visit https://go.aws/4hUrxiX