Deploying Deep Learning: Quantization, Serving, and Edge AI

Free to audit · Opens on Coursera

Coursera · Advanced · ML Fundamentals · 2d ago
"Production Deep Learning: Inference, Quantization & Edge Deployment is designed for ML engineers and developers who want to master the full deployment lifecycle — from compressing and quantizing models to serving them at scale using vLLM, Triton, ONNX, and Llama.cpp. Module 1 covers model compression fundamentals, including pruning, distillation, and INT8/INT4 quantization using AWQ and GPTQ, with a focus on the accuracy–latency tradeoff. Module 2 dives into high-throughput serving architectures, exploring vLLM's PagedAttention, NVIDIA Triton, TensorRT, and scaling inference across GPU clusters with autoscaling patterns. Module 3 focuses on CPU and edge deployment using ONNX Runtime, GGUF, and Llama.cpp, plus multimodal inference with CLIP and LLaVA on resource-constrained devices. Module 4 is a capstone project where you'll quantize a fine-tuned LLM, build a production API with vLLM, benchmark performance, and containerize your model with Docker for cloud and edge deployment. By the end of this course, you will: - Apply INT4/INT8 quantization techniques (AWQ, GPTQ, GGUF) to compress LLMs for production - Deploy high-throughput inference servers using vLLM, Triton, and ONNX Runtime - Run optimized models on GPU, CPU, and edge devices using Llama.cpp and TensorRT - Build, benchmark, and containerize an end-to-end production-ready inference API" Disclaimer: This is an independent educational resource created by Board Infinity for informational and educational purposes only. This course is not affiliated with, endorsed by, sponsored by, or officially associated with any company, organization, or certification body unless explicitly stated. The content provided is based on industry knowledge and best practices but does not constitute official training material for any specific employer or certification program. All company names, trademarks, service marks, and logos referenced are the property of their respective owners and are used solely for educational identif
