Deploying Deep Learning: Quantization, Serving, and Edge AI

Free to audit on Coursera

Coursera · Advanced · 🏭 MLOps & LLMOps · 1mo ago

Skills: Model Deployment (90%) · Training at Scale (70%)

Key Takeaways

Covers deploying deep learning models through quantization, high-throughput serving, and edge inference, using vLLM, Triton, ONNX, and Llama.cpp
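
As a rough illustration of the quantization idea named above, the sketch below applies symmetric per-tensor INT8 quantization to a handful of made-up weights and measures the round-trip error. It is not taken from the course and is not how AWQ, GPTQ, or GGUF work internally; the function names (`quantize_int8`, `dequantize_int8`) and the sample values are assumptions chosen only to show the basic accuracy cost of storing floats as 8-bit integers.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    # One shared scale for the whole tensor; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize_int8(quantized, scale):
    """Recover approximate floats from the integer codes."""
    return [q * scale for q in quantized]


# Hypothetical weights, used only for illustration.
weights = [0.42, -1.27, 0.003, 0.9, -0.55]

q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)

# The worst-case round-trip error is about half of one quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print("codes:", q)
print("scale:", scale)
print("max round-trip error:", max_error)
```

Small values such as 0.003 collapse to zero under a single shared scale, which is one reason production methods tend to use per-channel or per-group scales instead.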

Original Description

"Production Deep Learning: Inference, Quantization & Edge Deployment is designed for ML engineers and developers who want to master the full deployment lifecycle, from compressing and quantizing models to serving them at scale using vLLM, Triton, ONNX, and Llama.cpp.

Module 1 covers model compression fundamentals, including pruning, distillation, and INT8/INT4 quantization using AWQ and GPTQ, with a focus on the accuracy–latency tradeoff.

Module 2 dives into high-throughput serving architectures, exploring vLLM's PagedAttention, NVIDIA Triton, TensorRT, and scaling inference across GPU clusters with autoscaling patterns.

Module 3 focuses on CPU and edge deployment using ONNX Runtime, GGUF, and Llama.cpp, plus multimodal inference with CLIP and LLaVA on resource-constrained devices.

Module 4 is a capstone project where you'll quantize a fine-tuned LLM, build a production API with vLLM, benchmark performance, and containerize your model with Docker for cloud and edge deployment.

By the end of this course, you will:

- Apply INT4/INT8 quantization techniques (AWQ, GPTQ, GGUF) to compress LLMs for production
- Deploy high-throughput inference servers using vLLM, Triton, and ONNX Runtime
- Run optimized models on GPU, CPU, and edge devices using Llama.cpp and TensorRT
- Build, benchmark, and containerize an end-to-end production-ready inference API"

Disclaimer: This is an independent educational resource created by Board Infinity for informational and educational purposes only. This course is not affiliated with, endorsed by, sponsored by, or officially associated with any company, organization, or certification body unless explicitly stated. The content provided is based on industry knowledge and best practices but does not constitute official training material for any specific employer or certification program. All company names, trademarks, service marks, and logos referenced are the property of their respective owners and are used solely for educational identif

More on: Model Deployment

Tutorial 11- How To Deploy End To End ML Projects In Production AWS Cloud Using CI CD Pipeline

Use Amazon SageMaker with PyTorch (Hebrew)

Automate, Evaluate and Deploy ML Models Confidently

Introducing LangSmith Studio and Deployment for LangGraph.js

Ryan Herr - After model.fit, before you deploy | JupyterCon 2020

Deploy & Optimize ML Services Confidently

Related Reads

DevOps Took 10 Years to Mature.

MLOps is distinct from DevOps and solves unique problems, requiring a different approach

Medium · DevOps

Praesto: A Kubernetes Operator for Node-Local ML Model Caching with CSI

Learn how Praesto, a Kubernetes Operator, optimizes ML model caching for Node-Local storage with CSI, reducing costs and improving performance

Medium · DevOps

Beyond `ollama run`: Production-Ready DeepSeek R1 Deployment with vLLM and Nginx

Learn to deploy DeepSeek R1 with vLLM and Nginx for production-ready environments, moving beyond local development

Dev.to · Shannon Dias

MCP Health Check: Building Production Monitoring for Your MCP Server — What I Learned After 84 Production Outages

Learn to build production monitoring for your MCP server to minimize outages and ensure smooth operation
