Serverless LLMs and Agentic AI with Modal – Lesson 2

Uploaded: 2025-12-12T10:39:12+00:00
Channel: BrainOmega

BrainOmega · Beginner · 🤖 AI Agents & Automation · 5mo ago

Skills: Tool Use & Function Calling 90% · Multi-Agent Systems 80% · Autonomous Workflows 80%

💖 Support BrainOmega
☕ Buy Me a Coffee: https://buymeacoffee.com/brainomega
💳 Stripe: https://buy.stripe.com/aFa00i6XF7jSbfS9T218c00
💰 PayPal: https://paypal.me/farhadrh

🎥 In this video, we continue our Serverless LLMs and Agentic AI course with Lesson 2: Scaling & Input Concurrency in Modal. Building on the foundations from Lesson 1, this lesson dives deeper into how Modal actually scales your workloads behind the scenes, and how you can control that behavior for real-world, production-style AI and API workloads.

This lesson is fully hands-on and experiment-driven. You’ll work with a simulated API-style function that mimics IO-bound workloads, and you’ll observe how Modal automatically spins containers up and down as demand changes. You’ll then learn how to tune that behavior using container scaling parameters like max_containers, min_containers, and scaledown_window, and how to dramatically change performance by enabling input concurrency, allowing each container to handle many requests at once.

By the end of this lesson, you’ll clearly understand the difference between container scaling and input concurrency, when to use each one, and why concurrency is critical for efficient LLM inference, embeddings, and agent-based systems. This lesson prepares you to design fast, cost-efficient serverless AI services instead of blindly scaling infrastructure.

💻 Code on GitHub: https://github.com/frezazadeh/serverless-llm-agentic-ai/blob/main/Lesson2.ipynb

⸻

📚 What You’ll Learn
• How Modal auto-scales containers under load
• The difference between container scaling and input concurrency
• How to use max_containers, min_containers, and scaledown_window
• How @modal.concurrent enables many requests per container
• Why concurrency is essential for IO-bound workloads and LLM APIs
• How to inspect scaling behavior in the Modal dashboard
• How to design efficient serverless AI services instead of over-scaling

⸻

✅ Why Watch This Lesson?
• You’ll understand how ser
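
The difference between container scaling and input concurrency described above can be illustrated with a small local simulation. This is only a sketch and does not use the Modal API or the lesson's notebook code: it runs with the Python standard library, and the names `fake_api_call`, `run_load`, and `expected_waves` are hypothetical. It also assumes a fixed pool of containers, whereas the description says Modal adds and removes containers as demand changes. In this model, the `containers` argument loosely plays the role of a cap such as max_containers, and `inputs_per_container` plays the role of the per-container limit that @modal.concurrent enables.

```python
import asyncio
import math
import time


async def fake_api_call(delay: float) -> None:
    """Stand-in for an IO-bound request: it only waits, it does no computation."""
    await asyncio.sleep(delay)


async def run_load(
    num_requests: int,
    containers: int,
    inputs_per_container: int,
    delay: float = 0.05,
) -> float:
    """Serve num_requests and return the elapsed wall-clock seconds.

    Assumption: total capacity is modeled as one shared pool of
    containers * inputs_per_container slots.
    """
    slots = asyncio.Semaphore(containers * inputs_per_container)

    async def handle_one() -> None:
        # A request must hold a slot for as long as it is waiting on IO.
        async with slots:
            await fake_api_call(delay)

    start = time.perf_counter()
    await asyncio.gather(*(handle_one() for _ in range(num_requests)))
    return time.perf_counter() - start


def expected_waves(num_requests: int, containers: int, inputs_per_container: int) -> int:
    """Number of back-to-back batches needed when every slot is kept busy."""
    return math.ceil(num_requests / (containers * inputs_per_container))


if __name__ == "__main__":
    scenarios = [
        ("container scaling only", 2, 1),
        ("with input concurrency", 2, 10),
    ]
    for label, containers, inputs in scenarios:
        elapsed = asyncio.run(run_load(20, containers, inputs))
        waves = expected_waves(20, containers, inputs)
        print(f"{label}: {waves} wave(s), {elapsed:.2f}s")
```

With 20 requests and 2 containers, one input per container needs 10 waves of waiting, while 10 inputs per container fits every request into a single wave. Because an IO-bound request mostly sits idle, raising the per-container limit shortens total time in this model without adding containers.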
