Building AI Agents That Survive Production

MLOps.community · Beginner · 🤖 AI Agents & Automation · 1w ago
Haytham Abuelfutuh, Co-founder and CTO of Union.ai and co-author of the open-source orchestrator Flyte, opens the AI Agents 2026 conference in Seattle with a brutally simple message: stop trying to design AI agents that never fail. Build agents that fail cheaply and recover automatically.

In this 25-minute talk, Haytham walks through the three design principles every production agent needs, the 3 D's (Dynamic, Durable, and Defended), and shows what each one actually requires from your platform. He grounds it in a real case study with Dragonfly, who took a laptop prototype to a production agent system indexing 250,000+ products in a single sitting on Flyte 2.

Topics covered:
- The travel agent thought experiment: what 18 years of human agents teach us about long-running sessions, dropped calls, and not asking the user the same question twice
- The show-of-hands problem: why so many teams build agents but so few ever ship them
- The full taxonomy of agent failure: semantic errors, infrastructure errors, network errors, API throttling, and corrupt context
- Dynamic: why agent platforms must run native Python instead of forcing you into a constrained DSL for branching and loops
- Durable: declaring infrastructure inside your code so agents can react to OOMs, spot machine preemption, and crashes
- Crash recovery for long-running sessions: caching non-deterministic LLM calls and tool calls so agents can resume from the last checkpoint
- Cross-session caching: when to share LLM outputs across users and when to recompute
- Defended: sandboxing agent-generated code with Pydantic Monty and network-isolated execution environments
- Human-in-the-loop bailouts when the agent has exhausted its retries
- Dragonfly case study: a four-tier agent architecture (catalog, coordinator, researcher, tools) for product recommendation across 250K+ products
- Q&A: why Union.ai uses Go and Rust under the Python SDK, and how platform teams can shift agent infrastructure left to developers wi
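The crash-recovery topic above (caching non-deterministic LLM and tool calls so an agent can resume from its last checkpoint) can be illustrated with a minimal sketch. This is not Flyte's or Union.ai's API: `CheckpointCache`, `fake_llm`, and the on-disk JSON layout are hypothetical names chosen for illustration, and the sketch assumes step results are JSON-serializable strings.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Callable


class CheckpointCache:
    """Hypothetical helper: persists step results keyed by step name and inputs."""

    def __init__(self, directory: Path) -> None:
        self.directory = directory
        self.directory.mkdir(parents=True, exist_ok=True)

    def _path(self, step: str, payload: dict) -> Path:
        # Hash the step name and its inputs so the same call maps to the same file.
        key = json.dumps({"step": step, "payload": payload}, sort_keys=True)
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return self.directory / f"{digest}.json"

    def run(self, step: str, payload: dict, fn: Callable[[dict], str]) -> str:
        path = self._path(step, payload)
        if path.exists():
            # A previous run already completed this step: reuse its result.
            return json.loads(path.read_text())["result"]
        result = fn(payload)
        # Write to a temp file, then rename, so a crash mid-write
        # does not leave a half-written checkpoint behind.
        tmp = path.with_suffix(".tmp")
        tmp.write_text(json.dumps({"result": result}))
        tmp.replace(path)
        return result


calls = {"llm": 0}


def fake_llm(payload: dict) -> str:
    """Stand-in for a non-deterministic LLM call; counts how often it runs."""
    calls["llm"] += 1
    return f"plan for {payload['query']} (call {calls['llm']})"


workdir = Path(tempfile.mkdtemp())
query = {"query": "trip to Seattle"}

first = CheckpointCache(workdir).run("plan", query, fake_llm)
# Simulate a crash and restart: a fresh cache object over the same directory.
second = CheckpointCache(workdir).run("plan", query, fake_llm)
```

After the simulated restart, the second run returns the stored result without calling the model again, which is the property that lets a long-running session pick up where it left off. Sharing the same directory across users would give a crude form of cross-session caching; whether that is appropriate depends on the step, as the talk's topic list suggests.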
