Scaling Meta's Multi-Agent Systems to a Billion Videos

Name: Scaling Meta's Multi-Agent Systems to a Billion Videos
Uploaded: 2026-05-11T16:45:48Z
Channel: MLOps.community
Description: Aditya Gautam (Generative AI, Meta) breaks down how Meta tackles two of the hardest problems in short-form video at the scale of hundreds of millions to...

MLOps.community · Advanced ·🤖 AI Agents & Automation ·1w ago

Skills: Multi-Agent Systems90%

Aditya Gautam (Generative AI, Meta) breaks down how Meta tackles two of the hardest problems in short-form video at the scale of hundreds of millions to a billion reads per day: modality misalignment and original content theft. He's spent years on Facebook Reels integrity and recommendations, and in this talk he gets practical about why a single giant LLM is the wrong tool for the job, and what to build instead. This is the application-layer view of multi-agent systems. Not infra, not orchestration theory, just what actually works when your input is messy user-generated video and your budget is a fraction of a cent per inference. What you'll learn: - The two problems Meta is solving: Modality misalignment (text says one thing, video shows another, sometimes only for a few frames) and content theft in the copycat creator economy. - Why a single big LLM falls over: Cost at 100M+ daily inferences, modality bias in VLMs, and the inability to bring in external context like the rest of the video corpus. - The 3-agent architecture: Perceiver (signal acquisition, scene boundaries, VLM embeddings, OCR), Retriever (KNN over vector DBs, similarity matrices, creator-to-creator graphs), and Reasoner (chain-of-thought, confidence scores, can request more context). - Why small specialized models win: 3B to 11B fine-tuned LLMs per agent beat a 200B generalist on both quality and cost, and let each agent ship on its own CI/CD like a microservice. - The evaluation stack: Precision/recall/F1, isolated retrieval evaluation, reasoning quality via LLM-as-judge plus human-in-the-loop, hallucination rate, and full system efficiency logging at every hop. - Four real optimizations that drop cost 10x: Spatial and temporal frame merging (don't send 300 similar cloud frames to a VLM), semantic hashing of viral content, dynamic routing that skips reasoning for obvious copies, and metadata pruning based on creator reputation and topic safety. - The honest hard parts: Inference latency stacks a

Watch on YouTube ↗ (saves to browser)