Mixture-of-Experts Routing: Visually Explained

Tales Of Tensors · Advanced · 🧠 Large Language Models · 1mo ago
Mixtral “8×7B” has ~47B total parameters, yet only a small slice activates per token, because a router sends each token to a top-K set of experts and combines their outputs. But MoE isn’t “pick two experts and you’re done.” We’ll walk through the real engineering story: the routing math (softmax → top-K → weighted combine), why early MoE suffered from expert collapse and load imbalance, and what MoE 2.0 changed with load-balancing losses and shared experts. Then we get practical: the all-to-all communication overhead that can wipe out theoretical speedups, the capacity/overflow tradeoff (and what “…
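The softmax → top-K → weighted-combine routing described above can be sketched in a few lines of NumPy. This is a toy illustration, not Mixtral's actual implementation; all names (`route_tokens`, `router_weights`) are made up for the example.

```python
import numpy as np

def route_tokens(token_embeds, router_weights, top_k=2):
    """Toy top-K MoE router: softmax over expert logits, keep the
    top-K experts per token, renormalize their probabilities into
    combine weights. Illustrative only, not a real library API."""
    logits = token_embeds @ router_weights           # (tokens, experts)
    logits -= logits.max(axis=-1, keepdims=True)     # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=-1, keepdims=True)       # softmax over experts

    # Indices of the top-K experts for each token
    topk_idx = np.argsort(probs, axis=-1)[:, -top_k:]
    topk_w = np.take_along_axis(probs, topk_idx, axis=-1)
    topk_w /= topk_w.sum(axis=-1, keepdims=True)     # renormalize kept probs
    return topk_idx, topk_w

rng = np.random.default_rng(0)
tokens = rng.standard_normal((4, 16))   # 4 tokens, hidden dim 16
W = rng.standard_normal((16, 8))        # router for 8 experts
idx, w = route_tokens(tokens, W)
print(idx.shape, w.shape)               # (4, 2) (4, 2)
```

Each token's output is then the sum of its K selected experts' outputs, scaled by these weights; dispatching each token's row only to the experts in `idx` is what makes the compute sparse.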
Watch on YouTube ↗

Chapters (7)

The Promise of Mixture-of-Experts (Mixtral 8x7B)
1:15 Dense Models vs Sparse MoE & Conditional Compute
2:05 Router Mechanics: Softmax, Top-K Selection & Combining
2:55 Early MoE Problems: Expert Collapse & Load Imbalance
3:40 Capacity Limits, Overflow & Token Dropping Strategies
4:40 Load-Balancing Loss, Shared Experts & Hybrid Designs
5:40 All-to-All Communication & Multi-GPU Bottlenecks
Next Up
5 Levels of AI Agents - From Simple LLM Calls to Multi-Agent Systems
Dave Ebbelaar (LLM Eng)