Mixture-of-Experts Routing: Visually Explained
Mixtral “8×7B” has ~47B total parameters, yet only a small slice is active per token—because a router sends each token to a top-K set of experts and combines their outputs.
But MoE isn’t “pick two experts and you’re done.” We’ll walk through the real engineering story: routing math (softmax → top-K → weighted combine), why early MoE suffered expert collapse and load imbalance, and what MoE 2.0 changed with load-balancing loss and shared experts.
Then we get practical: the all-to-all communication overhead that can wipe out theoretical speedups, the capacity/overflow tradeoff (and what “…
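The routing math described above (softmax over expert logits → top-K selection → weighted combine) can be sketched in a few lines of NumPy. This is a minimal illustrative sketch, not the video's implementation; names like `route` and `top_k` are assumptions for illustration.

```python
# Hypothetical sketch of MoE top-K routing: softmax -> top-K -> renormalized weights.
import numpy as np

def softmax(x):
    # Numerically stable softmax along the expert dimension.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def route(router_logits, top_k=2):
    """Return (expert indices, combine weights) for each token."""
    probs = softmax(router_logits)                       # (num_tokens, num_experts)
    topk_idx = np.argsort(probs, axis=-1)[:, -top_k:]    # K highest-probability experts
    topk_w = np.take_along_axis(probs, topk_idx, axis=-1)
    topk_w = topk_w / topk_w.sum(axis=-1, keepdims=True) # renormalize so weights sum to 1
    return topk_idx, topk_w

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 8))  # 4 tokens, 8 experts (Mixtral-style)
idx, w = route(logits, top_k=2)
# Each token's output is then sum_k w[:, k] * expert[idx[:, k]](token) —
# only 2 of the 8 expert FFNs run per token, which is the conditional-compute win.
```

The renormalization step matters: after discarding all but K experts, the surviving probabilities are rescaled so the combined output stays on the same scale regardless of how much mass the dropped experts held.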
Watch on YouTube ↗
Chapters (7)
The Promise of Mixture-of-Experts (Mixtral 8x7B)
1:15
Dense Models vs Sparse MoE & Conditional Compute
2:05
Router Mechanics: Softmax, Top-K Selection & Combining
2:55
Early MoE Problems: Expert Collapse & Load Imbalance
3:40
Capacity Limits, Overflow & Token Dropping Strategies
4:40
Load-Balancing Loss, Shared Experts & Hybrid Designs
5:40
All-to-All Communication & Multi-GPU Bottlenecks
DeepCamp AI