Mixture-of-Experts Routing: Visually Explained

Tales Of Tensors · Advanced · 🧠 Large Language Models · 1mo ago
Mixtral “8×7B” has ~47B total parameters, yet only a small slice activates per token, because a router sends each token to a top-K set of experts and combines their outputs. But MoE isn’t “pick two experts and you’re done.” We’ll walk through the real engineering story: the routing math (softmax → top-K → weighted combine), why early MoE suffered from expert collapse and load imbalance, and what MoE 2.0 changed with load-balancing losses and shared experts. Then we get practical: the all-to-all communication overhead that can wipe out theoretical speedups, the capacity/overflow tradeoff (and what “…
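The softmax → top-K → weighted-combine routing described above can be sketched in a few lines of NumPy. This is a toy illustration, not Mixtral's actual implementation; all names (`route_tokens`, `router_weights`) are made up for the example.

```python
import numpy as np

def route_tokens(token_embeds, router_weights, top_k=2):
    """Toy top-K MoE router: softmax over expert logits, keep the
    top-K experts per token, renormalize their probabilities into
    combine weights. Illustrative only, not a real library API."""
    logits = token_embeds @ router_weights           # (tokens, experts)
    logits -= logits.max(axis=-1, keepdims=True)     # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=-1, keepdims=True)       # softmax over experts

    # Indices of the top-K experts for each token
    topk_idx = np.argsort(probs, axis=-1)[:, -top_k:]
    topk_w = np.take_along_axis(probs, topk_idx, axis=-1)
    topk_w /= topk_w.sum(axis=-1, keepdims=True)     # renormalize kept probs
    return topk_idx, topk_w

rng = np.random.default_rng(0)
tokens = rng.standard_normal((4, 16))   # 4 tokens, hidden dim 16
W = rng.standard_normal((16, 8))        # router for 8 experts
idx, w = route_tokens(tokens, W)
print(idx.shape, w.shape)               # (4, 2) (4, 2)
```

Each token's output is then the sum of its K selected experts' outputs, scaled by these weights; dispatching each token's row only to the experts in `idx` is what makes the compute sparse.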
Watch on YouTube ↗

Chapters (7)

The Promise of Mixture-of-Experts (Mixtral 8x7B)
1:15 Dense Models vs Sparse MoE & Conditional Compute
2:05 Router Mechanics: Softmax, Top-K Selection & Combining
2:55 Early MoE Problems: Expert Collapse & Load Imbalance
3:40 Capacity Limits, Overflow & Token Dropping Strategies
4:40 Load-Balancing Loss, Shared Experts & Hybrid Designs
5:40 All-to-All Communication & Multi-GPU Bottlenecks
Next Up
5 Levels of AI Agents - From Simple LLM Calls to Multi-Agent Systems
Dave Ebbelaar (LLM Eng)