Generalization and Scaling Laws for Mixture-of-Experts Transformers

📰 ArXiv cs.AI

arXiv:2604.09175v1 Announce Type: cross

Abstract: We develop a theory of generalization and scaling for Mixture-of-Experts (MoE) Transformers that cleanly separates \emph{active} per-input capacity from routing combinatorics. By conditioning on fixed routing patterns and union-bounding across them, we derive a sup-norm covering-number bound whose metric entropy scales with the active parameter budget and incurs a MoE-specific routing overhead. Combined with a standard ERM analysis for squared loss …
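The structure of the claimed bound can be sketched in hedged form. The notation below is an assumption, not the paper's: $E$ experts, top-$k$ routing, $T$ tokens per input, $N_{\mathrm{act}}$ active parameters, $n$ training samples, and $R$ the set of realizable routing patterns.

```latex
% Plausible shape of the argument described in the abstract (assumed notation):
% condition on each fixed routing pattern r, cover the resulting
% N_act-parameter subnetwork at scale eps, then union-bound over r in R.
% Since |R| <= binom(E,k)^T, the routing overhead is log|R| <= T k log E.
\[
  \sup_{r \in R} \bigl|\hat{L}(f_r) - L(f_r)\bigr|
  \;\lesssim\;
  \sqrt{\frac{N_{\mathrm{act}} \log(1/\varepsilon) \;+\; \log|R|}{n}},
  \qquad
  \log|R| \;\le\; T\,k\,\log E .
\]
```

Under this reading, the metric entropy term scales with the active budget $N_{\mathrm{act}}$ rather than total parameters, and the MoE-specific overhead $\log|R|$ is the price of the union bound over routing patterns.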

Published 13 Apr 2026