Generalization and Scaling Laws for Mixture-of-Experts Transformers
📰 ArXiv cs.AI
arXiv:2604.09175v1 Announce Type: cross Abstract: We develop a theory of generalization and scaling for Mixture-of-Experts (MoE) Transformers that cleanly separates \emph{active} per-input capacity from routing combinatorics. By conditioning on fixed routing patterns and union-bounding across them, we derive a sup-norm covering-number bound whose metric entropy scales with the active parameter budget and incurs a MoE-specific routing overhead. Combined with a standard ERM analysis for squared loss …
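The abstract only sketches the argument, so here is a hedged reconstruction of the shape such a bound could take; the symbols below ($N_{\mathrm{act}}$ for the active parameter count, $E$ experts per layer with top-$k$ routing, $L$ MoE layers, $n$ samples, $\epsilon$ the covering radius) are illustrative assumptions, not notation from the paper. Conditioning on a single fixed routing pattern $\pi$, the restricted class behaves like a dense network with only the active parameters:

\[
\log \mathcal{N}\big(\epsilon,\, \mathcal{F}_\pi,\, \|\cdot\|_\infty\big) \;\lesssim\; N_{\mathrm{act}} \log\frac{1}{\epsilon},
\]

and since top-$k$ routing over $E$ experts in each of $L$ layers admits at most $\binom{E}{k}^{L}$ patterns, union-bounding across them adds a routing overhead:

\[
\log \mathcal{N}\big(\epsilon,\, \mathcal{F},\, \|\cdot\|_\infty\big) \;\lesssim\; N_{\mathrm{act}} \log\frac{1}{\epsilon} \;+\; L \log\binom{E}{k}.
\]

Fed into a standard ERM analysis for squared loss, this would yield an excess-risk rate on the order of $\sqrt{\big(N_{\mathrm{act}}\log n + L\log\binom{E}{k}\big)/n}$: the statistical cost is governed by the active capacity plus a combinatorial routing term, matching the separation the abstract describes.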