Nexusformer: Nonlinear Attention Expansion for Stable and Inheritable Transformer Scaling

📰 ArXiv cs.AI

arXiv:2604.19147v1 (cross-listed)

Abstract: Scaling Transformers typically requires training larger models from scratch, because standard architectures cannot expand without discarding learned representations. We identify the primary bottleneck as the attention mechanism's linear projections, which strictly confine feature extraction to fixed-dimensional subspaces, limiting both expressivity and incremental capacity. To address this, we introduce Nexusformer, which replaces the linear $Q/K$ …
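The abstract is truncated before the mechanism is fully described, so the sketch below is only an illustration of the general idea it gestures at, not the paper's method: replacing a linear $Q/K$ projection with a small nonlinear expansion whose hidden width can later be grown without changing the function the module computes. The class name `ExpandableNonlinearProjection`, the GELU bottleneck, and the zero-initialized `expand` step are all assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExpandableNonlinearProjection(nn.Module):
    """Hypothetical nonlinear stand-in for a linear Q/K projection whose
    hidden width can be grown in place, preserving learned behavior."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.up = nn.Linear(d_model, d_hidden)
        self.down = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Nonlinear expansion: project up, apply GELU, project back down.
        return self.down(F.gelu(self.up(x)))

    @torch.no_grad()
    def expand(self, extra: int) -> None:
        """Add `extra` hidden units, zero-initialized so the module's
        output is identical at the moment of expansion."""
        d_model, d_hidden = self.up.in_features, self.up.out_features
        new_up = nn.Linear(d_model, d_hidden + extra)
        new_down = nn.Linear(d_hidden + extra, d_model)
        # Zero everything, then copy the old weights into the leading slots;
        # the new units contribute gelu(0) = 0 through zeroed down-columns.
        new_up.weight.zero_(); new_up.bias.zero_()
        new_down.weight.zero_()
        new_up.weight[:d_hidden] = self.up.weight
        new_up.bias[:d_hidden] = self.up.bias
        new_down.weight[:, :d_hidden] = self.down.weight
        new_down.bias.copy_(self.down.bias)
        self.up, self.down = new_up, new_down

# Usage: expanding capacity leaves existing outputs unchanged.
proj = ExpandableNonlinearProjection(d_model=64, d_hidden=128)
x = torch.randn(2, 10, 64)
y_before = proj(x)
proj.expand(extra=64)  # grow hidden width 128 -> 192
assert torch.allclose(y_before, proj(x), atol=1e-6)
```

Zero-initializing the new units is one common way to make capacity growth function-preserving, which matches the abstract's stated goal of expanding without discarding learned representations; whether Nexusformer uses this particular scheme is not stated in the visible text.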

Published 22 Apr 2026