Second-Order Multi-Level Variance Correction for Modality Competition in Multimodal Models

📰 ArXiv cs.AI

arXiv:2605.16165v1 Announce Type: cross

Abstract: Autoregressive next-token training offers a unified formulation for image generation and text understanding, but it also creates strong modality competition that destabilizes optimization and limits large-batch scaling. We show that first-order optimizers such as AdamW are vulnerable to cross-modality gradient heterogeneity, while second-order preconditioning, particularly SOAP, provides a more stable basis for multimodal alignment. Building on t

Published 18 May 2026