SparseBalance: Load-Balanced Long Context Training with Dynamic Sparse Attention
📰 ArXiv cs.AI
arXiv:2604.13847v1 Announce Type: cross Abstract: While sparse attention mitigates the computational bottleneck of long-context LLM training, its distributed training process exhibits extreme heterogeneity in both 1) sequence length and 2) sparsity sensitivity, leading to a severe load imbalance problem and sub-optimal model accuracy. Existing algorithms and training frameworks typically focus on a single issue, failing to systematically co-optimize these two problems. Therefore, we p