FlashCP: Load-Balanced Communication-Efficient Context Parallelism for LLM Training
📰 ArXiv cs.AI
arXiv:2606.08476v1 Announce Type: cross Abstract: Context parallelism (CP) is essential for training large-scale, long-context language models, as it partitions sequences to reduce memory overhead. However, existing CP methods suffer from workload imbalance, inefficient kernels, and redundant communication due to static sequence sharding and key-value (KV) tensor communication. We present FlashCP, a load-balanced and communication-efficient framework for CP training. FlashCP introduces a shardin
DeepCamp AI