Deep Optimizer States: Towards Scalable Training of Transformer Models Using Interleaved Offloading
📰 ArXiv cs.AI
arXiv:2410.21316v2 Announce Type: replace-cross Abstract: Transformers and large language models (LLMs) have seen rapid adoption across domains. Their sizes have exploded to hundreds of billions of parameters and keep increasing. Under these circumstances, training transformers is very expensive and often hits a "memory wall": even when using 3D parallelism (pipeline, tensor, data) and aggregating the memory of many GPUs, it is still not enough to hold the necessary data structures.
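The feed truncates the abstract at this point. For context on why optimizer states are the usual offloading target: Adam-style optimizers keep two full-precision moment tensors (plus master weights) per parameter, so the optimizer state typically dwarfs the model weights themselves. As a rough illustration of the general idea the title refers to, here is a minimal PyTorch sketch of offloading optimizer state to host memory, in the spirit of ZeRO-Offload. This is not the paper's interleaved offloading scheme; the class name and all implementation details are hypothetical.

```python
import torch

# Illustrative ZeRO-Offload-style optimizer (hypothetical, not the paper's
# method): parameters and gradients stay on the GPU, while the Adam moments
# and a master copy of the weights are kept in pinned host memory.
class CPUOffloadAdam:
    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
        self.gpu_params = [p for p in params]
        # Host-side master weights and moments: the "optimizer states"
        # that dominate GPU memory when kept on-device.
        self.cpu_params = [p.detach().cpu().pin_memory() for p in self.gpu_params]
        self.m = [torch.zeros_like(p) for p in self.cpu_params]
        self.v = [torch.zeros_like(p) for p in self.cpu_params]
        self.lr, self.betas, self.eps, self.t = lr, betas, eps, 0

    @torch.no_grad()
    def step(self):
        self.t += 1
        b1, b2 = self.betas
        for gp, cp, m, v in zip(self.gpu_params, self.cpu_params, self.m, self.v):
            g = gp.grad.detach().cpu()        # offload gradient to host
            m.mul_(b1).add_(g, alpha=1 - b1)  # Adam update runs on the CPU
            v.mul_(b2).addcmul_(g, g, value=1 - b2)
            denom = (v / (1 - b2 ** self.t)).sqrt_().add_(self.eps)
            cp.addcdiv_(m, denom, value=-self.lr / (1 - b1 ** self.t))
            gp.copy_(cp)                      # upload updated weights to the GPU


# Minimal usage (assumes a CUDA device is present):
if torch.cuda.is_available():
    model = torch.nn.Linear(1024, 1024).cuda()
    opt = CPUOffloadAdam(model.parameters())
    loss = model(torch.randn(8, 1024, device="cuda")).square().mean()
    loss.backward()
    opt.step()
```

In this naive form the CPU update and the host-device copies sit on the critical path of every step; the interleaving suggested by the paper's title presumably aims to hide such transfers behind computation.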