MegaTrain: Full Precision Training of 100B+ Parameter LLMs on a Single GPU

📰 Hacker News (AI)

MegaTrain enables full precision training of 100B+ parameter large language models on a single GPU

Level: Advanced · Published 8 Apr 2026
Action Steps
  1. Understand the limitations of traditional GPU-centric systems in training large language models
  2. Learn about MegaTrain's memory-centric approach, which stores parameters and optimizer states in host memory
  3. Implement pipelined double-buffered execution engine to overlap parameter prefetching, computation, and gradient offloading
  4. Replace persistent autograd graphs with stateless layer templates, eliminating long-lived graph metadata
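
Steps 2 and 3 above can be illustrated with a minimal sketch. The layer count, parameter values, and pipeline shape here are illustrative assumptions, not MegaTrain's actual API: parameters live in host memory, a background thread prefetches the next layer's parameters while the main thread computes the current layer, and each layer's gradients are offloaded back to host memory.

```python
import queue
import threading

# Per-layer parameters kept in host (CPU) memory, mirroring the
# memory-centric design. Layer count and values are illustrative only.
NUM_LAYERS = 6
host_params = {i: [float(i)] * 4 for i in range(NUM_LAYERS)}
host_grads = {}

def train_step():
    """One pass over the layers with double buffering: a background thread
    prefetches layer i+1's parameters while the main thread computes layer i,
    then offloads that layer's 'gradients' back to host memory."""
    prefetched = queue.Queue(maxsize=2)  # two slots = double buffer

    def prefetcher():
        for i in range(NUM_LAYERS):
            # Stands in for an asynchronous host-to-GPU parameter copy.
            prefetched.put((i, list(host_params[i])))

    t = threading.Thread(target=prefetcher)
    t.start()

    losses = []
    for _ in range(NUM_LAYERS):
        i, params = prefetched.get()          # blocks until the copy "arrives"
        loss = 0.5 * sum(params)              # stand-in for layer computation
        losses.append(loss)
        host_grads[i] = [loss] * len(params)  # offload gradients to host memory
    t.join()
    return losses

print(train_step())  # → [0.0, 2.0, 4.0, 6.0, 8.0, 10.0]
```

The bounded two-slot queue is what makes this "double-buffered": the prefetcher can stay at most one layer ahead of compute, so GPU-side memory for parameters stays constant no matter how many layers the model has.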
Who Needs to Know This

AI engineers and researchers can benefit from MegaTrain's ability to train large language models on limited hardware, while systems-oriented software engineers will appreciate its memory-centric design and pipelining optimizations

Key Insight

💡 MegaTrain's memory-centric design and optimizations enable efficient training of large language models on a single GPU

Share This
🚀 Train 100B+ parameter LLMs on a single GPU with MegaTrain! 🤖