Let's pretrain a 3B LLM from scratch on 16+ H100 GPUs, with no details skipped.
In this lecture we pretrain a 3B-parameter LLM from scratch across multiple H100 machines, skipping no details. You will learn how to handle OOM (out of memory) errors and how to develop on cheap GPUs before scaling to multi-GPU. Finally, we run multi-node training with FSDP and explain how to take the model beyond 3B parameters.
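The usual way to handle OOM errors while tuning is to catch the out-of-memory exception and retry with a smaller batch. A minimal sketch of that retry loop is below; it uses Python's built-in `MemoryError` as a stand-in so it runs anywhere, where real training code would catch `torch.cuda.OutOfMemoryError` and call `torch.cuda.empty_cache()` before retrying. The function and parameter names are illustrative, not the template's actual code.

```python
def find_max_batch_size(try_step, start=64, min_bs=1):
    """Halve the batch size until one training step fits in GPU memory.

    try_step(bs) should run a single forward/backward pass at batch size bs
    and raise MemoryError (torch.cuda.OutOfMemoryError in real code) if it
    does not fit.
    """
    bs = start
    while bs >= min_bs:
        try:
            try_step(bs)
            return bs  # this batch size fits
        except MemoryError:
            # In PyTorch you would also call torch.cuda.empty_cache() here
            # to release cached blocks before retrying at half the size.
            bs //= 2
    raise RuntimeError("model does not fit even at the minimum batch size")
```

Once the largest fitting batch size is found, gradient accumulation can recover the original effective batch size.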
This is a full, unedited lecture. By the end, you will have the skills and intuition needed to pretrain and scale LLMs beyond a simple demo.
We start tuning and developing on cheap A10G GPUs, then run on 8 H100 GPUs, and finally scale to two machines.
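On the question of taking the model beyond 3B parameters, a common rule of thumb for sizing the training set is the Chinchilla ratio of roughly 20 training tokens per parameter (Hoffmann et al., 2022). This sketch is a back-of-the-envelope guide, not the exact ratio used in the lecture:

```python
def chinchilla_tokens(n_params: int, tokens_per_param: int = 20) -> int:
    """Rough compute-optimal token count: ~20 tokens per parameter."""
    return n_params * tokens_per_param

# A 3B-parameter model would want on the order of 60B training tokens.
print(f"{chinchilla_tokens(3_000_000_000):,}")  # → 60,000,000,000
```

This is why scaling parameters without scaling the dataset tends to waste compute: a larger model trained on the same tokens is undertrained relative to the compute-optimal frontier.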
Watch on YouTube ↗
Chapters (28)
Introduction
1:40
Run the Llama template
2:19
Llama template overview
5:00
Run the template on 1 GPU (A10G)
6:20
Monitor GPU memory usage
6:40
Code walkthrough
10:30
How to handle OOM (out of memory) errors
13:20
Connect local VSCode (optional)
14:40
Overview of hyperparameters
15:50
Run a hyperparameter sweep to find the context window
24:50
Speed up by 2x on 4 GPUs (A10G)
29:40
VRAM vs power for profiling
33:07
From 1B to 3B parameters
37:00
How to release ghost GPU memory
42:00
Change to machine with 8 x H100 GPUs
42:20
Number of parameters vs data size
45:00
Hyperparameter sweep results
48:00
3B params on the H100 at 4x speed
54:40
Troubleshoot TensorBoard error
58:40
TensorBoard and artifacts on separate Studio for analysis
1:02:00
Measure cloud costs spent so far
1:05:00
Discuss and view data concerns
1:10:20
Getting to steady state
1:10:50
How to increase speed for the 3B parameter model
1:16:00
How to run DeepSpeed, FSDP and other scaling techniques
1:20:00
Start training with multi-node (multiple machines)
1:28:00
Monitor multi-node training
1:29:00
Summary
DeepCamp AI