Let's pretrain a 3B LLM from scratch on 16+ H100 GPUs, with no details skipped.
In this lecture we pretrain a 3B-parameter LLM from scratch across multiple H100 machines, skipping no details. You will learn how to handle OOM (out of memory) errors and how to develop on cheap GPUs before scaling to multi-GPU. Finally, we run multi-node training with FSDP and explain how to take the model beyond 3B parameters.
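The usual way to handle OOM errors while tuning is to catch the out-of-memory exception and retry with a smaller batch. A minimal sketch of that retry loop is below; it uses Python's built-in `MemoryError` as a stand-in so it runs anywhere, where real training code would catch `torch.cuda.OutOfMemoryError` and call `torch.cuda.empty_cache()` before retrying. The function and parameter names are illustrative, not the template's actual code.

```python
def find_max_batch_size(try_step, start=64, min_bs=1):
    """Halve the batch size until one training step fits in GPU memory.

    try_step(bs) should run a single forward/backward pass at batch size bs
    and raise MemoryError (torch.cuda.OutOfMemoryError in real code) if it
    does not fit.
    """
    bs = start
    while bs >= min_bs:
        try:
            try_step(bs)
            return bs  # this batch size fits
        except MemoryError:
            # In PyTorch you would also call torch.cuda.empty_cache() here
            # to release cached blocks before retrying at half the size.
            bs //= 2
    raise RuntimeError("model does not fit even at the minimum batch size")
```

Once the largest fitting batch size is found, gradient accumulation can recover the original effective batch size.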
This is a full, unedited lecture. By the end, you will have the skills and intuition needed to pretrain and scale LLMs beyond a simple demo.
We start tuning and developing on cheap A10G GPUs, then run on 8 H100 GPUs, and finally scale to two machines.
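On the question of taking the model beyond 3B parameters, a common rule of thumb for sizing the training set is the Chinchilla ratio of roughly 20 training tokens per parameter (Hoffmann et al., 2022). This sketch is a back-of-the-envelope guide, not the exact ratio used in the lecture:

```python
def chinchilla_tokens(n_params: int, tokens_per_param: int = 20) -> int:
    """Rough compute-optimal token count: ~20 tokens per parameter."""
    return n_params * tokens_per_param

# A 3B-parameter model would want on the order of 60B training tokens.
print(f"{chinchilla_tokens(3_000_000_000):,}")  # → 60,000,000,000
```

This is why scaling parameters without scaling the dataset tends to waste compute: a larger model trained on the same tokens is undertrained relative to the compute-optimal frontier.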
Watch on YouTube ↗
Chapters (28)
Introduction
1:40
Run the Llama template
2:19
Llama template overview
5:00
Run the template on 1 GPU (A10G)
6:20
Monitor GPU memory usage
6:40
Code walkthrough
10:30
How to handle OOM (out of memory) errors
13:20
Connect local VSCode (optional)
14:40
Overview of hyperparameters
15:50
Run a hyperparameter sweep to find the context window
24:50
Speed up by 2x on 4 GPUs (A10G)
29:40
VRAM vs power for profiling
33:07
From 1B to 3B parameters
37:00
How to release ghost GPU memory
42:00
Change to machine with 8 x H100 GPUs
42:20
Number of parameters vs data size
45:00
Hyperparameter sweep results
48:00
3B params on the H100 at 4x speed
54:40
Troubleshoot TensorBoard error
58:40
TensorBoard and artifacts on separate Studio for analysis
1:02:00
Measure cloud costs spent so far
1:05:00
Discuss and view data concerns
1:10:20
Getting to steady state
1:10:50
How to increase speed for the 3B parameter model
1:16:00
How to run DeepSpeed, FSDP and other scaling techniques
1:20:00
Start training with multi-node (multiple machines)
1:28:00
Monitor multi-node training
1:29:00
Summary
DeepCamp AI