I explain Fully Sharded Data Parallel (FSDP) and pipeline parallelism in 3D with Vision Pro
Build intuition about how training massive LLMs scales. I cover two techniques for speeding up LLM training, Fully Sharded Data Parallel (FSDP) and pipeline parallelism, in 3D with the Vision Pro. I'm excited to see how AR can help teach complex ideas.
It's been a long-time dream of mine to show, conceptually, how I visualize these systems.
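To make the first technique concrete before the chapters below: the core idea of FSDP is that each GPU stores only a shard of a layer's weights, all-gathers the full weight just before it is needed, and frees it right after. Here is a minimal toy sketch of that idea in NumPy (a hypothetical illustration, not code from the video; the shapes and names are made up):

```python
# Toy sketch of FSDP's core idea: each simulated "GPU" keeps only its
# shard of one layer's weight; the full weight exists only briefly,
# during that layer's forward pass.
import numpy as np

N_GPUS = 4
rng = np.random.default_rng(0)
full_weight = rng.standard_normal((8, 8))  # one layer's weight matrix

# Shard the weight row-wise across the 4 simulated GPUs.
shards = np.split(full_weight, N_GPUS, axis=0)  # each GPU holds a (2, 8) shard

def all_gather(shards):
    """Simulated all-gather: reconstruct the full weight from all shards."""
    return np.concatenate(shards, axis=0)

def forward(x, shards):
    w = all_gather(shards)  # materialize the full weight (the memory peak)
    out = x @ w.T           # use it for this layer's forward pass
    del w                   # free it; only the local shard stays resident
    return out

x = rng.standard_normal((1, 8))
y = forward(x, shards)
# Sharding changes where weights live, not the math:
assert np.allclose(y, x @ full_weight.T)
```

The same gather-use-free pattern repeats per layer in the backward pass, which is why peak memory is bounded by one layer's full weights plus the local shards, a point the "Memory upper bound" chapter covers.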
Chapters:
00:00 Introduction
01:02 Two machines each with 2 GPUs
01:37 Transformer models blocks
02:02 Forward pass
02:10 Backward pass
02:43 Fully Sharded Data Parallel introduction
02:51 Layer sharding
03:30 Weight concat
05:25 Memory upper bound
05:58 Why more GPUs speed up training
07:23 Shard across nodes (machines)
09:20 Sharding a block across nodes
10:14 Another way of seeing sharding
11:30 Understand interconnect bottleneck
12:00 Hybrid sharding
15:00 Pipeline parallelism
16:04 Forward pass in pipeline parallelism
16:10 Intuition around pipeline parallelism
16:50 Future directions on pipeline parallelism
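For the pipeline-parallelism chapters, the key intuition is that splitting a batch into microbatches lets all stages work concurrently: with S stages and M microbatches, the forward pass takes about S + M - 1 time steps instead of the S × M steps a one-batch-at-a-time schedule would need. A toy tick-counting sketch (hypothetical illustration; the numbers are made up):

```python
# Toy pipeline schedule: microbatch m enters stage 0 at tick m and
# leaves stage s at tick m + s, so stages overlap across microbatches.
S, M = 4, 8  # 4 pipeline stages, 8 microbatches

def pipeline_ticks(stages, microbatches):
    # Last microbatch (M - 1) leaves the last stage (S - 1) at tick
    # (M - 1) + (S - 1), so the whole forward pass takes S + M - 1 ticks.
    return max(m + s for m in range(microbatches)
                     for s in range(stages)) + 1

def sequential_ticks(stages, microbatches):
    # Naive schedule: each microbatch crosses all stages before the next starts.
    return stages * microbatches

print(pipeline_ticks(S, M))    # 11 ticks (= S + M - 1)
print(sequential_ticks(S, M))  # 32 ticks
```

The S - 1 ticks where some stages sit idle at the start and end are the "pipeline bubble," which is why the speedup improves as M grows relative to S.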
DeepCamp AI