Distributed Training with PyTorch: complete tutorial with cloud infrastructure and code
A hands-on guide to training a model on multiple GPUs or across multiple servers.
I first describe the difference between Data Parallelism and Model Parallelism. Next, I explain the concept of gradient accumulation, including all the math behind it. Then we get to the practical part: we create a cluster on Paperspace with two servers (each with two GPUs) and train a model on it in a distributed manner.
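As a preview of the gradient-accumulation math covered in the video, here is a minimal framework-free sketch (the model, data, and numbers are made up for illustration): accumulating micro-batch gradients, each weighted by its share of the batch, reproduces the full-batch gradient exactly.

```python
# Hypothetical sketch of gradient accumulation (no framework, invented data):
# the gradient of a mean loss over the full batch equals the weighted sum of
# the per-micro-batch gradients.

def grad_mse(w, xs, ys):
    # dL/dw for L = mean((w*x - y)^2) over the given batch
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.1, 5.9, 8.2]
w = 0.5

full = grad_mse(w, xs, ys)          # gradient of the whole batch at once

acc = 0.0                           # accumulate over micro-batches of size 2
for i in range(0, len(xs), 2):
    mb_x, mb_y = xs[i:i + 2], ys[i:i + 2]
    # weight each micro-batch gradient by its share of the full batch
    acc += grad_mse(w, mb_x, mb_y) * len(mb_x) / len(xs)

assert abs(full - acc) < 1e-12      # identical up to floating-point error
```

In PyTorch, with equal-size micro-batches, this corresponds to scaling each micro-batch loss by `1 / num_accumulation_steps` and calling `backward()` repeatedly without zeroing the gradients, stepping the optimizer only once per accumulation cycle.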
We will explore the collective communication primitives Broadcast, Reduce, and All-Reduce, and the algorithms behind them.
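To give a flavor of the All-Reduce algorithm discussed in the video, here is a hedged pure-Python simulation of a ring All-Reduce (in real training you would call `torch.distributed.all_reduce`; this toy version only models the data movement):

```python
# Hypothetical simulation of ring All-Reduce (sum). Each of n workers holds a
# vector split into n chunks; data only ever moves between ring neighbours.

def ring_all_reduce(per_worker):
    n = len(per_worker)
    size = len(per_worker[0])
    assert size % n == 0, "vector length must be divisible by worker count"
    c = size // n
    # chunks[i][j] = worker i's copy of chunk j
    chunks = [[list(v[j * c:(j + 1) * c]) for j in range(n)] for v in per_worker]

    # Phase 1: reduce-scatter. Each step, worker i sends chunk (i - s) % n to
    # its right neighbour, which adds it into its own copy. After n-1 steps,
    # worker i holds the fully summed chunk (i + 1) % n.
    for s in range(n - 1):
        sends = [(i, (i - s) % n, chunks[i][(i - s) % n]) for i in range(n)]
        for i, j, data in sends:
            dst = (i + 1) % n
            chunks[dst][j] = [a + b for a, b in zip(chunks[dst][j], data)]

    # Phase 2: all-gather. The summed chunks circulate around the ring until
    # every worker has all of them.
    for s in range(n - 1):
        sends = [(i, (i + 1 - s) % n, chunks[i][(i + 1 - s) % n]) for i in range(n)]
        for i, j, data in sends:
            dst = (i + 1) % n
            chunks[dst][j] = list(data)

    return [[x for ch in w for x in ch] for w in chunks]

# Three workers, each with a 3-element gradient vector.
result = ring_all_reduce([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
# Every worker ends up with the elementwise sum [12, 15, 18].
```

Each worker transmits roughly 2·(n−1)/n of its data in total, independent of the number of workers, which is why the ring algorithm scales well in bandwidth.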
I also provide a template on how to integrate…
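As context for the LOCAL RANK vs GLOBAL RANK chapter, a tiny sketch of how the ranks relate for the cluster described above (2 nodes × 2 GPUs); the arithmetic follows the standard torchrun convention, which exposes these values via the `LOCAL_RANK`, `RANK`, and `WORLD_SIZE` environment variables:

```python
# Hypothetical illustration of torchrun-style rank numbering for a cluster of
# 2 nodes with 2 GPUs each. We compute the mapping directly instead of reading
# the environment variables torchrun would set.

nodes = 2
gpus_per_node = 2
world_size = nodes * gpus_per_node   # total number of processes: 4

ranks = []
for node_rank in range(nodes):
    for local_rank in range(gpus_per_node):
        # local_rank indexes GPUs within a node; global rank is unique cluster-wide
        global_rank = node_rank * gpus_per_node + local_rank
        ranks.append((node_rank, local_rank, global_rank))

# Two processes share local_rank 0 (one per node), but global ranks are unique.
assert sorted(r[2] for r in ranks) == [0, 1, 2, 3]
```

The local rank is what you pass to `torch.cuda.set_device`, while the global rank identifies the process in collective operations.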
Watch on YouTube ↗
Chapters (18)
Introduction (2:43)
What is distributed training? (4:44)
Data Parallelism vs Model Parallelism (6:25)
Gradient accumulation (19:38)
Distributed Data Parallel (26:24)
Collective Communication Primitives (28:39)
Broadcast operator (30:28)
Reduce operator (32:39)
All-Reduce (33:20)
Failover (36:14)
Creating the cluster (Paperspace) (49:00)
Distributed Training with TorchRun (54:57)
LOCAL RANK vs GLOBAL RANK (56:05)
Code walkthrough (1:06:47)
No_Sync context (1:08:48)
Computation-Communication overlap (1:10:50)
Bucketing (1:12:11)
Conclusion
DeepCamp AI