Distributed Training with PyTorch: complete tutorial with cloud infrastructure and code

Umar Jamil · Beginner ·🛠️ AI Tools & Apps ·2y ago

Skills: ML Maths Basics90%Distributed Systems90%ML Pipelines80%Supervised Learning70%AI Systems Design70%

Key Takeaways

This video provides a comprehensive tutorial on distributed training with PyTorch, covering data parallelism, model parallelism, gradient accumulation, and communication primitives. It includes a practical tutorial on how to train a model on multiple GPUs or multiple servers using PyTorch and cloud infrastructure.

Full Transcript

hello guys welcome back to my Channel today we are going to talk about the distributed training of a model with pytorch what do I mean well imagine you have a model that is very big and you have a lot of data upon which you want to train this model but it doesn't it takes a lot of time to train it because maybe your your GPU is not so powerful or maybe you cannot use a large B size or just you have a lot of data to train this model and you want to train it in a reasonable time one way to do it is to parallelize the training on using multiple gpus or even multiple computers each one having multiple gpus or even one GPU in this video I will show you how this distributed training Works how to integrate it in your existing project and we will combine Theory with practice which means that I will give you all the theory behind distributed training and then also show you how to first of all build the cloud infrastructure to do uh Cloud um training of a model and also how to integrate the distributed training into an existing model that you have already built and it's maybe already training on a single GPU so let's review the topics of today first of all I will give you an introduction to distributed training and show you the two ways in which we can do it one is data parallelism and one in model parallelism in this video I will explore data parallelism we will see a little bit of the neural networks Theory because I want to introduce a concept called gradient accumulation which is very important for distributed training and later we will see distributed training as it is built in py so distributed data parallel we'll see how it works and I will also show show you what we mean by communication Primitives because maybe you have heard terms like all reduce reduce broadcast all gather Etc but you maybe don't know what they are or how they work in this video I will show you the algorithms upon which the these operations are built and how they work when we talk about distributed training I will also show you how to manage failover of a node so imagine you are training on multiple nodes and one of the nodes suddenly dies how we manage this situation we will do a little coding session I will not write code I will just take an existing code and and show you how to add the distributed training to an existing model and I will be using a model that I built in a previous video so it's the video about the how to code a transform model from scratch so if you have already watched my previous video on how to do it it's wonderful if you didn't it's not a problem because anyway the knowledge I will give you today applies to any model that you have built not only the one wrote by me and I will also show you the underline mechanism in which pytorch does distributed training so I will introduce you the computation communication overlap and the bucketing which are optimizations made in the Pythor implementation of the distributed training and so we will explore all these topics let's start our journey so let's start by introducing what is distributed training imagine you want to train a language model on a very big data set for example the entire content of Wikipedia now the data set is quite big because it's made up of millions of Articles and each article has maybe thousands of tokens to train this model on a single GPU may be possible but it poses some challenges first of all the model may not fit on a single GPU this happens when the model has many parameters this happens for example also with the latest llama they have billions of parameters and with so many parameters the ram of the GPU may not be enough or you are forced to use a small B size because a bigger bet size leads to the out of error out of memory error on Cuda or the model may take years to train because the data set is very big if any of the above points applies to you then you need to scale up your training setup and scaling can be done vertically or horizontally let's compare the two options vertical scaling means that you have a single uh machine for example a single GPU which has some RAM and the GPU has some memory for example 4 gigabyte vertical scaling means that you just take the you just buy a bigger computer and or a bigger GPU so you take the existing machine and you upgrade the hardware and this requires no code change because the code that was running on the small machine can also run on the big machine you just maybe need to adjust a little bit of the parameters maybe you can increase the batch size now with horizontal scaling we have we go from a single machine to multiple machines each one interconnected with each other that are communicating for this parallelization uh training and each machine may have one or multiple gpus and this requires code change but this code is min this code change is minimal thanks to pytorch and its implementation of the distributed data parallel and in this video we will explore horizontal scaling because vertical scaling basically we don't need to do anything the code there is no code change so there are two ways of Distributing training of a model one is called Data parallelism and one is called Model parallelism so if the model can fit within a single GPU then we can distribute the training on multiple servers each one having one or more gpus with each gpus processing a subset of the entire data in parallel and synchronizing the gradients during back propagation this technique is known as data parallelism so we have one model that can fit within a single GPU and we have a lot of data what we do is basically we split this data into subsets and we train it on multiple computers in parallel such that each GPU is training on a subset of this entire data set if if the model however cannot fit within a single GPU then we need to break the model into smaller pieces into smaller layers and each GPU process a part of the forward and the backward step during radient descent this option is known as model parallelism so imagine this model initial model doesn't fit on a single GPU so we can break it into layers and each computer is managing a layer not um the entire model and of course we can also combine data parallelism with model parallelism to create a hybrid option in this video we will focus on data parallelism so we pretend that we have a model that is complex but it can fit on a single GPU but we have a lot of training data and it's taking really forever to train it on a single computer so we want to distribute this training on multiple computers or multiple gpus such that each GPU will train on a subset of the data okay we need to review a little bit of the neural networks because I need to introduce a concept called gradient accumulation okay imagine you want to train a neural network to predict the price of a house given two variables the number of bedroom in the house and we will call this variable X1 and the number of bathrooms in the house and we will call this variable X2 we think because of some intuition that the relationship between the input and the output variable is linear so we can write that the predicted price is equal to X1 multiplied by the first weight plus X2 multipli by the second weight plus a bias term our goal is to use stochastic gradi in Des ENT to find the values of the parameters W1 and W2 and bias such that the mean squared error loss between the actual uh house price and the predicted house price is minimized in other way words we want to find W1 W2 and B such that we minimize this quantity here how do we do uh training loop with py torch without gradient accumulation so first of all we uh run the training for a few EPO we take our training data which is made up of X1 X2 and the Target price we calculate the output of the model which is basically y PR is equal to X1 * W1 plus X2 * W2 plus the bias then we calculate the loss we do backward on the loss this will calculate the gradient of the loss function with respect to each parameter and then we update the parameters using the gradient we have calculated I am not using the optimizer I'm just writing the the update rule by hand so using a learning rate I don't use the momentum just for Simplicity and this is actually how the training Loop works right so this is this part here is equivalent to calling Optimizer do step and this part here is equivalent to calling Optimizer do0 let's see it graphically what happens when we do a training Loop like this so imagine we have our expression which represents the model which could in this case it's a very simple expression but imagine it's a very complex model okay uh what pytorch will do it will create a computation graph so it will take our input it will multiply it by the W parameters each input with its own weight then it will combine the sum of the two plus the biased term this will produce an output then we have a target the target is compared with the output to produce produce a loss and then we run back propagation to calculate the uh the gradient of the loss function with respect to the Target so let's visualize the training process one item at a time using our computation graph uh imagine we are training on the input X1 is equal to 6 X2 is equal to two and the target is 15 the first thing our model will do is will start from the two input nodes so X1 and X2 it will multiply by the W weights and we also think that the W weights have been initialized with the following value so the value of this weight is 0.773 this weight is 0.321 and this why it is 0.067 these values could be randomly generated as as we usually do actually uh pytorch will produce the output of this node which is multiplying the X1 value with W1 then it will uh combine the two producing the output of the model it will compare it with the target to produce a loss which is a number but it's also a function so we can calculate the gradient so now usually we call loss. backward to calculate the gradient of the loss function with respect to each parameter so W1 W2 and B pytorch will also compute the intermediate nodes but I will not show them here for Simplicity so we run loss. backward what will loss. backward do it will calculate the gradient of the loss function with respect to each weight how does it do it well the first thing we do it will do it will calculate the gradient of the loss function with respect to the output which can be done like this then it will calculate the gradient of the output with respect to this node I here I am showing only the computation of the gradient of the loss function with respect to the W1 uh node you can do the other ones for exercise for example so next it will compute the um the gradient of the L function with respect to the Z1 node but in order to do it because of the chain rule we need the gradient of the output with respect to the Z1 node and then it will compute the output the gradient of the Z1 node with respect to W1 and this allow us to calculate the gradient of the loss function with respect to W1 these are all the calculations that you need to do in order to get the gradient of this node here and um as you can see we also need to compute the gradient of the intermediate node but we are only interested in the in order to arrive to the gradient of the parameters of the loss function with respect to the parameters and these are the values for the gradient that I that we have calculated for each parameter the next thing we do in during training is we run Optimizer do step this will update the value of each parameter using the gradient that we have calculated and how do we do it we say that the new value of the parameter is equal to the old value of the parameter minus um learning rate multiplied by the gradient why minus because the gradient indicates the direction in which the function grows the most but we want to move against this direction because we want to minimize the the loss so we put a minus sign here and this is how the up value will will be updated using the gradient that we have calculated in the previous step which was the loss. backward function the next thing we do is we run Optimizer do0 so this will zero out all the gradients of each parameter and also the intermediate nodes and then we do the next iteration on the next item so imagine we are training one item at a time so batch size is one uh the next item may have X1 is equal to 5 X2 is equal to two and the target may be 12 this will produce this loss and this output the next thing we do is we run loss. backward loss. backward we calculate the gradient of the loss function with respect to each weight and you can see the gradient here the next thing we do is we run Optimizer do step which will update the value of the parameter using the gradient that we calculated in the previous step and the learning rate and finally we run Optimizer do0 and this will res all the gradient for all the weights we may want to visualize this on this process on the loss function so let's see what happens to the loss function while we are doing this process we started some with some initial weights so remember we started with some randomly initialized weights values then we ran forward step on the First Data item this calculated an output and then we run loss to backward this resulted in a gradient being calculated for the loss function with respect to each weight this indicates a direction in which we should move our weights in order to minimize the loss because we move against the direction of the gradient so the arrow is pointing already in the negative direction of the gradient the next thing we do is we do Optimizer do step Optimizer do step will take our initial weights and will change the weights value according to the negative uh direction of the gradient we then run Optimizer do0 this just resets the value of the gradient and then we run forward on the next data item this will then we run loss dot backward and this will result in a calculation of a gradient which indicates the direction in which we need to move our weights in order to minimize the loss function and then we actually modify our weight using Optimizer do step so the actual Mod update of the parameters values happens when we call Optimizer do step and finally we run Optimizer do0 and this will uh reset the gradient to zero and this is how gradient descent Works without accumulation and without without gradient accumulation at every step so every data item because here we pretend we have a batch size equal to one we update the parameters of the model at every step and um however with gradient accumulation we don't do at every step let's see how it works so the the initial part of the training loop with gradient accumulation is the same so we have run for a few epics we extract the data from our training data so X1 X2 and Target we calculated the loss just like before we run loss. backward just like before the difference is here that we don't update the value of our parameter at every item we uh train our model upon or every batch we train our model upon but we do it at every few items or every few batches in this case uh I do it uh in this training Loop we do it every two items uh because we are extracting one item at a time from our training data and uh so we update the the parameters using the gradient every two items in this case and what happens is that when the first item for example we will arrive to lost. backward we will not run this code here we will restart the loop at the next item we will calculate the output we run again loss dot backward and this loss will this loss. backward will calculate a gradient but this gradient will not replace the the pre the gradient that we calculated in the previous step but it will be accumulated it will summed up to the previous gradient so let's visualize this process step by step imagine we have our first item and this is X1 is equal 6 X2 is equal to 2 and the target is 15 We Run The forward Loop using this item this will result in an output being calculated and using the target we can calculate a loss we then run the loss. backward this will result in a gradient being calculated for this item then we don't update the parameters we do forward step on the next item so note that during this forward step the gradient is not zero because we didn't zero it out in the previous step because because in the previous step we didn't run Optimizer do step or Optimizer do0 so the gradient is still there from the previous item and um okay the second item which is X1 is equal to 5 X2 is equal to two and the target is 12 will result in a loss being calculated and an output we then run loss. backward and pytorch will not replace this new gradient with the previous one will sum it up it will be accumulated that's why we call it gradient accumulation so the new gradient is accumulated with the old one now that we have reached the batch size we can now finally call the optimizer do step method this will result in the values of the parameter being updated using the accumulative the cumulative gradient and then we run Optimizer do0 which will reset the uh gradient to zero and then we can proceed with another loop of two B two items Etc let's visualize what happens to our loss function so to our parameter and the loss function when we do gradient descent with gradient accumulation so our initial weights are here because they are randomly initialized We Run The forward loop on the first uh item this will result in a in an output being calculated by the model then we run Los loss dot backward this will result in the gradient being calculated with respect to the First Data item and this gradient will indicate the direction then we run again R forward on the second data item and then we run loss do backward in the second data item this also will result in gradient being calculated but this two gradient will then be summed up by P torch and this will result in a resulting Vector that indicates a direction and then we optim we run the optimizer so we do Optimizer do step which will update the values of our weights according to the direction indicated by the sum of the two gradients of the two previous data items and then we run Optimizer do zero this will result the gradients to zero for all the nodes and so with gradient accumulation we update the parameters of the model only after we accumulated the gradient of a batch and we can decide how much we want uh this patch to be so gradient accumulation is used not only distributed training it's also used for example when you want to accumulate the gradient without increasing the batch size because maybe a bigger batch size doesn't fit in your GPU so so you can accumulate the gradient for more than one item and then move the parameters because this will result in a more smooth training the loss Will ulate less for example and um now that we have seen how gradient accumulation Works let's see distributed data parallel training in detail so how does it work what we mean by communication Primitives and how do we manage the failover okay distributed data parallel training so imagine you have a training uh script that is running on a single computer but it's very slow because the data set is very big and you can't use a bigger batch size because it will result in an out of memory error on Coda dist distributed data parallel is the solution in this case it it works in the following three scenarios so imagine you have model that can fit on a single GPU and you have a training Loop that works but it's very slow so the first thing you can do is you want to train the model on multiple servers each one having a single GP and this can be done with distributed uh data train distributed data parallel the second setup that you can use is you want to just increas the increase the number of gpus on your existing computer and this one also can be managed by distributed data parallel training the final setup that you can that you may want is you want multiple computers and each one having more than one GPU in this video I will show you how to implement this setup here because the other two setup are just a particular case of this one so this is the most generic setup we can have and I will show you how to uh create the cloud infrastructure for this setup and also how to run the to convert an existing code into a distributed training code and also run it on the cluster that we will create so from now on I will use the term node and GPU interchangeably if a cluster is made up of two computers and each computer has two gpus then we total we will have four nodes or four gpus distributed data parallel works in the following way so at the beginning of the training the model's weights are initialized on one node so on one GPU and then send to all the other nodes using an operation called broadcast each node then trains the model the same model because it started from the same initial weights on a subset of the data so maybe this one is maybe one GPU is a uh training it on the batch number one the second one is on the batch number two the third one on the batch number three etc etc such that there is no overlap between this data and every few batches that these nodes get train this model the gradients of each node are accumulated so summed up into one node and then sent the sum is sent back to all the other nodes this is Operation is known as all reduse then each node updates the parameters of its local mod model using the gradient received by the previous step so the the gradient received by the previous step is the sum of the gradients of all nodes using also its own Optimizer so doing Optimizer do step and then we start again from the step number two so we train it for a few batches we have some local gradient we send our local gradient to a central node which will sum it up sum it up and then send it back to all the nodes then all the nodes will update their parameters using this accumulated gradient and um and then we continue again um let's see all the step all these steps one by one in detail so step number one our models weight are initialized on one GPU so imagine we have a setup with four gpus we have one GPU that will initialize the weight and send it to all the others so now the value of the parameters this is the value of the parameter and this is the gradient so the value of the parameter imagine is 0.1 we only have one parameter so for Simplicity the initial weights are sent to all the other nodes so now all the nodes have the same weights so all the nodes now have the same uh parameters the all the same value the next thing we do is each node runs a forward and backward step on one or more batch of data this will result in a local gradient being calculated because as we saw before when we run L do backward we we have um a gradient being calculated of the loss function with respect to each parameter this local gradient of course may be different for each node because each one is training on a different subset of the data and um this local gradient may also be the accumulation of one or more batch so it's uh imagine that this GPU is training on each GPU is training on three batch so we can accumulate the gradient for Tre batch and of course because we are training on a different subset of the data this cumulative gradient but still local is different for each node the next thing we do is we send all this gradient to every node sends its gradient to one single node and this node will calculate the sum of all the gradient it receives so imagine all the nodes decide to send their own gradient to the first GPU and the sum is calculated on the first GPU then this operation is known as reduce and later we will see how it works then the uh GPU that calculated the sum of all the gradients will send it back to all the other nodes in a in an operation that is known as broadcast and the sequence of reduce and broadcast because here I show it as a separate operation usually is implemented as a single operation known as all reduse and later we will see in detail how it works but this is the key idea so we have the models weights that are initialized on a Model they are sent back to all the other models so now all the other GPU they uh they have the same weights initial weights they train it on a different subset of the data so each one has a local gradient they send each one its own local gradient to one node which will calculate the sum of all these gradients and send it back to all the other nodes so now they have all the same sum then they run optimization steps to update the weights using the same sum so they will produce the resulting weights of the models will be the same for all of them as you can see here so now each node will update the value of their own parameters using the same sum of the gradient you can see here so it's 0.3 they use 0.23 to update the value of their local parameters and they will end up with the same weights so each node updates the parameter of its local model using the gradient received after the updates the gradients are reseted to zero and then we can start with another loop now let's talk about Collective communication primitive so the operation that we T talk I show before in distributed computing environments a node may need to communicate with other nodes if the communication pattern is similar to client and a server then we talk about Point to-point communication because we have one client connects to one server in a request response chains of events however there are cases in which one node needs to communicate to multiple receivers at once for example in the broadcast scenario right we have one note that needs to send its uh weights to all the other nodes this is the typical case of data parallel training in deep learning so one node needs to send its initial weights to all the other nodes moreover all the other nodes need to send their gradient to one single node to receive the cumulative gradient back so Collective communication allows to model the communication pattern between group of nodes let's visualize the difference between these two modes of communication imagine you need to send a file to seven friends with Point to-point communication you would send the file iteratively to each of your friend one by one suppose the internet speed that you have is 1 Megabyte per second and the file is 5 megabyte in size what you do is first you send the imagine you are here so you have the file and you send it to your first friend that it will take 5 second because you are sending 5 megabyte with a speed of 1 Megabyte per second then you send it to the second friend the third the fourth the fifth the sixth and the seventh in total the time to send the file to seven friends will be 35 seconds of course you may say okay but why not send the file simultaneously to all the seven friends at the same time of course you can do it but the problem is your internet speed is still 1 Megabyte per second and the internet speed of sending the file simultaneously to seven friends will be split among these seven friends so each friend will be receiving the file at a speed of 143 KOB uh per second more or less so the total time is still 35 seconds to distribute your file to your seven friends let's see if there is a better way so uh with Collective communication we introduce the first operator is known as broadcast so the operation of sending a data to all the other no all all the other nodes is known as the broadcast operator Collective communication libraries like nccl which is pronounced nickel which is a library from Nvidia assigns a unique ID to each node and this no this unique ID is known as rank suppose you want to send 5 megabyte with an internet speed of 1 Megabyte per second and let's see how the collective communication would work in the case of this broadcast operation the first thing you do is you you are here so you have the file you send it not to this friend here but to the rank number four at the next step we realize that the rank zero and the rank four have the files so they can send simultaneously to another friend so rank zero now will send it to rank two and rank four will send it to rank six but now we realize that rank zero rank two rank four and rank six all have the file so they can simultaneously send to it another friend so in total it will take 15 seconds to distribute your initial file to all the seven friends compared to the 35 second with the point-to-point communication this approach is known as divide and conquer with Collective communication we exploit the interconnectivity between nodes to avoid idle times and reduce the total con communication time this is how also your GPU sends the initial weight to all the other nodes it's not one by one because that would be too slow not it's not even in parallel because anyway the connection speed is always the same it would be split among all the receivers what we do is we do this divide and conquer approach so our initial weights are sent using this broadcast operation in this manner here which is much faster let's see the reduce operation what is the reduce operation reduce operation means that uh we apply it during the training when we want to uh calculate the sum of the gradient of all the nodes so each node has a local gradient which was calculated using a subset of the data but they are all different from each other and what we want is to we want to calculate the sum of the all these gradients into one single node let's see how it works so the broadcast operator is used to send the initial weights to all the other nodes when we start the training loop at every few batches of data being processed by each node the gradients of all the node needs to be sent to one node and accumulated so summed up this operation is known as reduce let's see how it works so imagine initially each node has its own gradient because they are training on a subset of the data and this gradient is all different from each other what we can do is each node will send the gradient to his Adent node so and the Adent node will sum it up with the his own gradient so the node seven for example will send his gradient to Road number six and the no number six the receiver node will be responsible for calculating the sum and the same with the rank number five and four and three and two and one and zero the next step we have the sum here at no rank number four rank number two and rank number zero what we do is we send the rank the sum here from rank number six to rank number four which will calculate the sum of the sum and then also from rank number two to rank number zero the rank number zero will calculate the sum of the sum and then we do a final step in which we send the sum that was present at rank number four to the rank number zero and this sum here is the sum of all the gradients of all nodes and in total it took us only three steps to do it so with only three steps we accumulate the the gradient of all nodes into one single node and it can be proven that the communication time is logarithmic with respect to the number of nodes and this is very typical of all the divide and conquer approaches the all reduce operation so what what we saw before is that we are first broadcasting our data so our initial weights then we are reducing the local gradients and then we need to send it back the sequence of reduce and broadcast is known as all reduce and is usually implemented as a single operation which is faster than doing the sequence of the two operations I will not show the algorithm behind all reduce but you can think of it as logically as a sequence of reduce and broadcast operation but remember that in practice it's implemented as a single operation and it's much faster than doing the two operations alone now imagine you are training your model on multiple gpus and one node crashes so imagine you're training in a distributed scenario like the one shown below and suddenly one node is crashed and in this case two gpus out of four become unavailable how should the system react well one way would be to restart the entire cluster and that's easy however by restarting the cluster the training would restart also from zero because as you remember we start from initial weights that are randomly chosen by one node which are then sent to all the other nodes but this will mean that we would lose all the parameters and computation that we have done so far when this node crashed a better approach is to use checkpointing so checkpointing means that we save the weights of a model on a shared disk every few iteration for example every Epoch and then we resume the training from the last checkpoint in case there is a crash so remember the step in which we initialize the weights of the model in one node and then send it to all the others well instead of just initializing it randomly we just can initialize the weight of the model using the latest checkpoint available so that the training can continue from there we need a shared storage for um saving the checkpoints uh because it's pytorch that will decide which node will initialize the weight and um so every node should have access to this shared storage plus it is good rule in distributed system to not have one node that is more important than the others because any node can fail at any time so we should not make any assumption on which is the node that will initialize the weights it's pych will choose it and actually usually p p chooses the rank number zero but we should not make assumption on which node will be the rank number zero so all nodes should be uh treated equally so they should run the same code and they should have access to the same shared storage and who however when we have a shared storage who should be responsible for saving the check point because if we make a code in which everyone is writing the checkpoint they may end up overwriting each other so what we do is because pytorch will give us will tell us which uh what is the rank of the current node we will write the code in such a way that we check the rank of the current node and if it's the rank number zero then we save the checkpoint if it's rank number one two or three we don't do anything so it means that only one rank will be responsible for saving the checkpoint but later when we restart the training we don't make any assumption on who will become the rank number zero and as the pytorch um documentation says it says that the the rank is not stable means that if you restart the training it the rank number zero may be assigned to another node okay now that we have seen how uh distributed training works at the theoretical level so we accumulate the gradient that is calculated locally by multiple nodes and then send it to one single node which then calculates the sum and send it back to all the others which then update the parameter using this sum of the gradients it's time to actually look at the Practical uh training so we will first build the infrastructure that is needed to run our training and then I will also show you uh what are the code changes that we need to make to our existing training Loop to make it distributed and run it on the infrastructure that we are going to create I will use paper space as a cloud service mostly because it's easy to use and doesn't require much setup time even for beginners I chose it over AWS because AWS has a lot of other setup that you need to do in order to even do simple operations and it's easy to get lost so I'm using paper space mostly for this reason so that anyone without with any kind of level of knowledge can do the same can follow the tutorial very easily so let's get started okay the the first thing we need to do is to go to my repository called pytorch Transformer distributed in which you will find the code of the distributed model that we will be running on this cluster and also the instruction on how to create this cluster on paper space so I have already access my account on paperspace the first thing you will want to do is to create your SSH key just like you do on GitHub so that you can use this uh your public SSH key uh to um you need to associate your public SSH key to your account here so that you can connect to the machines that are created on paper space okay the first uh thing we need to do is to create a private Network on which we will connect all these two machines so we will create two machines that are connected in this private Network and also a shared the dis that can be accessed by both machines we will use a machine that has two gpus so in total we will have four gpus on which we will run the training so let's create our private Network networks here and and we create a new network remember to choose the same region for your computers and the cluster and we will call it um distributed training okay now we have our subnet you can see here the next step to do is to create two nodes of type of we can choose any node actually I I chose this one because I wanted to test two machine each one having two gpus and we use ml in a box as the operating system so as the image of these machines so we create a new machine ml in a box the machine we will do is multi GPU P 4,000 multiplied by two this one the region is the same as the network so New York 2 the dis size 50gb is enough in total I think last time I created this cluster and I ran it for many EPO I spent $5 so I think you should not spend more than $5 for uh running your own uh distributed training uh it should not be more expensive than $5 the first machine we will call it Cuda zero the network we need to select the network that we have created before and we choose public EP dynamic because otherwise we cannot connect to the machine if without a public IP then we never create a snapshot because we don't care about backup for now and the price of running each machine should be around $1 you can see here so we created the first machine and then we create also the second one so ml in a box the machine is this one New York dis size this one we will call it Cuda one distribut the training Dynamic we don't save and create machine last but not least we need to create a network drive of 250 GB we can create this is the smallest one they have available so that's why I choose 250 gigb we will call it model training 250 New York and distributed train this must be belong to the same network as the two machines okay create okay so they are provisioning and starting the two machines now we will configure the machines so we need to install some packages first of all we will need to install if config because we need to have the IP address and while installing if config I ran into a error with a seahorse and I show also how to solve this error we will then Mount the network drive and then we will clone this uh repository install all the packages that we need for running this code and then we initial I also recommend using WTS and biases for uh keep you tracking keeping track of the loss Etc of all the metrics that you need during the training of this model so I recommend you uh register on W and bies and then you install it and use it also for this uh for this code for running this code because it will make you make it easy for you to visualize the training uh metrics okay the ca zero is ready we can see some information here and connect this will give you as the IP address we connect to yes wonderful so now we are inside the machine the first thing we do is we update all the packages okay we do also in install the net tools but I remember it will run into an into an error but looks like this time it doesn't which is good let's try if confus wonderful so now we have the IP address of the first machine which is this one this is the private address that belongs to your subnet that you created here so uh the one you created here 810 okay now we can do the we we need to keep track of this IP address because later we need to modify the host and this is actually needed because I ran into an error with py torch when running the distributor training because it could not um um it could not reach the other node so I had to modify the host by mapping the host name of the other nodes into its IP uh okay I let's Mount the network drive following just the instructions I have written so we install this package let me call this one Coda zero okay then we created the directory in which we want to mount this network drive and then we have to run this command here you can see it here but we need to replace the IP address and the username and password of the drive so let's first paste it then we need to replace this one with the IP address and the network share name which I show you how to find uh we go here drives and we can see here the address so we do but we need to replace the Escape character with the slash slash okay and the network drive username is this one let's run this command the password is also here we can copy it it voila now it should be mounted the first thing we do is we clone our repository so we are in the um home directory of the default user so paper space we can clone it directly here no problem we then CD and then we install the requirements okay now we have installed the requirements so now we can log in into W and vises using this command here but remember to copy the key from the website of Wis and vises so which can be found here let's run and should be okay okay now we are ready to run the training command on the first computer but we need to also prepare the second one so let me prepare the second one of course in a real scenario you would create a Docker file and you would run everything using kubernetes so it should be done automatically but in this case because most of you maybe are not familiar with kubernetes or Docker or you just want to run very fast your training to see how it works so I recommend using paper space because it's very easy to configure all the setup just like before okay now we also clone here so we clone the same repository on both computers and then we run the same code for training but we will see later how to do it okay now we do login with w and bues and now we are ready to run the training on both computers okay the the command that you need to run to in order to run the training on both machine is the same but we first need to uh take the IP address of the computer that we will choose so one of the two computer will become the Rand Deo Master it means that all the communication will be managed by that node and all the others will adopt now of course to make it more uh fail um fail safe we need to create we can use for example a dynamic IP that can be mapped to another machine in case the master crashes so that we can restart the training using always the same IP but in this simple case I will conf configure it by hand otherwise I need an orchestration tool and that will complicate all the scenario so in this case I just want to show you how to do distributed training I don't want to spend too much time creating the perfect infrastructure which you would ideally do in a real scenario so we take this command here and as you can see in this command we need to tell the IP address of the master node here so how which one we will choose the master in my case I will choose Cuda zero so the first machine I have created and the other one will be kind of the slave even if they do both perform the same operation so we find the IP of this slave here which is this one and we put it in the host file of the master and also I need the host name of the slave which is this one perfect so okay okay so we need to paste here the IP address of this node and also its host name which is this one and that's pretty much it now we can start the training so we take the IP address of the master now so Cuda zero is our Master which is this one and we put it in the command in this position here and we CD to the torch and we can run the command so in this command what we are saying is the torch run is the special command that will create the cluster and will manage all the cluster creation and communication it will run our training Loop you can see here first of all we need to tell how many process we have um in this cluster so we have two computers and how many process we want to create for each node so how many Compu how many gpus we have for each computer so this is the number of gpus per computer and this is the number of computer so we have two and two this is a unique ID that indicates this particular class it should be unique for each cluster and this is the backhand library that is managing the communication for us all the parameters after the file name that we will use for training are the arguments that are passed to this file here so in M case we are running with a bet size of eight we are also telling him that the the model folder so where the checkpoint should be saved are this is the shared folder that we created before so the month file of the shared network drive we run the same command on both computers and this will start the training so we do it this one and also on the other one as you can see this computer here is not proceeding it's waiting for the other so now we run all to it here and now it should proceed yeah so now both are proceeding oops I forgot to set uh the h host file on this computer here so I retrieve the IP of this one so IP IP of this one and also the host name of this one and we put in the host file of the other computer so okay let's run again the [Music] training looks like it's proceeding so they are both doing yeah they are both building the data set now the tokenizer so if you have watched my previous video on how to code a Transformer model from zero this is exactly the same code except that we I have added some uh things to manage the distributed training but it's very minimal code change and later I will show you step by step how to how I done so as you can see now the training is running in parallel the first thing you will notice is that the wids and biases is only initialized on one node because we cannot send the metc from multiple nodes because it will interfere with each other so we send the Matrix only from one node and it's the node with rank number zero we will see later how we check this information as you can see they are both training on a subset of the data so this one is training on 910 batch and this one also 910 batch it means that in total we have 1,820 batches in total and uh each one is calculating a local gradient and sending it to the other who will calculate the sum of this gradient actually we have four nodes here because we have four GPU so each GPU is calculating a local gradient and um uh each um each gradient is sent to the rank number zero which will calculate the sum of all these gr gradients using the reduce operation actually the all reduce operation because then it will send back the sum to all the other nodes who will then update the parameters using the sum of these uh gradients that it has received another thing I made a mistake is that this 910 is not multiplied by two because it's not the the rank two is actually later we will see what is the difference between local Rank and Global rank but we have four nodes each node is working on 910 batches of data so I only show one because tqdm uh will other otherwise if I show the tqdm for both uh gpus it will interfere with each other this progress bar basically here so I only show one progress bar per computer not two because each computer computer has two gpus so actually I should have two progress bar but I only show one otherwise the the visualization is very bad so first of all let's navigate the code to understand how it works let me open the project okay let's see here let's start from the train file okay the main difference that we have compared to the original code that I built in my previous video but it it's the same this code will this changes that I'm showing here will apply to any training Loop that you have built so it's not only for this one this particular code it will apply to anyone okay the first thing we do is we read the these two variables that are um so when we run the code with the torch run torch run will insert some environment variables into our environment that we can read one is called Rank and one is called local rank let's see the difference okay um the local rank basically indicates the number of the GPU in the local computer while the global rank or also called just rank indicates the number unique ID of the GPU among all the cluster so if we have four gpus the rank will be unique among all the cluster while the local rank is not unique but it's Unique to the local computer the local rank is useful for example when we want to print for example only on one GPU per each computer while the global rank is useful when we want to only one uh GPU to perform an operation among all the others for example if we want to initialize weights and biases or any other service that should be initialized only from one node in all the cluster then we use the global rank which is this environment of variable here uh while if we want to use something for example for printing or only one local GPU should use the tqdm or the progress bar or any other stuff that is can interfere with each other on the local computer then we use the local rank so the first thing I do is I load this two environment variables and save it into the configuration the second thing I do okay I print the configuration the first thing we need to do to initialize the cluster and this is where the uh torch run will stop waiting for all the nodes to connect is to call this function here in it process group this in it process group belongs to a package that I imported here called torch. distributed so these are the Imports that we need to make in order to use distributed tr

Original Description

A complete tutorial on how to train a model on multiple GPUs or multiple servers. I first describe the difference between Data Parallelism and Model Parallelism. Later, I explain the concept of gradient accumulation (including all the maths behind it). Then, we get to the practical tutorial: first we create a cluster on Paperspace with two servers (each having two GPUs) and then training a model in a distributed manner on the cluster. We will explore collective communication primitives: Broadcast, Reduce and All-Reduce and the algorithm behind them. I also provide a template on how to integrate DistributedDataParallel in your existing training loop. In the last part of the video we review advanced topics, like bucketing and computation-communication overlap during backpropagation. Code: https://github.com/hkproj/pytorch-transformer-distributed PDF slides: https://github.com/hkproj/pytorch-transformer-distributed/blob/main/notes/Slides.pdf Chapters 00:00:00 - Introduction 00:02:43 - What is distributed training? 00:04:44 - Data Parallelism vs Model Parallelism 00:06:25 - Gradient accumulation 00:19:38 - Distributed Data Parallel 00:26:24 - Collective Communication Primitives 00:28:39 - Broadcast operator 00:30:28 - Reduce operator 00:32:39 - All-Reduce 00:33:20 - Failover 00:36:14 - Creating the cluster (Paperspace) 00:49:00 - Distributed Training with TorchRun 00:54:57 - LOCAL RANK vs GLOBAL RANK 00:56:05 - Code walkthrough 01:06:47 - No_Sync context 01:08:48 - Computation-Communication overlap 01:10:50 - Bucketing 01:12:11 - Conclusion

Watch on YouTube ↗ (saves to browser)

Sign in to unlock AI tutor explanation · ⚡30

This video tutorial covers the basics of distributed training with PyTorch, including data parallelism, model parallelism, and gradient accumulation. It provides a practical guide on how to train a model on multiple GPUs or multiple servers using PyTorch and cloud infrastructure.

Key Takeaways

Split the data into subsets and train on multiple computers in parallel
Break the model into smaller pieces and process each piece on a separate GPU
Accumulate gradients from multiple iterations of the training loop before updating the model parameters
Initialize model weights on one node and broadcast to all nodes
Train model on a subset of data on each node
Accumulate gradients every few batches and send to central node for all-reduce operation

💡 Distributed training with PyTorch can significantly speed up the training process by utilizing multiple GPUs or multiple servers, and gradient accumulation can help reduce the communication overhead.

🔒 Pro feature: Ask AI to explain this lesson →

More on: ML Maths Basics

View skill →

Important Steps I Have Followed To Improve My Data Science Skills- Sharing My Experience

Important Steps I Have Followed To Improve My Data Science Skills- Sharing My Experience

Learn Python FAST for Beginners 🚀#coding #conditionals #loops #functions

Learn Python FAST for Beginners 🚀#coding #conditionals #loops #functions

ChethanAIChronicles

“Hello, world” from scratch on a 6502 — Part 1

“Hello, world” from scratch on a 6502 — Part 1

PCA (Principal Component Analysis) in Python - Machine Learning From Scratch 11 - Python Tutorial

PCA (Principal Component Analysis) in Python - Machine Learning From Scratch 11 - Python Tutorial

ROC and AUC in R

ROC and AUC in R

StatQuest with Josh Starmer

Data Science Fundamentals: Data Cleaning in Python

Data Science Fundamentals: Data Cleaning in Python

Related Reads

How to build a faceless YouTube channel with AI in 30 days

Learn to build a faceless YouTube channel using AI in 30 days to generate passive income

10 small AI wins that quietly give you your afternoon back

Discover 10 small AI wins to automate dull daily tasks and reclaim your afternoon productivity

Stop Starting With AI Tools. Start With the Workflow.

Rethink your approach to AI adoption by focusing on workflow clarity over tool accumulation, to maximize efficiency and effectiveness

The Fax Is Not the Workflow. The Data Is.

Learn to convert faxed documents into structured JSON using Telnyx AI Inference and Python

Chapters (18)

Introduction

2:43 What is distributed training?

4:44 Data Parallelism vs Model Parallelism

6:25 Gradient accumulation

19:38 Distributed Data Parallel

26:24 Collective Communication Primitives

28:39 Broadcast operator

30:28 Reduce operator

32:39 All-Reduce

33:20 Failover

36:14 Creating the cluster (Paperspace)

49:00 Distributed Training with TorchRun

54:57 LOCAL RANK vs GLOBAL RANK

56:05 Code walkthrough

1:06:47 No_Sync context

1:08:48 Computation-Communication overlap

1:10:50 Bucketing

1:12:11 Conclusion

Make Stunning Tree Diagrams with ChatGPT In Seconds!