Distributed TensorFlow training (Google I/O '18)

TensorFlow · Beginner ·🧬 Deep Learning ·8y ago

Key Takeaways

This video shows how to distribute the training of machine learning models in TensorFlow. It covers data parallelism and compares the asynchronous parameter server approach with the synchronous all-reduce approach, on hardware such as CPUs, GPUs (for example the Tesla P100 and V100), and TPUs. It also covers how to use the Distribution Strategy API (MirroredStrategy) for faster training and how to build efficient input pipelines using the tf.data API.
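To make the ring all-reduce idea mentioned in the talk more concrete, here is a minimal single-process simulation in plain Python. It is a sketch under stated assumptions, not TensorFlow or NCCL code: the function name `ring_all_reduce`, the list-of-lists representation of per-worker gradients, and the two-phase (scatter-reduce, then all-gather) schedule are illustrative choices that the talk does not spell out.

```python
def ring_all_reduce(worker_grads):
    """Sum equal-length gradient lists across workers over a simulated ring.

    worker_grads[i] is the gradient vector computed by worker i (hypothetical
    representation). Returns one list per worker; all of them hold the sum.
    """
    n = len(worker_grads)
    length = len(worker_grads[0])
    # Split every gradient vector into n contiguous chunks.
    bounds = [(c * length // n, (c + 1) * length // n) for c in range(n)]
    data = [list(g) for g in worker_grads]  # private copy per worker

    # Phase 1 (scatter-reduce): in each round, worker i sends one chunk to
    # its successor, which adds it to its own copy. After n - 1 rounds,
    # worker i holds the complete sum for chunk (i + 1) % n.
    for step in range(n - 1):
        # Snapshot outgoing chunks so all sends in a round happen together.
        outgoing = []
        for i in range(n):
            lo, hi = bounds[(i - step) % n]
            outgoing.append(data[i][lo:hi])
        for i in range(n):
            dst = (i + 1) % n
            lo, hi = bounds[(i - step) % n]
            for k, value in enumerate(outgoing[i]):
                data[dst][lo + k] += value

    # Phase 2 (all-gather): completed chunks travel around the ring and
    # overwrite the partial sums held by the other workers.
    for step in range(n - 1):
        outgoing = []
        for i in range(n):
            lo, hi = bounds[(i + 1 - step) % n]
            outgoing.append(data[i][lo:hi])
        for i in range(n):
            dst = (i + 1) % n
            lo, hi = bounds[(i + 1 - step) % n]
            data[dst][lo:hi] = outgoing[i]
    return data


if __name__ == "__main__":
    grads = [[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400]]
    for worker_id, summed in enumerate(ring_all_reduce(grads)):
        print(worker_id, summed)
```

In every round each worker sends one chunk and receives one chunk, which is the sense in which the talk says both the upload and the download bandwidth of every worker are used.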

Full Transcript

[Music]

My name is Priya. And I'm Anjali. We're both software engineers on the TensorFlow team, working on distributed TensorFlow. We're so excited to be here today to tell you about distributed TensorFlow training. Let me grab the clicker. Okay.

Hopefully most of you know what TensorFlow is. It's an open-source machine learning framework used extensively both inside and outside Google. For example, if you tried the Smart Compose feature that was launched a couple of days ago, that feature uses TensorFlow. TensorFlow allows you to build, train, and predict using neural networks such as this. In training, we learn the parameters of the network using data. Training complex neural networks with large amounts of data can often take a long time. In the graph here, you can see the training time on the x-axis and the accuracy of predictions on the y-axis. This is taken from training an image recognition model on a single GPU. As you can see, it took more than 80 hours to get to 75% accuracy. If you have some experience running complex machine learning models, this might sound rather familiar to you, and it might make you feel something like this.

If your training takes only a few minutes to a few hours, you'll be productive and happy, and you can try out new ideas faster. When it starts to take a few days, maybe you could still manage and run a few things in parallel. When it starts to take a few weeks, your progress will slow down, and it becomes expensive to try out every new idea. And when it starts to take more than a month, I think it's not even worth thinking about. And this is not an exaggeration: training complex models such as the ResNet-50 that we'll talk about later in the talk can take up to a week on a single but powerful GPU like a Tesla P100.

So a natural question to ask is: how can we make training fast? There are a number of things you can try. You can use a faster accelerator such as a TPU, or Tensor Processing Unit. I'm sure you've heard all about them in the last couple of days
here. Your input pipeline might be the bottleneck, so you can work on that and make it faster. There are a number of guidelines on the TensorFlow website that you can try to improve the performance of your training. In this talk, we'll focus on distributed training, that is, running training in parallel on multiple devices such as CPUs, GPUs, or TPUs in order to make your training faster. With the techniques that we'll talk about in this talk, you can bring down your training time from weeks to hours with just a few lines of code and a few powerful GPUs. In the graph here, you can see the images per second processed while training an image recognition model. As you can see, as we increase the number of GPUs from one to four to eight, the images per second processed can almost double every time. We'll come back to these performance numbers later with more details.

So before diving into the details of how you can get that kind of scaling in TensorFlow, first I want to cover a few high-level concepts and architectures in distributed training. This will give us a strong foundation with which to understand the various solutions. As our focus is on training today, let's take a look at what a typical training loop looks like. Let's say you have a simple model like this with a couple of hidden layers. Each layer has a bunch of weights and biases, also called the model parameters or trainable variables. A training step begins with some processing on the input data. We then feed this input into the model and compute the predictions in the forward pass. We then compare the predictions with the input label to compute the loss. Then, in the backward pass, we compute the gradients, and finally we update the model's parameters using these gradients. This whole process is known as one training step, and the training loop repeats this training step until you reach the desired accuracy.

Let's say you begin your training with a simple machine under your desk with a multi-core CPU. Luckily, TensorFlow
handles scaling onto a multi-core CPU for you automatically. Next, you may speed up by adding an accelerator to your machine, such as a GPU or a TPU. With distributed training, you can go even further. You can go from one machine with a single device, to one machine with multiple devices, and finally to multiple machines with possibly multiple devices each, connected over the network. With a number of techniques, eventually it's possible to scale to hundreds of devices, and that's indeed what we do in a lot of Google systems. By the way, in the rest of this talk we'll use the terms device, worker, or accelerator to refer to processing units such as GPUs or TPUs.

So how does distributed training work? Like everything else in software engineering, there are a number of ways to go about it when you think about distributing your training. What approach you pick depends on the size of your model, the amount of training data you have, and the available devices. The most common architecture in distributed training is what is known as data parallelism. In data parallelism, we run the same model and computation on each worker, but with a different slice of the input data. Each device computes the loss and the gradients. We use these gradients to update the model's parameters, and the updated model is then used in the next round of computation. There are two common approaches when you think about how to update the model using these gradients.

The first approach is what is known as the asynchronous parameter server approach. In this approach, we designate some devices as parameter servers, as shown in blue here. These servers hold the parameters of your model. Others are designated as workers, as shown in green here. Workers do the bulk of the computation. Each worker fetches the parameters from the parameter server. It then computes the loss and gradients. It sends the gradients back to the parameter server, which then updates the model's parameters using these gradients. Each worker does this
independently, so this allows us to scale this approach to a large number of workers. This has worked well for many models in Google, where training workers might be preempted by high-priority production jobs, or where there is asymmetry between the different workers, or where machines might go down for regular maintenance. And all of this doesn't hurt the scaling, because the workers are not waiting on each other. The downside of this approach, however, is that workers can get out of sync. They're computing their gradients on stale parameter values, and this can delay convergence.

The second approach is what is known as synchronous all-reduce. This approach has become more common with the rise of fast accelerators such as TPUs or GPUs. In this approach, each worker has a copy of the parameters on its own. There are no special parameter servers. Each worker computes the loss and gradients based on a subset of training samples. Once the gradients are computed, the workers communicate among themselves to propagate the gradients and update their model parameters. All the workers are synchronized, which means that the next round of computation doesn't begin until each worker has received the updated gradients and updated its model. When you have fast devices in a controlled environment, the variance of step time between the different workers can be small. When combined with strong communication links between the different devices, the overall overhead of synchronization can be small. So whenever practical, this approach can lead to faster convergence.

A class of algorithms called all-reduce can be used to efficiently combine the gradients across the different workers. All-reduce aggregates the values from different workers, for example by adding them up, and then copying them to the different workers. It's a fused algorithm that can be very efficient, and it can reduce the overhead of synchronization of gradients by a lot. There are many all-reduce algorithms available, depending on the type of communication available
between the different workers. One common algorithm is what is known as ring all-reduce. In ring all-reduce, each worker sends its gradients to its successor on the ring and receives gradients from its predecessor. There are a few more such rounds of gradient exchanges. I won't be going into the details here, but at the end of the algorithm each worker has received a copy of the combined gradients. Ring all-reduce uses network bandwidth optimally, because it uses both the upload and the download bandwidth at each worker. It can also overlap the gradient computation at lower layers in the network with transmission of gradients at the higher layers, which means it can further reduce the training time. Ring all-reduce is just one approach, and some hardware vendors supply specialized implementations of all-reduce for their hardware, for example NVIDIA's NCCL. We have a team in Google working on fast implementations of all-reduce for various device topologies. The bottom line is that all-reduce can be fast when working with multiple devices on a single machine or a small number of machines.

So given these two broad architectures in data parallelism, you may be wondering which approach you should pick. There isn't one right answer. The parameter server approach is preferable if you have a large number of not-so-powerful or not-so-reliable machines, for example if you have a large cluster of machines with just CPUs. The synchronous all-reduce approach, on the other hand, is preferable if you have fast devices with strong communication links, such as TPUs or multiple GPUs on a single machine. The parameter server approach has been around for a while, and it has been supported well in TensorFlow. TPUs, on the other hand, use the all-reduce approach out of the box. In the next section of this talk, we'll show you how you can scale your training using the all-reduce approach on multiple GPUs with just a few lines of code.

Before I get into that, I just want to mention another type of
distributed training, known as model parallelism, that you may have heard of. A simple way to think about model parallelism is when your model is so big that it doesn't fit in the memory of one device. So you divide the model into smaller parts, and you can do those computations on different workers with the same training samples. For example, you could put different layers of your model on different devices. These days, however, most devices have big enough memory that most models can fit in their memory, so in the rest of this talk we'll continue to focus on data parallelism.

Now that you're armed with the fundamentals of distributed training architectures, let's see how you can do this in TensorFlow. As I already mentioned, we're going to focus on scaling to multiple GPUs with the all-reduce architecture. In order to do so easily, I'm pleased to introduce a new Distribution Strategy API. This API allows you to distribute your training in TensorFlow with very little modification to your code. With the Distribution Strategy API, you no longer need to place your ops or parameters on specific devices. You don't need to worry about structuring your model in a way that the gradients or losses across devices are aggregated correctly. Distribution Strategy does that for you. It is easy to use and fast to train.

Now let's look at some code to see how we can use this API. In our example, we're going to be using TensorFlow's high-level API called Estimator. If you've used this API before, you might be familiar with the following snippet of code to create a custom Estimator. It requires three arguments. The first one is a function that defines your model, so it defines the parameters of your model, how you compute the loss and the gradients, and how you update the model's parameters. The second argument is the directory where you want to persist the state of your model. And the third argument is a configuration called RunConfig, where you can specify things like how often
you want to checkpoint, how often summaries should be saved, and so on. In this case, we've used the default RunConfig. Once you create the Estimator, you can start your training by calling the train method with the input function that provides your training data.

So given this code to do the training on one device, how can you change it to run on multiple GPUs? You simply need to add one line of code: instantiate something called MirroredStrategy and pass it to the RunConfig call. That's it. That's all the code changes you need to scale this code to multiple GPUs. MirroredStrategy is a type of the Distribution Strategy API that I just mentioned. With this API, you don't need to make any changes to your model function or your input function or your training loop. You don't even need to specify your devices. If you want to run on all available devices, it will automatically detect that and run your training on all available GPUs. So that's it. Those are all the code changes you need. This API is available in tf.contrib, and you can try it out today.

Let me quickly talk about what MirroredStrategy does. MirroredStrategy implements the synchronous all-reduce architecture that we talked about, out of the box for you. In MirroredStrategy, the model's parameters are mirrored across the various devices, hence the name MirroredStrategy. Each device computes the loss and gradients based on a subset of the input data. The gradients are then aggregated across the workers using an all-reduce algorithm that is appropriate for your device topology. As I already mentioned, with MirroredStrategy you don't need to make any changes to your model or your training loop. This is because we've changed the underlying components of TensorFlow to be distribution-aware, for example the optimizer, batch norm, summaries, etc. You don't need to make any changes to your input function either, as long as you're using the recommended TensorFlow Dataset API. Saving and checkpointing work seamlessly, so you can save with one
or no distribution strategy and resume with another. And summaries work as expected as well, so you can continue to visualize your training in TensorBoard. MirroredStrategy is just one type of distribution strategy, and we're working on a few others for a variety of use cases. I'll now hand it off to Anjali to show you some cool demos and performance numbers.

Thanks, Priya, for the great introduction to MirroredStrategy. Before we run the demo, let us get familiar with a few configurations. I'm going to be running the ResNet-50 model from the TensorFlow Model Garden. ResNet-50 is an image classification model that has 50 layers. It uses skip connections for efficient gradient flow. The TensorFlow Model Garden is a repo with a collection of different models. They're written in TensorFlow's high-level APIs, so if you are new to TensorFlow, this is a great resource to start with. I'm going to be using the ImageNet dataset as input to model training. The ImageNet dataset is a collection of over a million images that have been categorized into a thousand labels. I'm going to instantiate an n1-standard instance on GCE and attach eight NVIDIA Tesla V100, or Volta, GPUs.

Let's run the demo now. As I mentioned, I'm creating an n1-standard instance and attaching eight NVIDIA Tesla V100, or Volta, GPUs. I also attach an SSD disk. This contains the ImageNet dataset, which is the input to our model training. To run a TensorFlow model, we need to install a few drivers and pip packages, and here is a gist with all the commands required. I'm going to make this gist public so you can set up an instance yourself and try running the model. Let's open an SSH connection to the instance by clicking on a button here. This should bring up a terminal like this.

So I've already cloned the Model Garden repo. We're going to be running this command inside the resnet directory. We're going to run the imagenet_main file. So we're using the ImageNet dataset, with a batch size of 1,024, or 128 per GPU. The model directory is going to point
to the GCS bucket that's going to hold the checkpoints and summaries that we want to save. We point our data directory to the SSD disk, which has the ImageNet dataset, and the number of GPUs is 8, over which we want to distribute or train our model. So let's run this model now.

And as the model is starting to train, let's take a look at some of the code changes that are involved to change the ResNet model function. So this is the ResNet main function in the Model Garden repo. First, we instantiate the MirroredStrategy object. Then we pass it to the RunConfig as part of the train_distribute argument. We create an Estimator object with the RunConfig, and then we call train on this Estimator object. And that's it. Those are all the code changes you need to distribute the ResNet model.

Let's go back and see how our training is going. So we've run it for a few hundred steps. At the bottom of the screen you should see a few metrics: the loss, which is decreasing, steps per second, and the learning rate. Let's look at TensorBoard. So this is from a run where I've run the model for 90,000 steps, a little over that, so it's not the run we just started. The orange and red lines are the training and evaluation losses. As the number of steps increases, you see the loss decreasing. Let's look at evaluation accuracy, and this is when we're training ResNet-50 over 8 GPUs. We see that around 91,000 steps we were able to achieve 75 percent accuracy.

Let's see what this looks like when we run it on a single GPU. So let's toggle the TensorBoard buttons on the left and look at the train and evaluation loss curves when we train our model on one GPU. The blue lines are one GPU, and red and orange are eight, and you can see that the loss doesn't decrease as rapidly as it does with eight GPUs. Here are the evaluation accuracy curves. We're able to achieve a higher accuracy when we distribute our model across eight GPUs as opposed to one. Let's compare using wall time. So we've run the same model
for the same amount of time, and when we run it over multiple GPUs we were able to achieve higher accuracy faster, or train our model faster.

Let's look at a few performance benchmarks on the DGX-1. The DGX-1 is a machine on which we run deep learning models. We're running mixed-precision training with a per-GPU batch size of 256. It also has 8 Volta, or V100, GPUs. The graph shows the number of GPUs on the x-axis and images per second on the y-axis. As we go from one GPU to eight, we are able to achieve a speed-up of 7x, and this is performance right out of the box with no tuning. We're actively working on improving performance so that you are able to achieve more speed-up and get more images per second when you distribute your model across multiple GPUs.

So far we've been talking about the core part of model training and distributing your model using MirroredStrategy. Okay, so let's say now you have deployed your model on multiple GPUs. You're going to expect to see the same kind of boost in images per second when you do that. But you may not see as many images per second as you expect compared to one GPU. You may not see the boost in performance, and the reason for that is often the input pipeline. When you run your model on a single GPU, the input pipeline is preprocessing the data and making the data available on the GPU for training. But GPUs or TPUs, as you know, process and compute data much faster than a CPU. This means that when you distribute your model across multiple GPUs, the input pipeline is often not able to keep up with the training. It quickly becomes a bottleneck.

For the rest of the talk, I'm going to show you how TensorFlow makes it easy for you to use tf.data APIs to build efficient and performant input pipelines. Here's a simple input pipeline for ResNet-50. We're going to use tf.data APIs because datasets are awesome. They help us build complex pipelines using simple reusable pieces. When you have lots of data and
different data formats, and you want to perform complex transformations on this data, you want to be using tf.data APIs to build your input pipeline. First, we are going to use the list_files API to get the list of input files that contain your images and labels. Then we are going to read these files using the TFRecordDataset reader. We're going to shuffle the records, repeat them a few times depending on if you want to run your model for a couple of epochs, and finally apply a map transformation. This processes each record and applies transformations such as cropping, flipping, and image decoding. And finally, we batch the input into a batch size that you desire.

The input pipeline can be thought of as an ETL process, which is an extract, transform, and load process. In the extract phase, we are reading from persistent storage, which can be local or remote. In the transform phase, we are applying the different transformations like shuffle, repeat, map, and batch. And finally, in the load phase, we are providing this processed data to the accelerator for training. So how does this apply to the example that we just saw? In the extract phase, we list the files and read them using the TFRecordDataset reader. In the transform phase, we apply the shuffle, repeat, map, and batch transformations. And finally, in the load phase, we tell TensorFlow how to grab the data from the dataset.

This is what our input pipeline looks like. We have the extract, transform, and load phases happening sequentially, followed by the training on the accelerator. This means that when the CPU is busy preprocessing the data, the accelerator is idle, and when the accelerator is training your model, the CPU is idle. But the different phases of the ETL process use different hardware resources. For example, the extract step uses the persistent storage, the transform step uses the different cores of the CPU, and finally the training happens on the accelerator. So if we can parallelize these different phases, then we can
overlap the preprocessing of data on the CPU with training of the model on the GPU. This is called pipelining. So we can use pipelining and some parallelization techniques to build more efficient input pipelines. Let us look at a few of these techniques.

First, you can parallelize file reading. Let's say you have a lot of data that's sharded across a cloud storage service. You want to read multiple files in parallel, and you can do this using the num_parallel_reads argument when you call the TFRecordDataset API. This allows you to increase your effective throughput. We can also parallelize the map function for transformations. You can parallelize the different transformations of the map function by using the num_parallel_calls argument. Typically, the argument we provide is the number of cores of the CPU. And finally, you want to call prefetch at the end of your input pipeline. Prefetch decouples the time the data is produced from the time it is consumed. This means that you can buffer data for the next training step while the accelerator is still training the current step.

This is what we had before, and this is what we can get as an improvement. Here, the different phases of the input pipeline are happening in parallel with training. We are able to see that the CPU is preprocessing data for training step 2 while the accelerator is still training step 1. Neither the CPU nor the accelerator is idle for long periods of time. The training time is now the maximum of preprocessing and training on the accelerator.

As you can see, the accelerator is still not 100% utilized. There are a few advanced techniques that we can add to our input pipeline to improve this. We can use fused transformation ops for some of these API calls. Shuffle and repeat, for example, can be replaced by its equivalent fused op. This parallelizes buffering elements for epoch n+1 while producing elements for epoch n. We can also replace map and batch with its equivalent fused op. This
parallelizes the map transformation with adding the input tensors to the batch. With these techniques, we are able to process data much faster and make it available to the accelerator for training, and improve the training speed. I hope this gives you a good idea of how you can use tf.data APIs to build efficient and performant input pipelines when you train your model.

So far we've been talking about training on a single machine with multiple devices. But what if you wanted to train on multiple machines? You can use the Estimator's train_and_evaluate API. The train_and_evaluate API uses the async parameter server approach. This API is used widely within Google, and it scales well to a large number of machines. Here's a link to the API where you can learn more on how to use it.

We're also excited to be working on a number of new distribution strategies. We're working on a multi-machine MirroredStrategy, which allows you to distribute your model across many machines with many devices. We're also working on adding Distribution Strategy support to TPUs and directly in tf.keras.

In this talk, we've talked a lot about the different concepts related to distributed training, architectures, and APIs. But when you go home today, here are three things for you to keep in mind when you train your model. Distribute your training to make it faster. To do this, you want to use Distribution Strategy APIs. They're easy to use and they're fast. Input pipeline performance is important. Use tf.data APIs to build efficient input pipelines.

Here are a few TensorFlow resources. First, we have the Distribution Strategy API. You can try using MirroredStrategy to train your model across multiple GPUs. Here's a link to the ResNet-50 Model Garden example, so you can try running this example. It has MirroredStrategy API support enabled. Here's a link also to the input pipeline performance guide, which has more techniques that you can use to build efficient input pipelines. And here's the link to the gist that I
mentioned in the demo. You can try setting up your own instance and running the ResNet-50 Model Garden example. Thank you for attending our talk, and we hope you had a great I/O.

[Music]
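The statement in the talk that, with prefetching, the time per step becomes the maximum of preprocessing and training time can be checked with a small timeline model. What follows is a framework-free sketch in plain Python, not tf.data code. The function names, the constant per-step durations, and the one-batch prefetch buffer are assumptions made for illustration.

```python
def sequential_time(prep, train, steps):
    """Total time when every step preprocesses, then trains, with no overlap."""
    return steps * (prep + train)


def pipelined_time(prep, train, steps):
    """Total time with a one-batch prefetch buffer (illustrative model).

    The CPU prepares the next batch while the accelerator trains on the
    current one, so the two only wait for whichever is slower.
    """
    ready = prep        # time at which the first batch is available
    accel_free = 0      # time at which the accelerator finishes its step
    for _ in range(steps):
        # A step starts once its batch is ready and the accelerator is free.
        start = max(ready, accel_free)
        accel_free = start + train
        # The CPU starts on the next batch as soon as this one is consumed.
        ready = start + prep
    return accel_free


if __name__ == "__main__":
    # Hypothetical durations in arbitrary time units.
    print("sequential:", sequential_time(prep=2, train=3, steps=3))
    print("pipelined: ", pipelined_time(prep=2, train=3, steps=3))
```

With 2 units of preprocessing, 3 units of training, and 3 steps, the sequential schedule takes 15 units and the pipelined one takes 11: one initial preprocessing pass, then each step costs max(2, 3) = 3 units. If preprocessing is the slower side, the accelerator still idles, which is where the parallel reads and parallel map calls described in the talk come in.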

Original Description

To efficiently train machine learning models, you will often need to scale your training to multiple GPUs, or even multiple machines. TensorFlow now offers rich functionality to achieve this with just a few lines of code. Join this session to learn how to set this up.

Rate this session by signing in on the I/O website here → https://goo.gl/sBZMEm

Distribution Strategy API: https://goo.gl/F9vXqQ https://goo.gl/Zq2xvJ
ResNet50 Model Garden example with MirroredStrategy API: https://goo.gl/3UWhj8
Performance Guides: https://goo.gl/doqGE7 https://goo.gl/NCnrCn
Commands to set up a GCE instance and run distributed training: https://goo.gl/xzwN4C
Multi-machine distributed training with train_and_evaluate: https://goo.gl/kyikAC

Watch more TensorFlow sessions from I/O '18 here → https://goo.gl/GaAnBR
See all the sessions from Google I/O '18 here → https://goo.gl/q1Tr8x
Subscribe to the TensorFlow channel → https://goo.gl/ht3WGe

#io18
Event: Google I/O 2018. Product: TensorFlow - General. Speakers: Priya Gupta, Anjali Sridhar.

Playlist

Uploads from TensorFlow · TensorFlow · 44 of 60

1 The TensorFlow YouTube Channel is Here!
2 Answering Your TF Questions #AskTensorFlow
3 Chatting With the TensorFlow Community (TensorFlow Meets)
4 All About TensorFlow Code (Coding TensorFlow)
5 TensorFlow: an ML platform for solving impactful and challenging problems
6 Keynote (TensorFlow Dev Summit 2018)
7 tf.data: Fast, flexible, and easy-to-use input pipelines (TensorFlow Dev Summit 2018)
8 Eager Execution (TensorFlow Dev Summit 2018)
9 Machine Learning in JavaScript (TensorFlow Dev Summit 2018)
10 Training Performance: A user’s guide to converge faster (TensorFlow Dev Summit 2018)
11 The Practitioner's Guide with TF High Level APIs (TensorFlow Dev Summit 2018)
12 Distributed TensorFlow (TensorFlow Dev Summit 2018)
13 Debugging TensorFlow with TensorBoard plugins (TensorFlow Dev Summit 2018)
14 TensorFlow Lite (TensorFlow Dev Summit 2018)
15 Searching Over Ideas (TensorFlow Dev Summit 2018)
16 Reconstructing Fusion Plasmas (TensorFlow Dev Summit 2018)
17 Nucleus: TensorFlow toolkit for Genomics (TensorFlow Dev Summit 2018)
18 Open Source Collaboration (TensorFlow Dev Summit 2018)
19 Swift for TensorFlow - TFiwS (TensorFlow Dev Summit 2018)
20 TensorFlow Hub (TensorFlow Dev Summit 2018)
21 Applied AI at The Coca-Cola Company (TensorFlow Dev Summit 2018)
22 Real-World Robot Learning (TensorFlow Dev Summit 2018)
23 TensorFlow Extended (TFX) (TensorFlow Dev Summit 2018)
24 Project Magenta (TensorFlow Dev Summit 2018)
25 TensorFlow Dev Summit 2018 - Livestream
26 Introducing TensorFlow Lite (Coding TensorFlow)
27 TensorFlow Dev Summit 2018 Highlights
28 Jeff Dean, Head of AI at Google discusses the impact of ML (TensorFlow Meets)
29 TensorFlow Mobile vs. TF Lite and More! #AskTensorFlow
30 Using TensorFlow to enable research & production across many fields (TensorFlow Meets)
31 Teaching TensorFlow for Deep Learning at Stanford University (TensorFlow Meets)
32 TensorFlow Lite for Android (Coding TensorFlow)
33 Using the tf.data API to build input pipelines (TensorFlow Meets)
34 Training Models in the Cloud & the Benefits of AI Toolkits #AskTensorFlow
35 Execute operations immediately with TensorFlow's Eager Execution (TensorFlow Meets)
36 TensorFlow Lite for iOS (Coding TensorFlow)
37 Get started with TensorFlow's High-Level APIs (Google I/O '18)
38 TensorFlow for JavaScript (Google I/O '18)
39 TensorFlow in production: TF Extended, TF Hub, and TF Serving (Google I/O '18)
40 Get started with TensorFlow's High-Level APIs in 5 mins | Google I/O 2018
41 TensorFlow and deep reinforcement learning, without a PhD (Google I/O '18)
42 TensorFlow Lite for mobile developers (Google I/O '18)
43 Advances in machine learning and TensorFlow (Google I/O '18)
44 Distributed TensorFlow training (Google I/O '18)
45 Classification using neural networks & ML regression models #AskTensorFlow
46 TensorFlow and Keras in R - Josh Gordon meets with J.J. Allaire (TensorFlow Meets)
TensorFlow and Keras in R - Josh Gordon meets with J.J. Allaire (TensorFlow Meets)
TensorFlow
47 Focus on your experiment with TensorFlow Estimators (TensorFlow Meets)
Focus on your experiment with TensorFlow Estimators (TensorFlow Meets)
TensorFlow
48 How to get started with AI/ML, retraining models, & more! #AskTensorFlow
How to get started with AI/ML, retraining models, & more! #AskTensorFlow
TensorFlow
49 TensorFlow - the deep learning solution for mobile platforms (TensorFlow Meets)
TensorFlow - the deep learning solution for mobile platforms (TensorFlow Meets)
TensorFlow
50 MiniGo: TensorFlow Meets Andrew Jackson (TensorFlow Meets)
MiniGo: TensorFlow Meets Andrew Jackson (TensorFlow Meets)
TensorFlow
51 The growth of TensorFlow with added support for JS & Swift (TensorFlow Meets)
The growth of TensorFlow with added support for JS & Swift (TensorFlow Meets)
TensorFlow
52 At the intersection of TensorFlow & nuclear physics (TensorFlow Meets)
At the intersection of TensorFlow & nuclear physics (TensorFlow Meets)
TensorFlow
53 NVidia TensorRT: high-performance deep learning inference accelerator (TensorFlow Meets)
NVidia TensorRT: high-performance deep learning inference accelerator (TensorFlow Meets)
TensorFlow
54 Try TensorFlow.js in your browser (Coding TensorFlow)
Try TensorFlow.js in your browser (Coding TensorFlow)
TensorFlow
55 TensorFlow Hub: reusing machine learning modules (TensorFlow Meets)
TensorFlow Hub: reusing machine learning modules (TensorFlow Meets)
TensorFlow
56 How to use TensorFlow in PyCharm (TensorFlow Tip of the Week)
How to use TensorFlow in PyCharm (TensorFlow Tip of the Week)
TensorFlow
57 Training models faster with TensorFlow Hub (TensorFlow Meets)
Training models faster with TensorFlow Hub (TensorFlow Meets)
TensorFlow
58 Prepare your dataset for machine learning (Coding TensorFlow)
Prepare your dataset for machine learning (Coding TensorFlow)
TensorFlow
59 Using ML to predict insulin use for Type 1 Diabetes (TensorFlow Meets)
Using ML to predict insulin use for Type 1 Diabetes (TensorFlow Meets)
TensorFlow
60 TFX: an end-to-end machine learning platform for TensorFlow (TensorFlow Meets)
TFX: an end-to-end machine learning platform for TensorFlow (TensorFlow Meets)
TensorFlow

This video teaches how to use TensorFlow for distributed training of machine learning models, including data parallelism and the synchronous all-reduce approach. It covers how to use the Distribution Strategy API for faster training and how to build efficient input pipelines with the tf.data API. By the end of this video, viewers will be able to train machine learning models across multiple devices and shorten training time.
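
The idea behind synchronous all-reduce data parallelism can be illustrated without TensorFlow at all. The sketch below is a conceptual toy, not the Distribution Strategy implementation: plain Python floats stand in for model variables, lists stand in for devices, and the helper names (`local_gradient`, `all_reduce_mean`, `train_step`) are hypothetical. Each replica holds a mirrored copy of one weight, computes a gradient on its own shard of the data, and applies the same averaged gradient, so the copies stay in sync.

```python
# Conceptual sketch only: Python lists stand in for devices and tensors.

def local_gradient(weight, shard):
    """Gradient of mean squared error for y = weight * x on one replica's shard."""
    return sum(2 * (weight * x - y) * x for x, y in shard) / len(shard)


def all_reduce_mean(values):
    """Hypothetical all-reduce: every replica receives the same averaged value."""
    mean = sum(values) / len(values)
    return [mean] * len(values)


def train_step(weights, shards, learning_rate):
    """One synchronous step: compute per-replica gradients, average, then update."""
    grads = [local_gradient(w, shard) for w, shard in zip(weights, shards)]
    reduced = all_reduce_mean(grads)
    return [w - learning_rate * g for w, g in zip(weights, reduced)]


# Toy data for y = 3x, split into two equal shards (one per replica).
data = [(x, 3.0 * x) for x in (1.0, 2.0, 3.0, 4.0)]
shards = [data[:2], data[2:]]
weights = [0.0, 0.0]  # mirrored copies of the single model parameter

for _ in range(200):
    weights = train_step(weights, shards, learning_rate=0.01)
```

Because every replica applies the identical averaged gradient, the mirrored weights never diverge, which is the property the synchronous approach relies on.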

Key Takeaways
  1. Create a custom Estimator with three arguments: the model function, a directory for model state, and a run config
  2. Start training by calling the train method with an input function
  3. Add one line of code to instantiate MirroredStrategy and pass it to the run config
  4. Use the tf.data API to build efficient input pipelines
  5. Apply shuffle, repeat, map, and batch transformations in the transform phase
  6. Provide the processed data to the accelerator for training in the load phase
  7. Parallelize file reading using non-blocking reads
  8. Parallelize the map function for transformations using the num_parallel_calls argument
  9. Use prefetch to decouple data production from consumption
  10. Use fused transformation ops to improve performance
💡 Distributed training can significantly speed up the training of machine learning models, and TensorFlow provides tools and APIs to support it, including the Distribution Strategy API and the tf.data API.
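
To make the input-pipeline takeaways more concrete, here is a standard-library Python sketch of what a parallel map, batching, and prefetching do conceptually. It is not tf.data code and does not reflect how tf.data is implemented; the helpers `parallel_map`, `batch`, `prefetch`, and `preprocess` are hypothetical stand-ins for the roles played by `Dataset.map(..., num_parallel_calls=...)`, `Dataset.batch`, and `Dataset.prefetch`.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor


def parallel_map(fn, items, num_parallel_calls):
    """Apply fn to items on a thread pool, preserving input order."""
    with ThreadPoolExecutor(max_workers=num_parallel_calls) as pool:
        yield from pool.map(fn, items)


def batch(items, batch_size):
    """Group consecutive items into lists of at most batch_size elements."""
    current = []
    for item in items:
        current.append(item)
        if len(current) == batch_size:
            yield current
            current = []
    if current:  # emit the final, possibly smaller, batch
        yield current


def prefetch(items, buffer_size):
    """Produce items on a background thread so the consumer rarely waits."""
    buffer = queue.Queue(maxsize=buffer_size)
    done = object()  # sentinel marking the end of the stream

    def producer():
        for item in items:
            buffer.put(item)  # blocks while the buffer is full
        buffer.put(done)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buffer.get()
        if item is done:
            return
        yield item


def preprocess(x):
    """Hypothetical per-record transformation."""
    return x * 2


records = range(10)
mapped = parallel_map(preprocess, records, num_parallel_calls=4)
pipeline = prefetch(batch(mapped, batch_size=4), buffer_size=1)
batches = list(pipeline)
```

The bounded queue in `prefetch` is what decouples the two sides: the producer can prepare the next batch while the consumer (the training step, in the real setting) is still busy with the current one.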
