Scaling TensorFlow data processing with tf.data (TF Dev Summit '20)


Key Takeaways

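The talk covers two tf.data features for relieving input pre-processing bottlenecks: snapshot, which materializes pre-processed data to disk so it can be reused across runs, and the tf.data service, which distributes pre-processing across a cluster of workers.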

Full Transcript

[Music]

Hi all, I'm Rohan, and I'm here to talk to you about how you can scale up your input data processing with tf.data.

Let's start with a high-level view of your ML training job. Typically, your ML training step will have two phases to it. The first is data pre-processing, where you look at the input files and do all kinds of transformations on them to make them ready for the next phase, which is model computation. Data pre-processing happens on the CPU, and you might be doing things such as cropping images, sampling videos, and whatnot. So if your training speed is slow, you could have a bottleneck in either one of these two places. I hope that the talk on profiling gives you an indication of how to figure out which of the two phases is slowing you down. I'm here to talk to you about the first kind of bottleneck, which is data pre-processing.

So let's look into what this bottleneck really is. In the last few years we've done a fantastic job making accelerators which do the ML operations really fast, so the amount of time it takes to do a matrix operation or all the linear algebra operations is a lot smaller. But the hosts and the CPUs that feed the data to these accelerators have not been able to keep up with them, and so there ends up being a bottleneck.

We thought that we could mitigate this by making the models more complex. But the accelerators have constraints on how much RAM they have, and, more importantly, where you deploy these models tends to be something like a mobile device, which restricts the amount of complexity you can introduce into your model. So that hasn't really panned out. The second approach people take is to try larger batch sizes, but larger batches require a larger amount of pre-processing to assemble each batch, which puts further pressure on the hosts. That's why this is becoming an increasingly large problem within Alphabet and even externally, and I'm going to talk to you about how you can solve it using tf.data.

tf.data is TensorFlow's data pre-processing framework. It's fast, it's flexible, and it's easy to use, and you can learn more about it at our guide.

For background for the rest of the talk, I'm going to go through a typical tf.data pipeline, and that will help us in the later stages. Suppose you have some data in some TFRecord files, which is your training data. You start off with a TFRecordDataset over that data. After that, you start doing your pre-processing. This is typically the bulk of the logic: if it's images, you're doing cropping, maybe flipping, and all sorts of things. After that, you shuffle the data so that you don't train to the order in which you see the examples in the input, and that helps you with your training accuracy. After that, you batch it so that the accelerator can make use of vectorized computation. Then you want to do some software pipelining, so that while the model is working on one batch of data, the pre-processing side can produce the next batch and everything works very efficiently. Finally, you feed this tf.data dataset to a Keras model so that you can start doing your training.

Given that basic pipeline, suppose you have a bottleneck. The first thing I'd recommend is to go through our single-host performance guide and try to utilize every trick and transformation that is available in tf.data to extract the maximum possible performance, so that you're using all the cores and so on. There's excellent information in the guide, and Jiri also did a great talk at the ML Tokyo Summit, which you can take a look at to learn more about this. So that's the first thing I'd recommend you do.

But suppose you've done that, you've tried all the different recommendations that we have, and you're still bottlenecked on the data pre-processing part. Don't worry, you're not alone. This is very common. We've increasingly seen this with a lot of internal customers, and so I'm very pleased to present a couple of solutions that we've been working on in the team to help you solve that problem.

The first idea is: why don't we just reuse the computation? Suppose you're playing around with different model architectures. Your input pre-processing part remains the same, and if it's expensive and time-consuming, why don't we just do it once, save it, and then every subsequent time just read from it quickly? We noticed a bunch of internal customers, teams within Alphabet, who were trying to do this on their own outside of tf.data, and we decided to bring it into tf.data and make it incredibly fast, flexible, and easy to use. This is what we call snapshot. The idea is what I just explained: you materialize the output of your data pre-processing once, and then you can use it many, many times. This is incredibly useful for playing around with different model architectures and, once you have settled on an architecture, for doing hyperparameter tuning. You can get that speed-up using snapshot.

Next, I'm going to go through the pipeline that we talked about before and see how you can add snapshot to it to make it faster. That's the original pipeline that we had, and notice that there's this pre-processing step which is expensive. With snapshot, you just add a snapshot transformation right after that step, with a directory path. With this, everything that is before the snapshot will be written to disk the first time it's run, and every subsequent time we just read from it and go through the rest of the steps as usual.

One thing I'd like to point out is that we place the snapshot at a particular location, before the shuffle. If it's after the shuffle, everything gets frozen, so you lose all the randomization that you get out of shuffle, because every subsequent time you're just going to be reading the same exact order again and again. That's why we introduce it at that stage in the pipeline.

We've developed snapshot internally. There are internal users and teams that are using it and deriving benefit from it, and now we're bringing it to the open source world. We published an RFC which has more information about it and some of the technical details. This should be available in TensorFlow 2.3, but I believe it will be available in the nightly shortly.

Remember, I talked about two ideas. The second idea comes from the fact that not all computation is reusable. Suppose you had some randomized crops in there. If you wrote them to disk and read them back, you would again lose that randomization, so snapshot is probably not applicable in that scenario. The second idea is to distribute the computation. The initial setup is that you have one host CPU which is driving a bunch of these accelerators. Now you can offload this computation from this host to a cluster, and you can utilize the computational power of all these different workers to feed the host, so that you're not bottlenecked on the input pre-processing anymore and things move fast.

This is the tf.data service. It's a tf.data feature that allows you to scale your workload horizontally, so if you're seeing slowness in your input pre-processing, you can start adding workers and it will just scale up. It has a master-worker architecture, where the master drives the work for the different workers, and it gives you fault tolerance, so if one of the workers fails you're still good and you can still make progress.

Let's see how you can use the tf.data service for the example that we have. Here, instead of having an expensive pre-processing step, let's say you have some randomized pre-processing. This is not snapshottable, because if you snapshot it you lose the randomization. We will provide you a binary which allows you to run the data service on the cluster manager that you like, whether it's Kubernetes or Cloud or something like that. Once you have that up and running, you just add a distribute transformation to your tf.data pipeline and provide the master address. Anything before the distribute transformation will now get run on the cluster that you have set up, and everything after it will run on the host, and this allows you to scale up. Again, note that because we are not doing any kind of freezing of the data, we can put this transformation as late as possible in the pipeline. Notice that I've put it after the shuffle transformation.

The service, like snapshot, has been developed with internal users. They've been using it, and it's been a game-changer in terms of TPU utilization. Now, again, we're bringing it to you. We published an RFC which was well received, and this should be available in 2.3 for you to play around with.

To summarize, what did I talk about today? With various trends in hardware and software, we've ended up in a scenario where a lot of machine learning jobs are getting bottlenecked on input pre-processing, and I've talked about two solutions that the tf.data team has been working on to help you solve this bottleneck. The first is snapshot, which allows you to reuse your pre-processing so that you don't have to do it multiple times. The second is the tf.data service, which allows you to distribute this computation to a cluster so that you get the scale-up that you need. I hope you play around with these and give us feedback. Thank you for your time.

[Music]
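The pipeline described in the talk (TFRecord source, pre-processing, snapshot, shuffle, batch, prefetch) could be sketched roughly as below. This is an illustrative sketch, not the speaker's code: the toy TFRecord file, the `preprocess` function, the temporary directories, and the buffer and batch sizes are all made up for the example. It assumes a TensorFlow release that provides `tf.data.experimental.snapshot` and `tf.data.AUTOTUNE`; newer releases also appear to offer a `Dataset.snapshot` method that plays the same role.

```python
import os
import tempfile

import tensorflow as tf


def write_toy_records(path, n=8):
    """Write n tiny tf.train.Example records, each holding one integer."""
    with tf.io.TFRecordWriter(path) as writer:
        for i in range(n):
            example = tf.train.Example(features=tf.train.Features(feature={
                "value": tf.train.Feature(
                    int64_list=tf.train.Int64List(value=[i])),
            }))
            writer.write(example.SerializeToString())


def preprocess(serialized):
    """Hypothetical stand-in for expensive, deterministic pre-processing."""
    parsed = tf.io.parse_single_example(
        serialized, {"value": tf.io.FixedLenFeature([], tf.int64)})
    return parsed["value"] * 10


def build_pipeline(record_path, snapshot_dir, batch_size=4):
    dataset = tf.data.TFRecordDataset([record_path])
    dataset = dataset.map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)
    # Snapshot sits right after the expensive step and before shuffle,
    # so the shuffle still happens on every pass over the data.
    dataset = dataset.apply(tf.data.experimental.snapshot(snapshot_dir))
    dataset = dataset.shuffle(buffer_size=8)
    dataset = dataset.batch(batch_size)
    # Prefetch overlaps producing the next batch with consuming this one.
    dataset = dataset.prefetch(tf.data.AUTOTUNE)
    return dataset


# Temporary working directory for the toy data (not cleaned up here).
workdir = tempfile.mkdtemp()
record_path = os.path.join(workdir, "train.tfrecord")
snapshot_dir = os.path.join(workdir, "snapshot")
write_toy_records(record_path)

dataset = build_pipeline(record_path, snapshot_dir)

# The first full pass is expected to write the snapshot; later passes
# are expected to read from it instead of re-running preprocess.
first_pass = sorted(
    int(v) for batch in dataset.as_numpy_iterator() for v in batch)
second_pass = sorted(
    int(v) for batch in dataset.as_numpy_iterator() for v in batch)
print(first_pass == second_pass)
```

In a real training job the resulting dataset would be passed to something like `model.fit(dataset)` rather than iterated by hand; the sorted comparison here only checks that both passes yield the same set of examples.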

Original Description

As model training becomes more distributed in nature, tf.data has evolved to be more distribution aware and performant. This talk presents tf.data tools for scaling TensorFlow data processing. In particular: tf.data service that allows your tf.data pipeline to run on a cluster of machines, and tf.data.snapshot that materializes the results to disk for reuse across multiple invocations.

Speaker: Rohan Jain - Staff Software Engineer

Resources:
GitHub Distributed tf.data service → https://goo.gle/2VrYDi2
tf.data: Build TensorFlow input pipelines → https://goo.gle/2VTnnjk
Better performance with the tf.data API → https://goo.gle/38wyKAy
GitHub tf.data snapshot → https://goo.gle/2v42Ai8
Watch all TensorFlow Dev Summit 2020 sessions → https://goo.gle/TFDS20
Subscribe to the TensorFlow YouTube channel → https://goo.gle/TensorFlow

This video teaches how to scale TensorFlow data processing using tf.data tools, specifically tf.data service and tf.data.snapshot, to improve model training performance.

Key Takeaways
  1. Use tf.data service to run data pipelines on a cluster of machines
  2. Utilize tf.data.snapshot to materialize results to disk for reuse
  3. Implement distributed data processing for model training
💡 Distributed data processing with tf.data can significantly improve model training performance
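To make the tf.data service takeaway concrete, here is a small single-process sketch that starts a dispatcher and one worker locally and routes a pipeline through them. It is an illustration under assumptions, not the setup from the talk: a real deployment would run these servers on separate machines, the `augment` function is a made-up stand-in for randomized pre-processing, and the dataset is a toy range. The talk refers to a "master"; the `tf.data.experimental.service` API used here calls that component the dispatcher, and the no-argument `DispatchServer()` constructor assumes a release newer than the 2.3 mentioned in the talk.

```python
import tensorflow as tf


def augment(x):
    """Hypothetical stand-in for randomized pre-processing (e.g. a crop)."""
    noise = tf.random.uniform([], minval=0.0, maxval=1.0)
    return x, noise


def run_service_demo(num_elements=10, batch_size=5):
    service = tf.data.experimental.service

    # Start a dispatcher and one worker in this process for demonstration.
    dispatcher = service.DispatchServer()
    dispatcher_address = dispatcher.target.split("://")[1]
    # Keep a reference so the worker stays alive while we iterate.
    worker = service.WorkerServer(
        service.WorkerConfig(dispatcher_address=dispatcher_address))

    dataset = tf.data.Dataset.range(num_elements)
    dataset = dataset.map(augment)
    dataset = dataset.shuffle(buffer_size=num_elements)
    # Everything above this line runs on the service workers;
    # everything below runs in the process that iterates the dataset.
    dataset = dataset.apply(service.distribute(
        processing_mode="parallel_epochs", service=dispatcher.target))
    dataset = dataset.batch(batch_size).prefetch(tf.data.AUTOTUNE)

    ids = []
    for batch_ids, _ in dataset.as_numpy_iterator():
        ids.extend(int(i) for i in batch_ids)
    return sorted(ids)


ids = run_service_demo()
print(ids)
```

Because nothing is written to disk, the random augmentation and the shuffle are redone on every pass, which is why the distribute step can sit after the shuffle. With `processing_mode="parallel_epochs"` each worker processes the full dataset, so adding workers to this toy example would yield each element once per worker.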
