PyTorch on Google Cloud TPUs - Google, Salesforce, Facebook

PyTorch · Intermediate ·📰 AI News & Updates ·6y ago

Skills: LLM Foundations 80% · LLM Engineering 80% · Fine-tuning LLMs 70% · Prompt Craft 60% · ML Maths Basics 50%

Key Takeaways

The video discusses the collaboration between Google, Facebook, and Salesforce to enable PyTorch support on Google Cloud TPUs, including experimental support for Cloud TPU Pods. The integration is built on torch_xla (PyTorch/XLA), a lazy tensor extension that records PyTorch operations into graphs, which the XLA compiler then compiles and runs on TPU hardware.

Full Transcript

Okay, I'm not Vishal, but I wanted to personally introduce this session. This one is actually pretty near and dear to my heart. This is a collaboration that's been going for almost a year and a half now between Google, Facebook, and Salesforce, and, as I mentioned in the keynote, let's just say a lot of Googlers ate a lot of Facebook food, a lot of Facebookers ate a lot of Google food, and the Salesforce folks ate a lot of, well, a lot of Facebook food. So I want to introduce Vishal Mishra from Google; after him, Bryan McCann will speak from a user perspective; and then finally Ailing Zhang from Facebook will talk about how we actually built the stack, which is all the magic that actually makes this happen. So please welcome Vishal. [Music]

Oh, thank you, Joe. Hello everybody, I'm Vishal. I'm an engineering manager at Google and I work on Cloud TPUs. Today I'd like to talk to everybody about how Cloud TPUs work with PyTorch. We're very excited to present a really good session with a few talks, so let's just get into it.

First of all, who was here last year? Quick show of hands. Wow. So as you probably saw, we were here, and we're excited to be back. Last year we talked about how we support PyTorch on GCP, the Google Cloud Platform. We announced availability of Deep Learning VMs, and we talked about integrations with Kubeflow and TensorBoard. But that's not all. We actually spoke about a promise, a promise to bring Cloud TPUs to PyTorch. We've been working hard on that, working closely with the core team, and today we're very happy to announce the availability of Cloud TPUs in PyTorch 1.3. You can follow along on GitHub, and as we proceed and improve this, we'd love to get your feedback.

So a little bit about what this means to you as a user: it means you can now run your standard PyTorch code on Cloud TPUs. Right now, yes. I mean, take what you have today and just run it on these fantastic accelerators, the way you're used to running your code today, in just the normal fashion.
And behind that ease of use there's a lot going on, though. PyTorch behind the scenes talks to Cloud TPUs via XLA, the XLA compiler, which actually uses computational graphs. To make things easy, though, what we have done is create this new architecture called lazy tensor, which behind the scenes compiles and creates these graphs, and we're going to talk a little bit more about that later in the session. Those graphs are then automatically compiled and executed on the TPU hardware. This integration has been built from the ground up to scale to the real-world workloads that you need.

And maybe a little bit about what Cloud TPUs do for scalability. Most of you might be aware that Cloud TPUs are available in two major configurations: the individual host or single systems, and from there these massive multi-rack supercomputers, which can go up to eight racks and many, many hundreds of petaflops of compute. These pods, as we call them, Cloud TPU Pods, are capable of accelerating your workloads substantially, I mean way beyond what any single-host system can do. So needless to say, we're committed to supporting PyTorch not just on single-host systems but also on these fantastic pods. So, also announcing today: experimental-level support for Cloud TPU Pods is now available in PyTorch 1.3.

Let's talk a little bit about what these pods can do. We've already seen in our testing some fantastic acceleration, for example in ResNet-50 training, comparing what can be done on a single host, like a v3-8 device, to what can be done on Cloud TPU Pod slices ranging from 32 and 64 cores to larger slices, and the pods can go all the way up to 2048 cores. So we're super excited to continue working on this and to continue to enable and improve the scalability that these pods can bring to PyTorch workloads.

A lot of what we're talking about today is already available for you to try for free. Those who are
familiar with Colab, those who are here right now, those who are watching online: I invite you to go hit the PyTorch TPU link here on Colab and start trying it out right now. The Colabs are fantastic, they're free, and they will give you a good idea of what we can do with these things.

None of what we've talked about today would have been possible were it not for the collaboration that has happened between multiple teams, notably the Cloud TPU team, the PyTorch core team, and the larger PyTorch community. Some of these folks are going to talk today. Specifically, coming next will be Bryan McCann from Salesforce Research, who have been an early adopter of Cloud TPUs and PyTorch, and we'd love to hear their first-hand experience. And then Ailing Zhang, who's from Facebook engineering; we worked super closely with the team on integrating the technical internals of TPUs via the XLA compiler, and she'll talk about the details of that integration. So thank you all for being here, for sitting through the fire issue and still being here to listen to this talk. We're very excited to present this, and now I'd like to call Bryan McCann to the stage. Thank you. [Music] [Applause]

I apologize for the scratchy voice. I had an off-site with the team yesterday and I got too excited, so I yelled a lot. My name is Bryan McCann, I'm a research scientist at Salesforce Research, and I'm going to talk to you for the next few minutes about my perspective as a user experimenting with the PyTorch/XLA software that allows me to use PyTorch on Cloud TPUs.

For a little context, at Salesforce Research we've really been a PyTorch shop for as long as I can remember, the last few years. It's always been the case in pure research, of course, and it's worked its way into applied research, and it's inching its way closer and closer to being an integral part of the production systems that come out of research as well. So PyTorch is really our go-to framework
across all of our different domains. Pretty much every researcher in natural language processing, reinforcement learning, computer vision, even the people working on generalization bounds, are using PyTorch to get their experimental results. The only time we've deviated from this is in the recent past, over the last few months, when we've been exploring large-scale language modeling and controllable text generation. It was really on this project that I came to fully appreciate the value-add of Cloud TPUs, not only in accelerating my research but also in impacting the direction of it.

So for the rest of the talk I'm just going to dig into some example code that I wrote up, because I was very excited to see this possibility of writing things in PyTorch and then running them on Cloud TPUs, so I could have my whole team writing their code and modeling the way they want to, but still have the easy scaling that comes with Cloud TPUs. I was obviously very excited about this. So I'm just going to walk you through some of the code I wrote when I was first approaching it and doing some basic language modeling, and I want to highlight some things I noticed so that you might have an even smoother experience and an easier time getting started.

The first thing I was thinking about when I was jumping into this was the model. Because I was using transformer models for language modeling, I really didn't have to do anything; no lines of code changed, which was great. This was quite a relief, and it's going to be the case for most standard models. If you're in vision and you're using ResNet or something like that, or transformers for language modeling or any other area of NLP, you should mostly be good to go. We'll talk a little later about logging and debugging, in case you do see some slowdowns or something like that. If you do have
dynamic models where the inputs are changing a lot, you're going to want to stay tuned for that later part of the talk.

So with models out of the way, I then started to think about devices: how do I actually get my models onto the TPUs? This is super simple. You're going to use torch_xla; that's your interface between the PyTorch code you write and the actual XLA and Cloud TPUs. When you're dealing with TPU devices or CPU devices, you need torch_xla to handle that relationship, and we're going to talk a little bit about how that factors into all the other parts of your modeling code as well.

With datasets it's a very similar story: you can pretty much keep all your code the same. That was really nice for me, because when I was using Cloud TPUs originally, there was this extra step that I had to go through where I'd have to run all my pre-processing and upload to cloud storage, and the TPUs would only read and write from these buckets. But in the early stages of development, when maybe I don't even know exactly what my research direction is, or I don't know what the model's supposed to be, or maybe even the dataset is changing underneath me, it was really helpful working in PyTorch on TPUs, because I could have my lazy data loaders and have the quick, rapid prototyping where I don't have to go through that extra stage of uploading to cloud storage or anything like that. I can just use my normal datasets the way they are.

Now for data parallelism, which was kind of essential for actually using these TPUs the way you would want: you really want to think about distributed data parallel. If you're familiar with DataParallel and DistributedDataParallel in PyTorch, the distributed version is the way you want to think. This is going to require the fewest changes to your code, and even though the other version technically exists, just stick to this one; it'll make your life easier. All you have to do is use torch_xla's
parallel loader: wrap your data loader, and then get the specific data loader for your device. After that, just treat it like a normal data loader. So this is also quite smooth. The one thing you don't want to forget here is to make sure you use the torch_xla distributed launcher as well. If you're familiar with torch.distributed.launch and things like that, you want to use the torch_xla version, because they want to handle that. The one thing you'll want to make sure you add is this spawn call at the beginning of your script. We don't typically do that in distributed data parallel on normal PyTorch, but you'll want to add it here.

And the really nice part about this: before, when I wanted to scale up to many GPUs, I might have to spin up multiple pods and run torch.distributed.launch on each one, and then they coordinate. While that's pretty easy as it is, with the TPUs you can really see the evidence of the ease of scaling, where I'm always going to have only one master node. Whether I'm using eight devices or 512 devices or a thousand or so devices, regardless of how many TPU devices you're using, you're just going to have that one host, and that's all you manage, which is really convenient.

So now we can talk about the training, and the recurring theme here is how little I actually had to do. Everything is going to look like PyTorch code. Of course, execution isn't happening the way you would expect: it's all lazy, and results aren't computed until they're required. It's really just this one line, the torch_xla optimizer step, that's the one thing you need to not forget, and this is what's going to trigger all of your execution. As long as you do that, you should be okay.

You might want to do some other kinds of things with your loss or your gradients, and this is one area where I would say just be a little careful and communicate with the team. When I first started, gradient clipping was implemented in PyTorch in a way that relied
on the tensor's item() functionality, and that forces a transfer to CPU. We don't want to do that; as you know, we want to stay away from the CPU as much as possible when we're using XLA. The good news here is the team's really quick with this kind of stuff, and they've already rewritten lots of different things like this, and it's bundled with the PyTorch version that comes with torch_xla. So if you do jump in like me and you notice anything weird like this, just make sure to point it out and they'll make sure to help you out with it.

On that note, I would really suggest learning from the examples that are already there. Documentation is improving as the project grows, but there are some little things that you might not expect. For example, the data loaders I was talking about before typically return tuples of index and object. If you're like me, at least, and you're thinking in Python terms, I would expect that to be the result of calling enumerate on my data loader. So that's what I did, because I wanted the index and the object, and I ended up with a nested tuple and things like that, and it just threw me a little bit. But if you look at the examples and see how they're using their objects, it'll be a smooth ride.

Logging and debugging is the kind of thing you probably won't have to think about very much, especially if you're sticking to really standard use cases. But the key tool here is going to be the XLA metrics report. Any time you file an issue or a bug, just make sure you run a few iterations where you get these metrics reported, and you can give those over to the team. It's basically keeping a bunch of counters of all the different allocations on TPU and CPU. It's going to report your compile time, and that's where you look to see whether you're recompiling the graph too often, things like that. The main takeaways for me were essentially to look at the
compile time, make sure that's decreasing as training goes on, and then look for counters that have the aten:: namespace, because those are CPU allocations, and we obviously don't want those.

The last bit I want to talk about is actually getting started. Most of the torch_xla (PyTorch/XLA) code that you'll come across was tested in the GCE environment, the Google Compute Engine. I actually wasn't in that environment; I was on GKE, the Kubernetes Engine, and I can attest to the fact that you can get it working in all these different environments just fine. I would just recommend you stick to the Docker images that the team gives you and follow their instructions, and they'll get you through it. Don't worry about any little bumps in the road as you're getting your environment set up.

As a parting piece of advice: at least once, even if your experience is perfectly seamless and you just run your code on TPUs, turn all the debugging environment variables on, turn on the metrics reporting, take a look at it, and get some intuition for what's going on. They're doing this amazing work making it super easy for us, but it helps to have some intuition yourself. This will help you catch unnecessary transfers to CPU, and it'll help you understand how your model is interacting with these devices more effectively. With that, I'll say thank you, and I want to bring on Ailing, who's going to tell you a little bit more about how all this actually works. Thank you. [Music]

Hello everyone, I'm Ailing. I'm a software engineer here at Facebook. We have been working with Google on the PyTorch/XLA integration for a long time, and today I'm going to talk about the PyTorch/XLA integration and, behind the scenes, what we have been doing to make this transition really easy.

First, let's take a look at an overview of the integration from end to end. With a simple PyTorch code snippet on the left, torch_xla
translates it to an IR graph, optimizes it, compiles it into machine instructions, and then runs it on TPU devices. So this is the whole process. Now let's step into each step and see what's happening under the hood.

You have seen from Bryan's talk that, from the user end, XLA tensors share exactly the same semantics as tensors on other PyTorch devices. Your PyTorch models work on TPU devices with just very minimal changes. On one hand, we really want to keep the eager user experience of PyTorch; on the other hand, we want to make use of the graph-level optimization that the compiler gives us. That's why PyTorch/XLA is designed to be a lazy tensor extension to PyTorch, where it defers evaluation until necessary.

What does that mean? Let's take a look at an example. We're all familiar with how PyTorch typically runs operations: it just runs them eagerly, waiting for each to finish and then moving on to the next. For example, if I have a convolution node here, I will wait for it to finish and then move on to the next; I will wait for ReLU to finish and then move on to max pool. But this is different in the XLA case, because XLA tensors are lazily evaluated. As users run the Python code on the left, XLA simply records the operations in the graph, and when the user requires the results, like with the print statement on the left, the system automatically triggers the graph execution to return the result back to the user. So from your perspective as a PyTorch user, all of this is invisible; you don't really have to worry about it.

Now let's talk about how the graphs are optimized and executed. On the client side, we cache compilations by hashing the graph, and this speeds up training a lot. With deferred execution, the compiler gets the chance to see the whole graph and do op fusion and CSE. These compiler tricks work great for static shapes. In real training, we also overlap the TPU computation of step n with the graph construction of step n+1, so that we maximize the throughput. On the right,
it shows ResNet-50 training over time and how long each step takes. You can see the compilation occurring at the start of the training, and then after the graph is optimized there is a significant speedup.

One thing I really want to highlight here is this process that handles building the graphs, sending them to the XLA device, optimizing, running, and then retrieving the results: this whole process is invisible to users. torch_xla handles it for you behind the scenes. PyTorch/XLA is different from PyTorch's other eager backends, but these differences are mostly hidden away from users. Working with TPUs, you just write PyTorch code as you normally do and it will just work. Even your favorite debugging tools like pdb and print just work as expected.

So that's what I wanted to talk about for the technical part of the project. Most importantly, please try it out. Our code base is hosted on GitHub and we do regular releases. This launch is a great milestone for us, but we will definitely keep improving and bringing exciting new features, so your feedback is really valuable to us. We would love to hear from you. You can find us on GitHub issues, and we will have a few live demos in the poster session: the Colab demo, TPU Pod training, and fairseq training. Please come check out our demos. Thank you. [Applause]
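
The lazy evaluation and compilation caching described in the last part of the talk can be illustrated with a toy, pure-Python sketch. This is not torch_xla code: `LazyValue`, `tensor`, `COMPILE_CACHE`, `STATS`, and `step` are hypothetical names invented for illustration, and the "compiler" here just builds Python closures. It only shows the idea that operations are recorded into a graph, that asking for a value triggers execution, and that a graph with the same structure is compiled once and then reused.

```python
import hashlib

COMPILE_CACHE = {}            # graph hash -> compiled callable (hypothetical cache)
STATS = {"compiles": 0}       # counts how often a graph had to be compiled


class LazyValue:
    """Records operations in a graph instead of running them eagerly."""

    def __init__(self, op, inputs=(), const=None):
        self.op, self.inputs, self.const = op, inputs, const

    def __add__(self, other):
        return LazyValue("add", (self, _wrap(other)))

    def __mul__(self, other):
        return LazyValue("mul", (self, _wrap(other)))

    def relu(self):
        return LazyValue("relu", (self,))

    def structure(self):
        """Graph shape only; leaf data is left out so the hash ignores values."""
        if self.op == "leaf":
            return "leaf"
        return self.op + "(" + ",".join(i.structure() for i in self.inputs) + ")"

    def leaves(self):
        """Leaf values in depth-first, left-to-right order."""
        if self.op == "leaf":
            return [self.const]
        return [v for i in self.inputs for v in i.leaves()]

    def item(self):
        """Requesting a concrete result is what triggers execution."""
        key = hashlib.sha256(self.structure().encode()).hexdigest()
        if key not in COMPILE_CACHE:          # cache miss: compile this graph once
            COMPILE_CACHE[key] = _compile(self)
        return COMPILE_CACHE[key](self.leaves())


def _wrap(value):
    return value if isinstance(value, LazyValue) else LazyValue("leaf", const=float(value))


def tensor(value):
    return LazyValue("leaf", const=float(value))


def _compile(node):
    """Stand-in for a real compiler: turns the recorded graph into one callable."""
    STATS["compiles"] += 1
    next_leaf = [0]

    def build(n):
        if n.op == "leaf":
            idx = next_leaf[0]
            next_leaf[0] += 1
            return lambda leaves: leaves[idx]
        fns = [build(i) for i in n.inputs]
        if n.op == "add":
            return lambda leaves: fns[0](leaves) + fns[1](leaves)
        if n.op == "mul":
            return lambda leaves: fns[0](leaves) * fns[1](leaves)
        if n.op == "relu":
            return lambda leaves: max(0.0, fns[0](leaves))
        raise ValueError("unknown op: " + n.op)

    return build(node)


def step(x):
    # Nothing is computed here; the three operations are only recorded.
    return (tensor(x) * 2.0 + 1.0).relu()


# Three "training steps" with the same graph structure but different data.
results = [step(x).item() for x in (-3.0, 0.5, 4.0)]
print(results, STATS["compiles"])   # prints: [0.0, 2.0, 9.0] 1
```

Because the cache key depends only on the graph's structure, the second and third steps reuse the first compilation. A step that records a different graph would compile again, which is the toy analogue of the talk's remark that these techniques work best when shapes stay static.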

Original Description

Google Cloud TPU support in PyTorch is now broadly available. Hear how engineers from Facebook, Google, and Salesforce worked together to enable and pilot Google Cloud TPU support in PyTorch, including experimental support for Cloud TPU Pods.

The video explains how to run PyTorch on Google Cloud TPUs for large-scale model training using torch_xla: device placement, the parallel loader, the spawn-based distributed launcher, the optimizer step, and the metrics report for debugging. It also covers the collaboration between Google, Facebook, and Salesforce that enabled PyTorch support on Cloud TPUs, and how the lazy tensor design lets the XLA compiler optimize whole graphs.

Key Takeaways

Use torch_xla as the interface between PyTorch code and TPU devices
Wrap the data loader with torch_xla's parallel loader, then get the loader for your device
Use torch_xla's distributed launcher to scale across multiple TPU devices
Add the spawn call at the beginning of the script
Rely on the lazy tensor design for graph-level optimization
Deferred execution lets the compiler see the whole graph for fusion, and compilations are cached by graph hash
TPU computation of step n overlaps with graph construction of step n+1 to maximize throughput

💡 The collaboration between Google, Facebook, and Salesforce has enabled PyTorch support on Google Cloud TPUs, allowing for accelerated workloads and large-scale model training and deployment.
