Inside TensorFlow: Eager execution runtime
Key Takeaways
The video discusses the eager execution runtime in TensorFlow, presented by Software Engineer Alex Passos, providing a technical deep dive into the framework.
Full Transcript
Hi, my name is Alex, and I'm here again, this time to talk about the TensorFlow eager execution runtime. This is a very broad topic and there are lots and lots of things we could cover, so I'm going to lightly graze many different parts of our code base. I'll give you a lot of function names, file names, and things like that, which you can use to familiarize yourself with something. But if you're in the room right now, by all means ask questions. There's some buffer time at the end to account for variability, and I think the more we can maximize shared understanding of this stuff, the better.

The way I thought we could go about this is to do a very, very deep dive on what actually happens in TensorFlow, starting from TF 2, when you type a very simple line of code, in this case tf.nn.relu of some Python numbers. If you were to start doing this, probably the first thing you'd do is grep the TensorFlow code base to find where we define relu. If you do that, you'll find that we have some definitions of relu in Keras, but you won't find a single definition of relu itself in core TensorFlow. That might be a little surprising at first, and it might put a damper on this whole "let's find out what actually happens when you run relu" business. But the reason is that relu is an op that we've implemented in C++, and we didn't need to put a complicated Python API around it; we can just generate the Python code to call relu.

So the way it's actually defined, it's defined using the same mechanism we use to register all the ops in TensorFlow. For every core operation in TensorFlow that's visible to the runtime, we have a registration that looks like this: REGISTER_OP. It takes a name, and you say how many inputs, how many outputs, and what attributes it has. If you want to know more about attributes and what things are allowed in there, since they are how we can make our ops polymorphic and have the same operation have different types
of outputs or different numbers of outputs and things like that, there's a lot of documentation about this on tensorflow.org if you search for how to define a new op. Another interesting thing in there is that we also register a shape inference function. Relu, thankfully, is one of the simplest ops we have: it has one input and one output, and they have the same shape, so we can use a pre-built shape inference function that just says the shape does not change. Other ops will have vastly more complicated shape inference functions. The nice thing is that we can run these functions offline, while building graphs, without actually having the values of any tensors, and still be able to prove things about the shapes of intermediate tensors and outputs of your computation. This is maybe the best tool we have for catching bugs now. So if you want to look at the shape inference code, that's where you hook into.

Now that we have that registration, we run some complicated code that generates Python code to actually call relu. If you look in bazel-genfiles, you'll find a file named gen_nn_ops.py, and this file has the actual def relu that we call. As you can see, it's not pretty, and there's a lot of stuff going on in there. The first line deals with dispatching, so that we can define relu not just for normal tensors but also optionally for sparse tensors, ragged tensors, and other composite types. The second line has tf_export, and what this does is define the TensorFlow public API. Every symbol that you get when you're using TensorFlow via tf.something is defined somewhere by a tf_export decorator like this one. There will be a future video on how exactly this works and why we do things this way instead of relying on Python's normal namespacing mechanism, but you can probably guess that it is because TensorFlow is very complicated.

Essentially, you'll see in this generated code for relu that it has a bunch of cases in it. There are roughly four: you have an eager
fast path, an eager slow path, a graph mode path, and kind of a side hook for the symbolic execution. Here, let's focus on the eager paths.

In the first one, the first thing that we're actually doing is checking to see if we're in eager mode or not, and to do that we look at this context thing. This context is part of the core of the TensorFlow v2 runtime. It's the moral equivalent of the session, but it's longer lived than the session and represents more things.

So what is it? From Python, the context is this class that's defined in a file called context.py, and it's a collection of a lot of things that your Python program needs to be aware of to connect to the TensorFlow runtime. It stores things like: am I in eager mode or in graph mode? If someone used a with tf.device block, what device am I supposed to be executing code on? It also stores things like your name scope, and many other things.

In general, the things the context stores that can change during the execution of your program are all stored in thread-local stacks. Stacks, because we have these nested things like with tf.device, with tf.device, with tf.device, so you'd like to be able to pop the stack to go back to where you were. And thread-local, because it's very important to us that the TensorFlow runtime itself be thread agnostic. If you write two threads, and one is doing a reinforcement learning learner and the other is doing an agent that's talking to some game, then when the agent wants to use its GPU it shouldn't necessarily make the learner use the GPU, and vice versa. Providing some kind of isolation between the threads is what we felt was the right way, so that at least each single thread can feel like it is its own single-threaded Python program. We use this a lot: distribution strategies like MirroredStrategy use a lot of threads under the hood, so it's really important that things are thread safe.
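The thread-local stack idea described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not TensorFlow's context.py: the class name SketchContext, the device strings, and the demo function are all hypothetical.

```python
import contextlib
import threading


class SketchContext:
    """Hypothetical stand-in for an eager context; not TensorFlow's class."""

    def __init__(self):
        # threading.local gives every thread its own attribute namespace.
        self._local = threading.local()

    def _device_stack(self):
        # Each thread lazily gets its own stack, so scopes never leak
        # from one thread into another.
        stack = getattr(self._local, "device_stack", None)
        if stack is None:
            stack = [""]  # empty string: no device requested yet
            self._local.device_stack = stack
        return stack

    @property
    def device_name(self):
        return self._device_stack()[-1]

    @contextlib.contextmanager
    def device(self, name):
        stack = self._device_stack()
        stack.append(name)      # entering a nested scope pushes
        try:
            yield
        finally:
            stack.pop()         # leaving it pops back to the outer scope


def demo():
    ctx = SketchContext()
    seen = {}

    def agent():
        with ctx.device("/GPU:0"):
            seen["agent"] = ctx.device_name

    def learner():
        # Never asked for a device, so it must not inherit the agent's.
        seen["learner"] = ctx.device_name

    threads = [threading.Thread(target=agent), threading.Thread(target=learner)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    with ctx.device("/CPU:0"):
        with ctx.device("/GPU:0"):
            seen["inner"] = ctx.device_name
        seen["restored"] = ctx.device_name
    return seen
```

Calling demo() shows both properties at once: the learner thread does not see the agent's device scope, and leaving the inner scope restores the outer one.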
The Python context is essentially, mostly, a wrapper around the C++ context that's available through the TensorFlow C API, and this is the core thing. It has a bunch more methods than just these; it can do a lot more things than just listing devices. One thing I'd like to call out is that right now there are some things that are done in Python, like storing whether we're in eager mode or graph mode, and some things that are done in C++. The set of things done in Python and the set of things done in C++ are likely to change. I think as TensorFlow evolves, more and more things should migrate from the Python context to the C++ context, which will make things more language agnostic but also faster. If everything in the context were in C++, then all that generated Python code could have just been C++ code, and we would be able to get out of the overhead of executing Python much sooner and remove performance problems in our APIs.

So once you know you're in eager mode, we try to do this fast-path execute. The fast path is some complicated C code that mostly does the same things that the fallback case is trying to do, so I don't think it's necessarily worth reading that. I would rather look at the simpler code in the fallback path, and it's here. This is where we actually implement the relu function.

What are the interesting things here? First, we have this function called args_to_matching_eager. What it does is take a bunch of things that can be tensors and convert them all to tensors of the same dtype. In the case of relu there's only one, but other ops like add or matmul take multiple inputs, some of which might be tensors, while others might be NumPy arrays, Python lists, variables, or other objects that you can convert to a tensor but that are not necessarily tensors. This is the thing that's responsible for canonicalizing everything. Then, once we have... but here you might want to
stop me a little and ask: what is a tensor at the level of the TensorFlow runtime? The EagerTensor class is a thing that's half implemented in Python and half implemented in the Python C API, but it's a relatively thin wrapper around this thing that we call a tensor handle, which you can see in our TensorFlow C API. This tensor handle represents a potentially yet-to-be-computed tensor, which is going to live on some device. We know what device it's going to live on because we know on what device we executed the operation that generates the tensor.

There are roughly three things you can do to a tensor handle, really: you can ask about its shape and dtype and stuff, that's one; you can give it to execute, that's another one; and you can copy it to a device. Also, I guess there are four things, and the fourth thing is you can ask, hey, what is its value? When you want to force the evaluation of its value, the TensorFlow runtime might have to pause until it can give you the result. It might need to copy the tensor from some remote device and do all sorts of other things, but once you're done you get a TF_Tensor, which is essentially just a pointer to some data.

Every other operation that is not looking at the value of the tensor is something that the runtime is free to delay or reorder, as long as it respects the intended semantics of your program. This is very important, because when you're working with things like GPUs and TPUs, you'd like to be able to dispatch operations to run on the hardware as quickly as you can, in a way that's asynchronous, so the Python thread can race ahead of the GPU as it's executing operations. That way it can fully utilize the GPU hardware and get the maximum performance. Even in the eager runtime we can do this, if the GPU kernels are sufficiently heavy.

There are a few cases in which we need to copy tensors even if they're local on the CPU, and some of those cases are going away. For example, currently we copy string tensors every
time you try to look at their values, because TensorFlow's internal string representation is a complicated C++ class and we would like to have an ABI-stable C type. We're actually changing our internal string representation; there's an RFC you can look at about it that will make the internal and external representations both be an ABI-stable C type.

But internally, what is the tensor handle? This is in fact very, very similar to the first implementation of tensor handle. Now it's hundreds of lines of code spread across many files, but all that you really need in a tensor handle is to know what device it's on and what data you have. The data can either be a concrete tensor, a value that has already been computed, or a future that might be computed later because it's remote, or asynchronous, or on some other device. The core of it is just this: it is the future that we hand around in the representation. The current code is much more complicated than this, and you might want to look at it to see why it's complicated, but the core idea is there.

So, popping the stack: you now know what a tensor is, and you've probably figured out that converting that list of Python integers to a tensor is not terribly hard; there's C code that does that. Now it's time for us to execute relu. Great.

Here is one thing to note about how the TensorFlow runtime works, which is that the line that is selected there is not an uncontroversial choice. Overall, there are two ways we could have gone about this. We could have had a closed-domain API, in which TensorFlow would export a symbol called relu, another called matmul, another called conv, and so on. Or we could have this open-domain API that we have here, where we have a symbol called execute that takes the name of an operation, relu, and a bunch of metadata about it. There are advantages and disadvantages in both cases. In general, the closed-domain case, where you just have a single endpoint in your API for every operation you want to run,
that is easier to make fast. However, it's a little tricky in TensorFlow, because the pre-existing graph runtime has a few layers of indirection between a node in the graph and the actual kernel that it executes. Indeed, between TensorFlow versions we can, without breaking GraphDef compatibility, replace some kernels: some things that were handled by a single kernel now have to be handled by multiple kernels. So to preserve this layer of indirection, we felt it was more natural to have an open-domain API.

However, as you can see just from the fact that there's a string in there, executing this can only be so fast, because we need to somehow take this string, these attributes, and some properties of these inputs and turn that into a kernel. That means we can definitely not execute kernels any faster than it takes to at least hash that string. So there are trade-offs here, but we felt that preserving the flexibility that you have in graph mode was the most important thing.

So how do you actually use execute? When you call this line in Python, what actually happens? Execute is something that is defined in the TensorFlow C API, and to use it you do something like this. You first create a status, so that you can find out if things failed. Then you make a new op, you add inputs, you set attributes, you allocate some memory to put pointers to the return values, and then you call execute. Finally, you delete that op. This is fairly straightforward, and if you're familiar with the TensorFlow C API for building graphs, you'll see that this is very similar to that C API.

This is about as good as you could possibly get for Python code in an open-domain setup, but it's a little sad that when you build a TFE_Op you need to add the inputs to that op. This means that if you're executing an op in a tight loop, you either have to have the same inputs on every iteration of the loop, or you have
to allocate a whole other TFE_Op. For Python we really don't have a way of making this better, but for other languages like Swift, or languages that have access to a compiler, we should be able to cache the static bits involved in making a TFE_Op that don't really change, like all your matmuls being the same, and separate those from the dynamic bits, which are the inputs that should actually change. If you do this, we could in principle make this open-domain approach as fast as a closed-domain approach. This is maybe a minor refactoring that we should do at some point.

So what does execute do? If you go through a few layers of APIs, you end up in this function called EagerExecute, and it has a lot of things in it. The first interesting one is this MaybeUpdateOpDevice, which you might call the placer, where we get to decide where we're going to execute each operation. It looks at some complicated heuristics. In general, you can think of it this way: if you're operating on a resource tensor, we'll run your operation on the device that has that resource, because anything else will fail. Otherwise, if you have a tf.device annotation somewhere, we'll run it there. Otherwise, if you don't have any of those things, we'll see what devices are available to execute this operation and run on whatever device we think is going to be the fastest, which is how TensorFlow gets away with using a GPU for you even if you don't specify with tf.device GPU in there.

Then you have some forks in there about whether we are local or remote. Once you know that you're in the local case, what we want to do is very quickly do that string processing that we needed to do to find what kernel we should be executing. There's fast code that takes the attributes and the name of an op and gives you a cache key that is looked up in the context, where we store the kernel. And here you
might think something's a little funny, because usually you think of operations as functions, not as classes, but clearly there's a KernelAndDevice class in there, so we probably have an object. The reason is that for many types of kernels that we want to execute, especially things involving GPUs but also some stateful kernels, you want to keep some state in that object. Ideally that state would be managed by the runtime via the resource managers, but that doesn't always happen now.

Once you have a kernel, the kernel tells us on what devices it wants its inputs. You would think that the kernel would want all its inputs on the device that it is executing on, but that turns out to be too narrow a view. For some kernels, especially GPU kernels, you might want some inputs on the host CPU that's attached to the GPU. The reason is this: imagine your GPU kernel is generating a big matrix of random numbers. You would like to know how many numbers you're going to generate so that you can run the memory allocation before you enqueue your CUDA kernel. If the shape of the random number vector you're going to generate were on the GPU, you'd need to fetch it back to the CPU to do that allocation, and that would be terribly slow. So instead TensorFlow says, I expect that input to be on the local CPU, and this is the function that validates this. But if you're instead trying to run a large convolution and one of the inputs is on the CPU, this maybe-copy step will move that input to the GPU for you, which is faster than just running your convolution on the CPU.

Then, finally, we get to decide whether we're in sync or async mode. We first create this node that represents all the computation that has to happen to execute this kernel. If we're in async mode, we throw this in a queue and return control immediately; if we're in sync mode, we run it now. This async/sync distinction here is complicated because there's another
layer of asynchrony that's separate from the fact that our GPU runtime is itself asynchronous. This is kind of a patch to make the TensorFlow CPU runtime, which is currently synchronous, act asynchronously, to try to get a little more performance in eager mode. It's a little sad, because you lose your error messages once you get very asynchronous there, and we currently do not run shape inference in this asynchronous mode. I think as we rework the TensorFlow runtime, which the team has a large effort to do now, we have a chance to fix this and have a single code path for synchronous and asynchronous execution. But for now we have this.

And then, finally, now that we have the kernel, we've got to call it. That's easy, right? To call a kernel we have to make an OpKernelContext, and to do that you need to fill in this Params struct, which I put here, and which you can clearly read on the slide, because it could definitely fit with a very large and readable font. So we don't do that. This is, sadly, something where the original TensorFlow API for kernels had only one caller, which was the TensorFlow executor, so it was very easy to just add parameters and make the calling convention harder and harder, because there was only one place to fix. We're now trying to trim this back and simplify it, so it will likely get better.

For the eager runtime, we have this class, KernelAndDevice, that knows how to call a kernel while requiring a lot fewer things about it. Mostly all it needs is the inputs, a place for you to populate with outputs, and some information in case you want to profile things, such as how long it takes to execute each node, or collect step stats, or what graphs you're using if you're executing a function, things like that.

So now that we have this, we can run the kernel. What do the kernels look like? Relu happens to have one of the simpler kernels we have in TensorFlow. It's a unary element-wise op. We have a base class for this that handles a lot of the logic around memory allocation and buffer reuse, so that tf.nn.relu
by default will reuse its input buffer if the TensorFlow runtime knows that no other op yet to be executed is going to need it. But once all the boilerplate is dealt with, all that this kernel has to do is execute the functor.

This is another place where you'd be surprised that we use an object where you'd think we should use a function. In principle, relu is just a function; it doesn't keep any state, so there should be no reason for us to make an object for it. Except that C++ does not let you declare a templated function in the header file but specialize and define it in a C++ file, and this is something very useful for us. As you can see, device is a parameter in there, and one of those devices is GPU devices. For GPU devices, we'd like to put the function in a file that we're going to compile with the CUDA compiler, and we would like to not compile our entire code base with the CUDA compiler. So being able to declare this in a single place, where we can generate a whole specialization of this class without having access to the CUDA compiler, while having a file on the side that's just going to fill in this implementation after running the CUDA compiler, is very useful.

As you can also see here, TensorFlow kernels tend to be highly templated. Most are templated, like this one, on the device that it's going to execute on and on the dtypes, so that we generate fast, specialized code for every core numerical type supported by TensorFlow. This is an incentive to keep the set of core numerical types supported by TensorFlow relatively small, as otherwise our binary size will grow, but it has the nice side effect that the code generated is very fast.

One of the things that lets us generate very fast code for this, which you will see if you look into the implementation of the functors, is that we can use Eigen to generate this code for us. The relu functor in particular is very easy to write, because it's just a component-wise max
between the tensor you have as input and zero. Eigen turns out to be a very useful tool for writing this. It lets us write this code once, and it will generate specializations for every dtype we are interested in, and also for CPUs and GPUs. For this particular operation, you could probably write it in fast assembly language yourself, or with SSE intrinsics or something like that. But for more complicated operations, like softmax and others that might have interesting intermediate values that need computing, being able to just have this code generated for you, instead of requiring that you write all the device-specific and type-specific things, can save a lot of time. Also, Eigen at its core has a very, very fast GEMM, which is the core basic matmul that is inside most of our very expensive kernels. It ends up being a very large asset in making TensorFlow go fast.

So that was it, really. It only took us, what, 20 minutes to get through executing relu. I think TensorFlow can do it a little bit faster than that. But in general, this is what the stack of things looks like as you're executing operations eagerly.

Of course, if you've been following this for a little bit, you should know that we can do better than executing operations eagerly. In TensorFlow we have tf.function, and we have graphs and other things that can get you a lot more performance: instead of going down and up that stack for every single operation, you go down once and execute a lot of operations. This also lets us do optimizations and so on. So how do we run functions? Yes?

[Audience question] You mentioned briefly the async mode. Is that something that is user configurable? Because there's, like, context async.

I don't remember right now if there's a public API to make it user configurable or not, but there's an internal API to make it user configurable. I believe that in enable_eager_execution you could set it, so I think you could set it in v1, but it might not be exposed to our
[inaudible]. Yes, that's correct: I think I know how to expose it in v1 and do not know how to expose it in v2, but there's probably a way, or maybe there's not a way yet because we're still treating it as experimental. I'm not sure. Regardless, the way it is now, I don't think it's something we should rely on in the long run, so I'd rather be able to iterate on it a little longer before we start recommending it as the way for people to get performance.

Also, most of what the fast path does is special-case some types of tensor conversion into fast C code. So if you pass a list of Python floating-point numbers, we can convert that to an eager tensor without hitting any Python code, and that will save you quite some time.

OK, so most of what we saw just now for the case of executing a single op also applies to executing a function. A function itself is an op named PartitionedCall, and the tf.function internals will execute that op just like how we just saw how to execute relu. So the first half, until you get to the KernelAndDevice run bit, is all the same. It's just that the kernel implementation is particularly interesting.

Function calls in TensorFlow look relatively simple. We have inputs, we have outputs, we have attributes that tell TensorFlow what types of inputs we have and what types of outputs we have, and we have a function. There are some things in there that seem a little tricky, like all sorts of configuration we can pass. I actually forgot what the difference is between config and config_proto in there. But essentially this is the big entry point to executing functions in TensorFlow.

If you go look at the kernel of this op, what you'll find is that it mostly just forwards things to the FunctionLibraryRuntime. The FunctionLibraryRuntime is this core bit of TensorFlow that knows about functions. It can do things like Instantiate and Run, pretty much, and also CreateKernel, which you usually do between
Instantiate and Run, since the FunctionLibraryRuntime also knows how to execute operations that are not functions.

So what does Instantiate mean? Instantiate mostly runs all the graph optimizations that we might want to run on that function, to take code that you enjoyed writing and turn it into code that the executor will execute very quickly. Most of this processing happens in the ProcessFunctionLibraryRuntime InstantiateMultiDevice call, where we run all sorts of graph transformations. If you have the TF-XLA bridge happening, it will run the transformations related to the TF-XLA bridge. It will also run the TensorFlow placer.

What the placer does is take a graph, in this case a function graph, that has devices assigned to some of the nodes, and spit out another graph that has devices assigned to all of the nodes. It does this by following a very similar algorithm to the one that I described earlier for individual ops. If you have a resource, we'll place that op next to the resource. Otherwise, if you have specified a device, we'll respect that device; even if you only partially specify the device, we'll respect the partial device specification. And finally, we'll group things by colocation group.

The TensorFlow graph language allows you to specify colocations, even though these have very non-intuitive consequences. By colocating a node A with a node B, you can actually move where node B is placed, because the placer is not aware of the direction of the colocation arrows. It just groups all the colocated nodes into a bag and finds a device that can execute all the ops in there. This can have very fun consequences, like a bug I helped fix a while back. If you try to speed up your distributed system by always colocating the variable initialization code with the remote device that the variable is on, you can accidentally say: please always colocate the initializers of GPU variables on
the GPU, which can be trouble. If some of your initializers have operations that cannot run on the GPU, you have now silently moved your variables to the CPU, which is probably quite a performance degradation. So it's very subtle. In TF 2 we're trying to move away from using colocation constraints inside TensorFlow, and we're definitely moving away from encouraging people to use colocation constraints outside TensorFlow. I'd rather you be more specific about what devices you're using, or even better, use something like a distribution strategy that is aware of all these bugs in colocations and can work around them for you, instead of trying to replicate this functionality yourself.

Once you have placed the graph, we can run the partitioner, which takes a graph that has nodes for many devices and returns many graphs, all of which have nodes on a single device only. To do that, any edge that went from one device to another device gets replaced with a pair of sends and receives. This is also where we run Grappler and all the function inlining and optimization passes that the last training session with Eugene was covering.

So, as I said, this does a lot of heavy lifting, and indeed the PartitionedCall op puts a cache in front of Instantiate to make sure that we don't call it twice for a single function, because otherwise we'd be very slow.

Once we've instantiated the function, we can go to the other main method in the FunctionLibraryRuntime and run it. In general, as you can tell by the name PartitionedCall for the op, our functions can be on multiple devices. At this point, at least, the code is simple enough (this is actually snippeted from the core runtime, even though the core runtime has a lot more error handling going on) that all we have to do to run a function on multiple devices is to just run n functions, each on a single device, and trust that they all
know how to talk to each other to make the sends and receives happen. There's something called a rendezvous, and if you read the TensorFlow code base you will see lots of references to it. It's responsible for making the sends and receives find each other. We have rendezvous that know how to deal with a single host and with multiple hosts. There are lots of tricky and interesting bits in how they relate: what is the correct lifetime of a rendezvous, and how do you shut down a rendezvous once you want to shut down a TensorFlow computation? Maybe some kernels failed, and something is waiting to receive a tensor that it is never going to get, so we probably need to shut that operation down gracefully. There's a lot of cancellation-related logic there. But mostly, at the level of the FunctionLibraryRuntime, you can just run your n functions, one per device, and forget about it.

Running a function on a single device mostly consists of bringing up the TensorFlow executor and calling run on it. You'll see that you have things named RunAsync and done callbacks. In general, we treat all these things as asynchronous so that we can release the calling thread as quickly as we can, so that you can keep running more computation on it. Especially if you have nested function calls, treating these things asynchronously is quite the performance improvement.

Here I could dig into the TensorFlow executor, but that code is fairly complex. It has a simple core algorithm, but it's really hard to pull it out of there. I think the main reason it's hard to pull out is that the executor grew together with the implementation of control flow in TensorFlow, and specifically the details of what we had to do to implement while loops kind of obscured a lot of the core functionality of the executor. It is now aware of frames and a lot of other complicated things, and you have things like multi-dimensional pending counts.
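The rendezvous idea mentioned above, a meeting point where a send and a receive with the same key find each other and where a failure wakes up anyone still waiting, can be sketched as follows. This is a minimal single-process illustration, not TensorFlow's Rendezvous class: the name SketchRendezvous, the string keys, and the (error, value) callback shape are all assumptions made for the example.

```python
import threading


class SketchRendezvous:
    """Hypothetical single-process rendezvous table keyed by edge name."""

    def __init__(self):
        self._lock = threading.Lock()
        self._values = {}    # key -> value sent but not yet received
        self._waiters = {}   # key -> callback registered before the send
        self._error = None   # set once by abort(); fails all later calls

    def send(self, key, value):
        with self._lock:
            if self._error is not None:
                raise RuntimeError(self._error)
            callback = self._waiters.pop(key, None)
            if callback is None:
                # Nobody is waiting yet: park the value for a later recv.
                self._values[key] = value
                return
        # Run the callback outside the lock so it can use the table again.
        callback(None, value)

    def recv_async(self, key, done):
        """Calls done(error, value) now or whenever the value arrives."""
        with self._lock:
            if self._error is not None:
                error, value = self._error, None
            elif key in self._values:
                error, value = None, self._values.pop(key)
            else:
                self._waiters[key] = done
                return
        done(error, value)

    def abort(self, message):
        # A failed kernel means some tensors will never be sent, so every
        # pending receiver is told about the error instead of hanging.
        with self._lock:
            self._error = message
            waiters = list(self._waiters.values())
            self._waiters.clear()
        for done in waiters:
            done(message, None)
```

The receive is callback-based so that the caller never blocks a thread while waiting, which mirrors the done-callback style the talk describes for running functions.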
So I'm not going to snippet that code, but I'll say that if you want to read it, go for it. It's a very interesting, highly asynchronous, highly parallel interpreter. I'll just give you some of the highlights of what's happening in there.

Its input is a large bag of nodes, and there is no output: anything that you want to get out of it, you get out through a send or receive. I mean, technically there are outputs, but by far most outputs in the common case of TensorFlow are handled through sends and receives. The end state for the executor is that either all nodes have executed or an error has happened. Inside TensorFlow's core runtime we have no error recovery, so any error will shut everything down.

This is sometimes unfortunate, because some parts of TensorFlow's higher-level APIs rely on errors on the common path. For example, tf.data raises an error once it reaches the end of a stream, which means that you can't really easily have a single graph that exhausts a single iterator, does some things, and then runs another iterator, because by the time you've exhausted the first iterator an error is raised and TensorFlow will shut everything down. There are ways of interacting with tf.data that do not involve using the IteratorGetNext op, which can fail, and we use those inside AutoGraph to make it easier for you to write code that can recover from these failures. Well, not recover from these failures: it will see no failures when iterating over multiple iterators. It's quite nice. tf.data has all these cool little combinators like take_while and reduce, and you can thread together like three of those to simulate a while loop with breaks.

But anyway, popping the stack here, the core algorithm of the executor is: while there are some nodes that haven't been executed and no errors have happened, execute a viable node, and once that node finishes executing, mark all of its output tensors as ready. There's some bookkeeping in there so that once you mark a tensor as ready, you look at which ops are made executable by marking that tensor as ready, which marks other nodes as viable, and this just recursively applies.

The executor itself is not a single thing. It runs on every thread that is executing an op, as soon as that op finishes executing, and it dispatches all the execution to another thread pool. It's kind of surprising that this is happening, because this means that TensorFlow cannot really be run on a single thread.

There are some interesting, noteworthy things about the executor that you might have guessed from my comments so far, but that require some thinking. One is that the executor is greedy, not lazy. If you're familiar with TensorFlow, it looks very lazy, but it mostly looks lazy because we do very aggressive graph pruning. Once you start putting control flow and function execution and a few other things in the executor, it actually pays off to have a mental model that says first pruning happens and then greedy execution happens. Otherwise you can trick yourself into thinking that some things are not going to be executed when they are in fact going to be executed. My favorite: if you have a conditional, and one of the branches of the conditional depends on a value that is not in the conditional, that value is unconditionally executed even if that branch is never executed. If the executor were lazy, that would not be the case. But the executor being greedy also makes it easier for you to reason about stateful operations, which is very nice given that those exist.

Another thing is that this executor only looks at very local information: the only bit it has for each node is whether it's ready or not. So often there's nothing preventing it from choosing very suboptimal orderings of things to do. If you need to fetch a lot of tensors from a parameter server, the executor is just as likely to fetch the first layer's tensor as it is to fetch the last layer's tensor
because none of these things have any dependencies on them, and it can be quite tricky to teach TensorFlow to actually choose the optimal ordering of things to execute. As I was saying earlier, there's no single executor thread: it just runs on every thread as soon as it finishes executing an op. So it's kind of this highly parallel little monster.

So this is it for most of the core TF runtime. I just had a few topics that I couldn't really fit very well that I wanted to cover in this presentation, just to generate some documentation.

One is host versus device memory. As I hinted at earlier, when you partition a TensorFlow graph, it takes a graph and spits out N graphs, one graph per device. Each graph per device gets its own executor. But how do you deal with the fact that some GPU ops take CPU tensors? We make a distinction: when you specify an input to a kernel, you can say that the kernel expects that input to be in host memory or expects that input to be in device memory. So in fact the executors for the GPU device can be, and most of the time are, running a lot of CPU operations on CPU tensors; we just call those tensors GPU tensors in host memory. If you look at the TensorFlow code, you might sometimes see a distinction between the device a tensor is in, the device its memory is in, and the device its operation is in, and this bookkeeping is necessary to avoid mixing these things up. Incidentally, all resource handles are in host memory.

But this has a very sad, unintuitive consequence that we need to fix, which I call (and I think other people call) the TF int32 problem. Most of the things that GPU ops take in host memory are shape-related. Things like fill, zeros, and random normal all take a shape and fill that shape with whatever values you want. But their shapes are not static; they're often computed based on other shapes. The simplest case is when you just use zeros_like
or something like that, where you take a tensor, take its shape, and use that shape to fill in another tensor. But sometimes you're going to reduce some dimensions in the shape, broadcast, or do some other things. TF has this rule where by default it will place every GPU-capable op on a GPU device, and if you want fine-grained control, you just take a large block of code in TF and wrap it with a tf.device, which also means that every op that happens inside that block of code gets placed on that device. So if we allowed TensorFlow to have GPU kernels for int32 tensors, we would keep bouncing these integer shapes between the GPU and the CPU. You take a shape and you want to slice it to, say, remove the batch dimension: we would copy it to the GPU, remove the batch dimension, then copy it back to the CPU and use it to fill a random normal, and that's just sad. And again, this is really sad because it would create a lot of host-device transfers, and every time you have one you have to sync the GPU stream and you slow everything down.

So what we did instead in TensorFlow to kind of paper over this is to say that for almost every op that has a kernel registered on the GPU and has int32 inputs or outputs, those are force-placed in host memory. That includes things like plus, gather, reductions, and other things that you'd like to use as part of a model. Currently the workaround is to use int64 for types that you actually want executed on the GPU, and to use int32 only for your shapes.

We have a fix forthcoming. We already have code, in both Grappler for graph mode and the eager placer for eager mode, that uses some heuristics and estimates of the cost of transfer versus the cost of computation to try to keep small integer computations on the CPU, where they belong, instead of bouncing them back and forth to the GPU. But there are still some performance regressions that prevent these from being turned on by default. We expect that to happen very
soon. So this is it. If you have any questions...

Could you maybe comment on the difference between the kernel cache and the caching we do for functions in the Python layer?

So the Python layer does caching for functions, going from "here is a Python function, with all this metadata that is completely invisible to the runtime" to a concrete function definition. When you go execute that concrete function definition, we emit a PartitionedCall op or StatefulPartitionedCall op, which then hits TFE_Execute. The first time the op has to execute, there's a lot of initialization that has to happen inside the runtime. A little bit of that initialization is covered by the kernel cache, which mostly tries to figure out what device the op is going to be on and the details of the attributes and things like that. Then, the first time you actually execute it, we have the cache inside the function library runtime, which is the thing that guards Grappler and all the other graph-processing transformations we have to do. So it's a few layers of caches to make things fast. I don't really know how we could possibly merge these caches across these layers of the stack, but maybe if we unify more things this is going to be possible.

And maybe it's also worth commenting a little on whether the kernel cache is something unique to eager execution, or something we'd also have in graph mode?

No, we already had... I mean, we have a slightly different kernel cache in eager execution, but we already needed a kernel cache in graph mode too, because creating a kernel might require allocating memory, which might be expensive. It definitely requires allocating a vtable pointer, but for some kernels you have to allocate a lot more than that.

We just had some cases where there was memory bloat because of the kernel cache. Is there anything that users need to be aware of for that, or is that something the runtime needs to deal with?

Hopefully users should not need to be aware of it.
There are some cases now where you have to be aware of it. If you're using the v1 random number generator ops, the only way to reset the random seed requires resetting the kernel cache, because state is kept in the kernels. As we move to the v2 random number ops, we don't make this type of mistake, and so the space taken overall by the kernel cache should become much smaller. I also think that for the particular case of functions we should be able to garbage collect that cache a little better.

Yeah? Okay, yes. So when XLA is enabled we don't have this shape int32 problem. XLA actually has a different notion of kernel and computation. By giving up the possibility of having any dynamic shapes at all, it effectively only works when you can constant fold all the shape things away, and then it doesn't matter whether you constant folded them on the GPU or on the CPU: they're not there. This is only a problem when you have runtime dynamic shapes.

Most of our runtime is ref counted, not garbage collected. Yeah, there's a class in tensorflow/core/lib/core/refcount.h. It's a base class for a ref-counted pointer, and we use that instead of shared_ptr because it has a smaller memory footprint and better cache locality behavior. So you should be able to just read that, find the places in the runtime that inherit from it, and see the things that are ref counted.

But we currently have no garbage collection for the kernel cache?

The kernel cache is not garbage collected, correct. But almost everything that can be ref counted already is ref counted. A few things, like the kernel cache, are not, because ref counting a cache feels weird. But in some cases, like when you're caching the kernel for a function, it actually makes sense to ref count it.

Can we put a limit on the kernel cache?

In principle we can, yes. It's a, you know, memory versus performance trade-off, assuming we are not dealing with the v1 random number ops, because with those, if you evict them from
the cache, you now change the sequence of random numbers you would get, and that's pretty bad. Okay.
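The trade-off in this last answer can be illustrated with a toy model. The sketch below is an assumption-laden illustration, not TensorFlow code: `StatefulRandomKernel` and `BoundedKernelCache` are hypothetical names, and the cache is a plain LRU. It shows that once a kernel holding its own random state is evicted and recreated, its number sequence starts over.

```python
import random
from collections import OrderedDict


class StatefulRandomKernel:
    """Hypothetical kernel that keeps its RNG state inside the kernel object."""

    def __init__(self, seed):
        self._rng = random.Random(seed)

    def compute(self):
        return self._rng.random()


class BoundedKernelCache:
    """Toy LRU cache of kernel objects (not TensorFlow's kernel cache)."""

    def __init__(self, capacity):
        self._capacity = capacity
        self._kernels = OrderedDict()

    def get_or_create(self, key, factory):
        if key in self._kernels:
            self._kernels.move_to_end(key)  # mark as most recently used
            return self._kernels[key]
        kernel = factory()
        self._kernels[key] = kernel
        if len(self._kernels) > self._capacity:
            self._kernels.popitem(last=False)  # evict least recently used
        return kernel


def make_a():
    return StatefulRandomKernel(seed=42)


def make_b():
    return StatefulRandomKernel(seed=7)


cache = BoundedKernelCache(capacity=1)
first = cache.get_or_create("random_a", make_a).compute()
second = cache.get_or_create("random_a", make_a).compute()  # cached kernel, state advanced
cache.get_or_create("random_b", make_b)  # evicts "random_a"
after_eviction = cache.get_or_create("random_a", make_a).compute()  # fresh kernel

print(second != first)          # the cached kernel moved on to its next value
print(after_eviction == first)  # the recreated kernel restarted its sequence
```

Capping the cache saves memory but silently changes results for kernels that own state, which is why keeping state out of kernels makes eviction safe.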
Original Description
Take an inside look into the TensorFlow team’s own internal training sessions--technical deep dives into TensorFlow by the very people who are building it. On this episode of Inside TensorFlow, Software Engineer Alex Passos discusses the eager execution runtime. Let us know what you think about this presentation in the comments below!
Watch more from Inside TensorFlow Playlist → https://goo.gle/Inside-TensorFlow
Subscribe to the TensorFlow channel → https://goo.gle/TensorFlow