Inside TensorFlow: Resources and Variants
Key Takeaways
This video covers TensorFlow's internal mechanics, specifically resources and variants, with a focus on state management and dynamic C++ types in graphs.
Full Transcript
Hi, my name is Alex, and I'm here to tell you today about resources and variants. Really this is a talk about state in TensorFlow, and stuff that got accidentally represented as state in TensorFlow for far too long.

So what is state? I would love to be able to stand here, or rather sit here, and tell you that an operation is stateful if either executing it has a side effect or its output depends on something other than the value of its inputs. But this is not what TensorFlow means by stateful. Sadly, TensorFlow goes by the Wittgensteinian notion that the meaning of a word is defined by its usage, so state in TensorFlow is defined by this one bit that gets flipped, and that means all sorts of very interesting things. So, for example, this slide is wrong: tf.print is stateful, it has a side effect, yay. tf.data.Dataset.from_tensor_slices has no side effects, because the dataset operations are value types and they're stateless, and yet that kernel is marked as stateful, because one of the effects of marking something as stateful in TensorFlow is that it disables constant folding, and constant folding can be buggy with datasets. Iterators, on the other hand, are stateful. So this might lead you to think that there is some meaning to this, but there are also some things in TensorFlow that could go either way. To differentiate while loops we have stacks, so that when you're doing the forward pass of the loop you push things onto a stack, and when you're doing the backward pass you pop things from the stack, so you can look at intermediate activations and stuff. Those things are stateful in TF v1 but they're stateless in TF v2. Tensor lists, which you can use to aggregate stuff from many iterations of a loop into a single view, or do the reverse, are also stateful in TF v1 and stateless in TF v2.

So how do we represent... [audience question about whether that is because the stateless way was not invented until later] Yes. So I want to spend the rest of this talk talking about how statefulness is represented in TF v1, some
of the problems with that, how we're fixing those problems in TF v2, and how we can deal with state and also with things that are not necessarily easily representable with dense tensors.

So how is statefulness represented? In one of two ways. The most obvious way is that if you go to the TensorFlow source code and find where ops are registered, you'll see this bit, SetIsStateful. The definition of state in TensorFlow is that op defs that have this bit set are stateful, and all sorts of places in the runtime are going to look for that bit and behave differently if the bit is set. People set the bit because they want any one of those behaviors. This is something we need to clean up, and I think we might have a chance to clean it up with the MLIR dialect of TensorFlow, which is going to have finer-grained bits, but until then we're stuck with this one bit that carries too many meanings.

So, among other things, what does this bit mean? It means that TensorFlow will not do constant folding. This includes the two or three separate systems in TensorFlow that do constant folding; all of them know how to bypass stateful operations. Similarly, there are at least two different places in TensorFlow that do common subexpression elimination, and they refuse to do common subexpression elimination of stateful operations. That is very good, because if you were to do that and you had a neural network with many layers, and all of your layers were initialized from a random op, all the layers with the same shape would be initialized with exactly the same random values. Likewise, all prints of an identical string would be collapsed into a single print, because we would not have enough information to tell them apart. But statefulness also means some things that are not very obvious at all, like the fact that the op kernel instances that the runtime uses to represent the computation to run are reused across sessions for stateful ops that have the same name. And there are also a
somewhat long tail of obscure behavior changes, like parallel for behaving slightly differently for stateful operations, and people are known to set the stateful bit for any one of these reasons and more.

The other way of representing state in TF, which we're trying to get rid of in TF v2, is the notion of a ref tensor. Going back to the Variable op, it is this thing here, where you can say that a tensor is either of a dtype or of a ref of that dtype. The reason why we did that is that it's very convenient in many cases to be able to keep information in the runtime that persists across calls to session.run, specifically the variables. If you had to write your code like this, where on every session.run you'd feed your variables and then fetch them back, and you were doing some kind of distributed training, you'd have so many network round trips and so much extra latency that it would be completely impractical. So the idea of the Variable op, which is the thing that motivated the ref tensor, is that it's like a constant, but mutable.

If you dig through the runtime you'll find this piece of code, which I think is the most concise representation I could find of how we represent the distinction between a ref tensor and another tensor. This is what the input to an op kernel looks like, and it's essentially a manually implemented absl-style one-of, where it's either a manually constructed tensor (and the manual constructor is in there just so we don't try to initialize it in case we're not going to need it), or this pair of a pointer to a tensor and a pointer to a mutex. If you've ever programmed in C++ you should be terrified right now, because you see a pointer and you see no comment about who owns this pointer and what the lifetime of that pointer is. A good third of the issues with ref variables come from the fact that it's been impossible, or very hard, to retrofit into the system a coherent
notion of ownership of this pointer that's going to be memory safe. But that's not all. The way ref variables work is that you have a graph that looks like this. You have this Variable node whose output is a tensor that can change, and you can feed it to an operation that mutates it, like Assign, or you can feed it to an operation that does not mutate it, like Identity. If you feed it to an operation that does not mutate it, like Identity, the TensorFlow runtime will silently cast that Tensor* to a Tensor, that is, make another Tensor object that aliases the buffer pointed to by that tensor, and just keep going.

The reason why I like this graph is that it's short and it's simple; if you look at every single gradient update that we use for training, it kind of looks like this. But it's also kind of tricky. We have, I don't know, 20 or 30 people in the room now. Can I get a show of hands: who thinks that the result of the print is the value after the assign? No one. So here is what this graph does: it has an Add that takes as input the Identity of the variable and some constant, and it prints the result, and it has a control dependency from an Assign that mutates the value of the variable to the Add. How many people think this is enough to ensure that Add will see the value of the variable after the assignment? OK, about five or six. How many people think that Add will see the value of the variable before the assignment? About two or three hands. How many people think this is a segmentation fault? No one. And how many people think it depends on things that are not written in the graph? OK, so all of you have been bitten by this, because I got like 15 hands now.

It is completely non-deterministic, and it depends on all sorts of runtime properties. For example, if everything is on the same device and the Assign does not change the shape of the variable, then because of the way we do aliasing inside the TensorFlow executor, print will print the value after the assignment. However, if the Add is on a
different device from the variable, then most likely there will be an RPC, and Add will sometimes see the value after and sometimes see the value before the assignment. There is one case where Add is guaranteed to see the value before the assignment, which is if the assignment changes the shape of the variable. If the assignment changes the shape of the variable, then due to intricate details of the implementation of Tensor and TensorBuffer we do not change the existing tensor buffer; we just allocate a new one, and by then the Identity has already aliased the old tensor buffer. And you can get a segfault here, as you might have guessed, if we're talking about string dtypes, because Add is defined for string dtypes, and if you have two separate threads reading and writing a string in C++ you're very likely to get a segfault or some other weird behavior.

So this is pretty complicated and kind of unworkable in the long term: you need to know about all sorts of things that are not well documented and that rely on specific details of the implementation that are not guaranteed to stay stable. And if you were to try to design something like a compiler for TensorFlow, this would be really hard to make work. So we're not doing this anymore, and I'm going to spend the rest of this talk, hopefully, telling you about how we're fixing this and imposing some order on the situation in TF v2.

The interesting thing is that internally the variables have always been represented (not always, I guess, but since the first open-source release) with the state stored in this ResourceManager object, which has a Create, a Lookup, and a Delete method, and these can return some arbitrary type. We use some RTTI magic to make sure that the code is runtime type safe; we even implement RTTI on compilers that do not have RTTI to make this work. RTTI, sorry, is a C++ thing for run-time type identification. And this is a perfectly reasonable API if you
wanted to represent state that's outside of the graph. So the idea that we had in TF v2 is: let's use this to represent the state, as essentially operations in the graph. There are still some issues with the resource manager: it's scoped to device objects, and device objects have a weird lifetime. Sometimes they outlive a session, sometimes they do not outlive a session, and it's slightly different with eager execution. This can be very surprising in some cases, both when you're doing parameter server training and when you're not, and you accidentally find yourself doing parameter server training and unintentionally sharing parameters between two models that are not supposed to share them. But overall it's a reasonable API.

So what we did is we created a tensor dtype, just like string or int or float, that represents the information you need to look something up in a resource manager, and we call this, creatively, DT_RESOURCE. The reason why this is a tensor is that it's just another value, so you can pipe it through a graph, you can stack things together, you can select them dynamically if you want, or you can just use them statically. It's just a scalar most of the time; you can have a non-scalar tensor of resources, but most of the interesting operations just want scalars. And then you can use this handle to manipulate a resource.

Internally it's, again, just the information you need to make the lookup in the resource manager minimally type safe: a device, a container, a name. The container was this idea that we had originally, that you would be able to run many separate models on the same parameter server and provide some kind of isolation, where you could reset the variables in one model but not the variables in the other model. The way this was implemented made it very hard to use, and I know of very few people who rely on this now, so these days it's mostly just a wart. But otherwise it's a name and some information to validate that you're looking up the object of
the right type, but that's all there is to it. Resources are special-cased in a couple of places in the runtime, not as many as the stateful bit. One of them is that if you create an op that specifically manipulates a resource, that is, it either takes or returns a tensor of resource dtype, we mark it as stateful, because we assume that if you're asking for a key to something in a resource manager you're probably going to monkey around with it. This at least removes the redundancy, because otherwise you'd have all these ops that take resources and modify state in the resource manager not be marked as stateful, and you'd have to wait until they got accidentally constant-folded together to see something break. The second one is that the placer will always colocate operations that manipulate a resource with the device where the resource lives, and this is because you can't really modify a structure that's on another computer without running code on the other computer. But mostly, resource handles are safe in the runtime.

An interesting thing is that now our graph that was very hard to read looks like this. You have this VarHandleOp that represents the resource handle, the key, and you can pass that key to your assignment, you can pass the key to your read operations, and so on. And now I'm pretty sure everybody should agree with me that this graph as written has to return the value of the variable after the assignment; otherwise it's a bug. And this is true: there is no weird non-determinism, and it doesn't matter whether the shape changes or doesn't change, what dtype you're dealing with, or what device things are on. Also, there is no way to make this segfault, I believe. So it's substantially nicer.

There are still some subtle things in here. One of them is ResourceGather. It's an operation where you think, why would I need this? What it does is effectively what a read plus a gather do, but it does it in a single op. And the reason why we have
this is that, if you think about it, if I really wanted to guarantee forever that this graph has the meaning of always reading the variable after the assign, and that if you had flipped that control dependency between read and assign you would always be reading the variable before the assign, you might have to make a copy to ensure that this memory is preserved. And if you have a very large vector of embeddings, making copies of it can be very expensive, and we would like to provide good performance.

So really this resource thing is more a specification of the meaning of a graph that has these operations, and less the specific details of how they're implemented. It's possible to have many valid implementations of this, and they're going to have different performance characteristics. For example, if we lower graphs to XLA for compilation, XLA can actually take a cluster of ops that have a bunch of reads and writes to variables, look at the state of the variables before that cluster, figure out what the state of the variables should be after the cluster, and rewrite it to be a bunch of reads, some stateless computation, and then a bunch of assigns. This correctly preserves the semantics of these operations, and it's a perfectly valid way to do this.

We don't always run XLA, though, and if you start thinking about this there are two relatively straightforward ways you could implement variables, and they have pretty strong performance trade-offs. A very obvious implementation is copy-on-write, where you would copy the buffer for a variable every time you write to it. Another one is copy-on-read, where the read operation is going to copy and the assign operation is just always going to mutate. The interesting thing is that with copy-on-write, if all you're doing is your standard SGD training, where you read a bunch of variables in the beginning, do a bunch of forward and backward computation, and then write to a bunch of variables, you can do this with zero copies, because by the time
you're writing to the variables there are no outstanding reads left. So, yay. Similarly, if you have embeddings and you're sparsely reading a few rows from your variable in arbitrary random order, and then later on you're going to sparsely write to those rows, we can do this with no copies if we have copy-on-read. I mean no extra copies, since the read would have to copy anyway, because it's reading in an unstructured way where we couldn't preserve strides or something like that.

So which one do we choose? Effectively we chose both. We did this by storing a bit on variables and having variables always start in copy-on-write mode, and as soon as you do any sparse operation on a variable we grab an exclusive lock, make any copies that we need, and put it in copy-on-read mode. This works reasonably well for both the "you only use this variable in dense operations" case and the "you only use this variable for embeddings" case. It's not necessarily the best idea in general, so I expect this policy might have to change and become more refined over time. But again, this is just an implementation detail, and it does not affect the correctness of the programs that are running on TensorFlow, so I think it's a big improvement.

[Audience question: I thought that when we read something, it effectively makes a copy; it seems like this pretends to emit a copy?] So the definition of read is that an operation that looks at the output of a read is guaranteed to see the effect of every operation that had an edge pointing to the read, and not see the effect of any operation that had an edge pointing from the read. You can implement this by making a copy on read, you can also implement this by making a copy on write, and you can also implement this in more complicated ways that might never make a copy. That is all that matters. Our default copy-on-write looks at the reference count, and if it's one, just updates in place, and our default read operation just increments
the reference count. Yeah, copy-on-write semantics do that. And I assume we're going to eventually switch to more complicated policies. For example, we could look at the graph and then decide what policy we're going to use to write the variables in this graph, or we could let users configure this; there are many, many options here. But ideally we should be able to implement all of them without requiring that users change the graph structure to get better performance or to get correct behavior. And this is what's important about this, because it means that we get to fundamentally and dramatically change the backend, like use a compiler, and not have to worry about preserving bug compatibility with what happens when you alias the output of Identity on another variable or something like that.

So far I've mostly focused on how the runtime treats variables, but the same fundamental pattern of a handle tensor and operations that read and write to it is used in all sorts of other bits of runtime state in TensorFlow. This includes the dataset iterators, FIFO queues, hash tables, and a few more things that I have forgotten. Mutexes are a resource too, but they also have a variant that represents the mutex lock object, so it's a slightly funnier situation. But as far as the resource part of the mutex is concerned, it's again a mutable resource tensor that has a handle, and it has operations to modify it. So this is nice, and this is essentially what the runtime looks like, and if you have this picture in your head you should be able to mostly predict the behavior of TensorFlow programs that manipulate state.

One other bit of TensorFlow is shape inference. I'm sure if you've looked at TensorFlow op registrations you've seen annotations like this, where we set a shape function. The result of shape inference is not persisted in the graph. It's ephemeral: it's produced every time we create a graph or while
we're importing a graph. But this is very, very useful, not only to ensure that we know how to interpret the graph correctly and that the graph is valid; it is also very helpful during the graph building process, where user code can inspect the inferred shapes of nodes and make different decisions as to whether things can be dynamic or static in the graph. And if all the resources are scalars, this would make it harder to do shape inference on stateful operations that manipulate resources. So we did kind of a hack that should be improved, and added a side channel to the shape inference process, this output_handle_shapes_and_types, that can store an arbitrary list of shape and dtype objects, and different resources and variants are going to assign different semantics to this. Operations like Cast that do not affect the shape just pass the shapes and dtypes through, and then operations that are aware of what the resource handles are doing are going to look at this and assign meaning to it. So variables just store a single shape and dtype for the value of the variable. Tensor lists store a shape and dtype in there for the shape and dtype of the elements in the tensor list. Iterators store the shapes and dtypes of all the tensors that you're going to get when you call IteratorGetNext, so that we can properly do shape inference on those graphs.

So now that you mostly have a reasonable picture of what resources look like in the runtime, I'd like to pop the stack and talk a little bit about the Python side. This is going to mostly focus on variables, because I think there are a few interesting things in there that will again generalize to other bits of the runtime. The first one is that if you've used TensorFlow before, you know that variables act like tensors: you can pass them to operations, you can use the operators on them. Part of the reason is historical. I think the first implementation of Variable in TensorFlow was literally just the return value of the Variable op,
and that happened to be a tensor of reference dtype. Later we felt the need to replace that with a class, so we worked somewhat hard to make that class behave exactly like a tensor. And this is something that library writers downstream from TensorFlow sometimes want: to have their own types that behave like tensors or behave like variables.

So how do you do this? I strongly believe this is all you need. The first thing to do is make your type convertible to a tensor. There is tf.register_tensor_conversion_function, which takes a type and a function to convert that type to a tensor. In the case of a variable, it just reads the value of the variable. Easy. There are some special cases in there to deal with reference types that are no longer needed, thankfully. Another thing that you need to do is register your type as a dense-tensor-like type, which means that implicit stacking, by just putting many instances of that type in a list, will work by silently reading them and then calling stack. Then you need to overload all operators, and if you look, there's this method, overload all operators, in the Variable class, with an implementation that steals all the operator overloads from Tensor.

And there's a rule in TensorFlow that session.run is not allowed to add nodes to the graph. This can catch all sorts of terrifying bugs, so it's great that we have this rule. But if you want to be able to fetch the value of a thing, then you need to implement this _as_graph_element method, which session.run pokes to see if it is there, and which is supposed to return a pre-existing tensor. So variables have to record a tensor that is going to be the result of reading them and store it in there, so that you can use session.run to fetch them.

There's also one more tricky bit about the Python implementation of variables that you might need to know, which is that in ref variables, because they can just
convert to ref tensors, the following works: you can take the return value of an assignment operation and call another assignment operation on it, and do it as many times as you want, because assignment operations chain. In resource variables, clearly, the assignment operations don't have a return value. If you were to return something like the handle, it's useless; it's the same as the input, so there's no point in returning that. If we were to return the value of reading the variable, that's an operation that might potentially be very expensive, and you'd like to not read it unless you're going to need to read it. So we added this notion of an unread variable, which is a class where, if you have a control dependency on it, it just has a control dependency on an assignment operation, but if you try to read its value, it's going to read the value after the assignment operation. And because this acts like a variable, we can use this to make chained assignment work, and a few other things. So if you see unread variables in your graph, you should know this is what you're dealing with.

But if you've been paying attention, you've seen that the core set of operations for a variable does not self-initialize, and this is by design. A lot of the early use cases of TensorFlow were optimized for shared parameter server training, and in that case, when you have multiple parameter servers and multiple workers all talking to each other, you might want to initialize variables from scratch, you might want to load them from a checkpoint, and depending on your training policies you might want to do different things. So the graph is agnostic as to how you do those things, the runtime is agnostic as to how you do those things, and the code driving session.run gets to set the policy. This is very important, because we had to change the policy many, many times until we finally made it mostly bug-free in Estimator. But as we're no longer necessarily saying that the default
way to use TensorFlow is shared parameter server training, we went for ergonomics over safety. So in TF v2, mostly, variables are initialized on creation. In eager execution this is very easy to do, because as soon as you execute the ops that create a variable, we initialize it for you. In tf.function it can be a little trickier, because the initializer for a variable might be defined inside a function, and there are a few ways to handle this; I'm going to go into detail on this in the tf.function talk.

Similarly, variable sharing is a complicated issue. If you're doing shared parameter server training, you'd like all the workers that connect to the same parameter server to see the same variable, so they can see each other's writes to those variables. The way we did this was to say that variables are shared by name, so in TF v1 variable names are load-bearing: if you change or edit or modify the names of variables, you dramatically change the behavior of the program. This is a questionable decision in all cases, because variable names look very harmless when you read code. So in TF v2 we chose to make names non-load-bearing. Internally we're still using runtime code that assumes a load-bearing name, but we always use a UID to hide the effect. And if you want to have shared names for parameter server training, you can, because you can control the details in the runtime, but the Python API no longer makes that straightforward.

Now you might be asking: how would I be able to change how the details of variables are implemented? Another thing that we're adding in TF v2 is this notion of a variable creator, which lets you control how variables are created. tf.Variable has a metaclass, so that when you call tf.Variable you might not actually get an instance of Variable; you might get an instance of some subclass of Variable that defines some specific behaviors. In TF v1 by default you get RefVariable; in TF v2 by default you get ResourceVariable;
but in other contexts you might get other instances. The metaclass code itself is not particularly interesting; you should just know this exists if you're dealing with variables in Python. One instance is that tf.function uses its own subclass of variables that behaves slightly differently from the v1 graph resource variables when it comes to initialization, so that it can capture initializers and things like that. It's nice that we can keep that code encapsulated within the tf.function package and not push its complexity out to the same variable class that is used everywhere. Similarly, tf.distribute might need to create per-replica variables or mirrored variables with complicated read and write modes, and that complexity can be mostly centralized in the tf.distribute package instead of being spread out all over TensorFlow. So when you're inside a distribution strategy scope and you create a variable, your distribution strategy is probably setting up a variable creator that's going to do the right thing for you. This is very important on TPUs and in mirrored strategy and so on, so it's good that we have this kind of flexibility.

But just like creation is configurable, deletion can be a little tricky. A nice side effect of having load-bearing names for variables in TF v1 is that it encouraged you to have very few of them and to think very carefully about what each of them was called. So the set of variables throughout the lifetime of a TensorFlow program was mostly fixed, which meant that deleting variables was mostly not a big deal, and we could get away with very broad, big hammers to delete variables, like session.reset. But in TF v2 it's very easy, with eager execution and with functions, to create a lot of variables, and you can create temporary variables, so we do need to clean up after ourselves or we're going to have memory leaks. And you'd think that since this is Python you should be able to just override __del__ to get variables
to clean up after themselves, but it's not that simple. It turns out that if you override __del__ on an object and that object becomes part of a reference cycle (and if you've ever looked at the implementation of tf.Variable you'll see it has tens of members, so any one of them could point to something that could point to something that could point back to that variable), then, if anything with a __del__ is part of a reference cycle, that entire cycle becomes uncollectible and we have leaked that memory forever.

However, there is an easy workaround. If you make an object that is guaranteed to have only one or two data members that cannot possibly be part of a reference cycle, you can override __del__ on that object, and then take the object that's complicated and might be part of a cycle and store a pointer from that expensive object to the small, cheap object that knows how to do the cleanup. This does not make the cycle uncollectible, and it still guarantees that the cleanup happens when the first object goes out of scope. Now the worst that can happen is that a reference cycle means that your garbage collection is not immediate; it's just delayed until whenever the Python garbage collector decides to run. But that still guarantees correctness and a lack of leaks, even though it might be a little surprising that if you use sufficiently complicated objects your GPU memory might take a while to be freed, and you might need to use Python's gc module to force it to clean up after itself. This pattern of making a deleter object is used everywhere in the TensorFlow code base where we have resources and need to override __del__, just to ensure that we have orderly cleanup.

So that's essentially all you need to know about resources to effectively use them in TensorFlow, and now I would like to move on to talk about variants. I put those two things together because for the longest time there was a conflation of views between resources and variants, because resources were like the
easiest way to just hook arbitrary C++ code inside the TensorFlow runtime. But it turned out that a lot of the things that we were using resources to do were better served not by arbitrary C++ code but by stateless operations on immutable values. And why would you want that? Mostly because stateless things on immutable values are much easier to compile, and they're also much easier to differentiate through, and differentiation is something we really care about. So [name unclear] had the idea of making a separate dtype, variant, for immutable arbitrary C++ stuff. Its implementation is very similar to something like absl::any and other dynamic types in C++, with a few bells and whistles to integrate better in the TF ecosystem.

So a canonical example of variants is the tensor list ops, which are used under the hood to implement stacks in TensorFlow v2 and tensor arrays, and they were also one of the original motivating factors. They look like this: you can have an op that makes an empty tensor list; then you can have another op that takes a list and a value and spits out a new list that represents the concatenation of those things; and then you have an op that takes a list and spits out a slightly shorter list and the value removed from the list. And you can inspect those values and manipulate them. The fun thing about these is that because these are all immutable, you can easily define their gradients. If you think about it, the gradient of push is pop, the gradient of pop is push, the gradient of set item is get item; it all mirrors very nicely. So you get code that is efficiently differentiable up to higher orders. And internally, the tensor list structure can be very simple: it's just a std::vector of tensors and some metadata about shapes and dtypes. We need these methods, Encode and Decode, so that we can serialize and deserialize lists in case you need to send them across devices, though specific variants can choose
to not implement those methods and throw errors instead.

And if you've been following this, and you saw the previous slide where I had a std::vector, and you saw the slide before that where the ops take one list and return a new one, you might have been terrified that this had automatically made every single recurrent neural network O(N^2). But the TensorFlow runtime has this nice optimization where a kernel is allowed to ask the runtime if anyone else is ever going to use one of its input tensors again, and if the answer to that question is no, the kernel can go and mutate that tensor. So this incidentally is how tensor lists work. In the normal use cases, like when you're using them for stacks, after you push something to a stack there are no more references outstanding to the previous value of the stack, so we can just reuse its memory and append, and get exactly the same O(N) performance that you would expect to get from the stateful version. However, we're doing this with stateless operations, so we get to differentiate through this code. And if you do end up holding an extra reference to something that you want to mutate, or apply a mutating op to later, the system will silently do a copy behind you to ensure the correct behavior.

And this is also good because we again managed to decouple the behavior from the implementation. So we can take operations that have exactly this meaning, give them to a compiler, and the compiler might be able to elide the copy if it can prove that it happens at some point in time, or use a different internal representation for these tensors. Yes, this copy is just the copy of a vector of tensors; the tensor buffers themselves never need to be copied, because that's a separate level. But again, even if you just copy the vector of tensors, you can still see that show up in some profiles.

So one more thing you need to do if you want to define your own variant dtype and have it work seamlessly
with automatic differentiation is that you need to tell TensorFlow how to add two of these and how to make a zeros-like, because these are operations that autodiff needs to do all the time. It's not obvious why autodiff needs to make zeros, and I'm happy to talk about this some other time. It has something to do with differentiating operations that have multiple outputs, and doing that in a single bit of code that doesn't have to be aware that some of those outputs might not have been used, so they do not have upstream gradients.

So essentially this is it. This should be all you need to know to understand how state and how arbitrary C++ stuff is represented in TensorFlow. There are many other variant types other than the tensor list; that is just one. It was one of the first ones, and it's one that showcases all the little bits in there, which is why I chose to talk about it. Similarly, there are many other resource dtypes other than the variable one, but variables are by far the most complicated, so if you understand how that works you should understand all the others. Happy to take questions now. But if you're watching this and you're not in this room, you can email your questions to developers@tensorflow.org, where we have discussions about TensorFlow internals.

[Audience question, partly inaudible, about having a single type for resources and for variants.]

Yeah, that's a questionable decision. I think it mostly comes from the fact that originally TensorFlow did not make it very easy to add new dtypes. There are all sorts of enumerations and specializations that have to happen on a per-type basis, so having a hook that lets you easily add a type without any changes to the runtime was considered important. I don't necessarily think that this is the end state, and maybe at some point in the future we should stop representing these as variants and start representing them as a list dtype, which would allow the runtime to specialize to them in a better way. But in the full
case, where a dtype becomes a string instead of an enum, we'd have to stop having switches on dtypes everywhere in our code base. But it might make sense to add list as one of the enums. But again, each dtype addition would dramatically increase the TensorFlow binary size, because we need to register all sorts of kernels for all dtypes even if we don't have to; there are a few [inaudible] unintentional side effects today. And, you know, it makes sense to specialize to a small set of things like fast dense buffers of numbers, because that's most of our expensive computation.

[Audience question] What have been some of the common pitfalls that you've seen, where people had buggy or racy code initially?

Oh, many, many, many, many, many bugs. Which one should I go for? Yeah, so the one that most people see is, if you go back to this guy, yeah, _as_graph_element. One unintended consequence of this is that because session.run is not allowed to create a new operation in the graph, reading the variable has to pre-create a tensor to read its value. Which means that if you fetch a variable on the same session.run step as you have done some mutating operation, and the variable is a resource, you're guaranteed to see the value before the mutation; the read will happen before the assign, just because they don't have any control dependencies. No, you're not guaranteed to see that; you're almost always going to, because they don't have any control dependencies either way, so you get non-deterministic behavior, but the read is cheap and runs fast and has no dependencies, while usually you have to compute something to get to the assign. Whereas with ref variables, because of that aliasing behavior, you are fairly likely to see the value after the assignment, under the assumption that everything was on the same device and stuff. So you have all sorts of unit tests that people write that get confusing. This, I think, is the only regression we've had.

If you look at bugs from ref variables, there are many. You see them in sanitizers, like
thread sanitizer and other sanitizers, which fire on the TensorFlow runtime, often due to those race conditions involving variables. My favorite one is the combination of control flow v1 and ref variables, because control flow v1 on conditionals doesn't create a single predicate; it creates many switch operations. And if the input to those switch operations is a reference variable, and one of the branches assigns to the variable, then half of your switch operations are going to execute one branch and the other half are going to execute the other branch of the conditional. And with dead tensor propagation this can lead to bizarre undefined behaviors. This is a very fun one.

Another problem is with optimizations that you can apply. For example, Grappler likes to rewrite things like tensor plus zero to tensor, because sometimes that zero might have been added there by some complicated graph that Grappler just finished constant folding and proved to be a zero. And due to the implementation details of ref variables, plus is guaranteed to copy a variable. So if you wanted to copy a variable so that you could have its value before a write, to compare it with the value after a write and find out by how much it changed, and Grappler rewrites your plus zero to just keep the value of the variable, now your program has been broken. So you have all these very subtle interactions between things that you would think are harmless. So you see a few hacky patterns in TensorFlow a lot. You see people putting lots of identity tensors on different devices to force a send and receive, to force a copy.

You also have this fun fact in the gradient code, where if you're backpropping through a deep neural network, once you've computed the gradient with respect to a variable, you can do two things: you can update the value of the variable, or you can compute the gradient with respect to the input layer, where computing the gradient
with respect to the input layer is a matrix multiplication or transposed convolution between the value of the variable and the upstream gradients. So if you've already mutated the value of the variable, you're now computing the wrong gradients. So this leaked into the gradient code, which has a gate_gradients argument due to ref variables, so that it protects the backprop from being affected by assignments to the variables. This has side effects, which means things like we lose performance: for the topmost layer of a neural network, you only need to compute the gradient with respect to the variables, not with respect to the inputs, but because of the gating code we have to force the computation with respect to the inputs, so that you can guarantee that if there was a layer before it, we wouldn't have updated that variable before we had seen the gradient with respect to those inputs. It also does not allow us to overlap variable updates very well with the gradient computation.

I can keep going. There's a lot of code in the send/receive scheduling that tries to prevent complicated deadlocks that can happen when you have several assignments, and also complicated cases where you end up not sending the value of the variable that you thought you were sending. Yeah, there's a lot of code that we will be able to delete if we no longer have to support ref variables, and a lot of this code is complicated and buggy and very hard to maintain.

[Audience question, partly inaudible] Do you know anybody who is actually relying on the ref variable behavior?

Yes, there are lots. That issue that I talked about, the plus one: there's this thing called the global step in Estimator that is incremented on every training step and is read on every training step, and every Estimator user has a bunch of hooks that rely on checking the value of the global step after every training step. So everybody who's doing Estimator training in the single-machine case is effectively relying
on the fact that they can read the global step after a write by just separately fetching it. And they don't necessarily care if the value is out of sync between the different reads. The ref behavior ends up being that they get the value after the assignment: because it's an int variable, it gets force-placed on the CPU, and it has all these silent requirements that conspire to allow people to rely on this. So our own code relies on this quite a bit. In practice it's not a big deal, because most Estimator users are doing distributed training, and when you do distributed training your variables end up on other devices and you no longer have this guarantee that you will always read exactly the value after the assignment. So all the hooks have to be robust to not reading that. But the unit tests for the hooks rely on the fact that they all run on the same device. That is a big one.

I have seen some cases where you might rely on the fact that you can do both snapshots and sparse writes to a variable efficiently in the ref variable case, with race conditions. If you're implementing some neural computational memory thingy, you might want that behavior, and that's one of the cases where I think we might need to just implement a separate policy for how to do variables to make it work.

Because we can specialize the runtime to the other dtypes and make it faster. If you have variants, you have to do more runtime dynamism to figure out, like, you're going to add two floats: now you have to check at runtime that they're floats, and you need to rewrap them into the variant thing, which has one extra pointer dereference. Which is incidentally one of the reasons why you want to move this out of variants, so we can get better type checking for them, so that you don't accidentally add a list and a mutex together or something like that. It would be very interesting if we could extend the notion of type in the TensorFlow graph to include a little more information than just an enum. I think
that's a separate project, and I don't know if anyone is working on it. [inaudible] And incidentally, we also don't have very good coverage of our types. As of the recording of this talk, a lot of the uint types only work inside XLA, and they sometimes work outside of XLA if you go for the ops that don't actually look at the types, like identity. But if you try to do operations on them, most of the kernels are not registered, and there's no good reason for that other than binary size and legacy. It's just that you end up with lots of holes when you add more dtypes, and those take some time to patch.

OK, I think this is about as much time as we have, so thank you very much. [Applause]
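
The deleter-object pattern described in the talk can be sketched in plain Python. This is only an illustration under stated assumptions, not TensorFlow's actual code: FakeRuntime, ResourceDeleter, BigVariable, and demo are hypothetical names, and a dictionary stands in for device memory. One caveat worth knowing: since Python 3.4 (PEP 442) the interpreter can collect reference cycles that contain objects with __del__, so the permanent leak described in the talk applies to older interpreters. The pattern still keeps the cleanup logic in a tiny object that never points back at its owner.

```python
import gc


class FakeRuntime:
    """Hypothetical stand-in for a runtime that owns named resources."""

    def __init__(self):
        self.live = {}

    def create(self, handle, value):
        self.live[handle] = value

    def destroy(self, handle):
        self.live.pop(handle, None)


class ResourceDeleter:
    """Small object whose only job is cleanup.

    It stores just a runtime and a string handle, so it cannot point back
    at the object that owns it and cannot be the link that closes a cycle.
    """

    def __init__(self, runtime, handle):
        self._runtime = runtime
        self._handle = handle

    def __del__(self):
        self._runtime.destroy(self._handle)


class BigVariable:
    """Complicated object with many members; it defines no __del__ itself."""

    def __init__(self, runtime, handle, value):
        runtime.create(handle, value)
        self.handle = handle
        self._deleter = ResourceDeleter(runtime, handle)
        self.self_ref = self  # deliberately create a reference cycle


def demo():
    runtime = FakeRuntime()
    var = BigVariable(runtime, "v0", 1.0)
    alive_before = "v0" in runtime.live
    del var       # the cycle keeps the object alive until the collector runs
    gc.collect()  # force a collection, as the talk suggests
    alive_after = "v0" in runtime.live
    return alive_before, alive_after
```

Calling demo() on CPython 3.4 or later should return (True, False): the resource exists while the variable is alive and is released once the garbage collector breaks the cycle and the deleter's __del__ runs. Without the self_ref cycle, the cleanup would happen as soon as the last reference to the variable is dropped.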
Original Description
Take an inside look into the TensorFlow team’s own internal training sessions--technical deep dives into TensorFlow by the very people who are building it!
This week we take a look into resources and variants with Alexandre Passos, a Software Engineer on the TensorFlow team. This training session goes over how state is managed in TensorFlow, and how dynamic C++ types are supported in graphs. We explore the stateful bit, ref edges, dt_resource tensors, resource variables, tensor lists, and variant types.
Let us know what you think about this presentation in the comments below!
TensorFlow on GitHub → https://goo.gle/2HpX3V5
Watch more from Inside TensorFlow Playlist → https://bit.ly/2JBXFtt
Subscribe to the TensorFlow channel → https://bit.ly/TensorFlow1