Inside TensorFlow: Parameter server training

TensorFlow · Beginner ·🛠️ AI Tools & Apps ·5y ago


Key Takeaways

The video explains parameter server training, a data-parallel method for scaling model training across multiple machines, as implemented in TensorFlow 2.

Full Transcript

Hello everyone, my name is Yuefeng. In this talk, Haoyu and I will talk about parameter server training in TensorFlow 2, on behalf of the multiple developers working on it. This talk will cover a parameter server training and evaluation overview, the single-client architecture for TensorFlow 2 parameter server training, the schedule/join APIs for the single-client architecture, inline evaluation, and variable sharding. After my part, Haoyu will talk more about how parameter server training is implemented from the runtime's perspective, how worker fault tolerance works in a single-client architecture, and how we tuned performance on a synthetic recommendation model.

First, what is parameter server training? In short, it is a common data-parallel approach to scale up model training on multiple machines. In this approach, variables are stored on parameter servers and shared by all workers, and most of the computation happens on workers. In each training step, workers pull the latest variable values from parameter servers, run the forward and backward passes with their own training data, and send gradients back to the parameter servers. Each worker completes a step independently and updates variables on parameter servers independently, which is why it is also called asynchronous training. It is scalable because the workers don't have to wait for each other, and for the same reason it can tolerate some worker failures.

But there's a stale gradient problem. So what is the stale gradient problem? Let's say one worker pulls variables at time A and sends back gradients at time B. Between time A and time B, the variables may already have been updated by other workers, and if we have more workers, we're going to have more stale gradients. The stale gradient problem may hurt model quality, and different models may have different sensitivity to it. One piece of ongoing research worth mentioning in this talk is the adaptive learning rate method. Its main idea is to discount the learning rate based on the staleness of gradients, and the goal is
to get close to synchronous training. This diagram shows some preliminary results on ResNet/CIFAR with 50 workers. The x-axis is the number of epochs and the y-axis is the accuracy. The blue curve represents the standard asynchronous training algorithm, while the yellow curve and the red curve represent the adaptive learning rate algorithm. We can see that training with this algorithm has a smoother accuracy curve and converges to state-of-the-art accuracy much faster in terms of the number of epochs.

On the other hand, synchronous parameter server training is also possible, and it can overcome the stale gradient problem. In TensorFlow 1, we implemented synchronous parameter server training using SyncReplicasOptimizer. It creates one conditional accumulator per variable, and it will not aggregate and apply gradients before it has accumulated K gradients. This optimizer uses a queue as a barrier to coordinate the workers, but it may hang in the case of task preemption. One nice thing about this optimizer is that it can have some backup workers, so it can tolerate some worker failures, but redundant gradients from the same step will be discarded. Overall, it's very complicated to implement such a synchronous parameter server training algorithm in TensorFlow 1.

For the evaluation part, in TensorFlow 1 parameter server training we use a dedicated machine to do evaluation. This dedicated machine keeps checking for checkpoints written by the chief worker, loads the latest checkpoint, and runs evaluation against that checkpoint. But multi-worker evaluation is a top ask from our users. The problems with this evaluator are that it cannot hold large variables, so it cannot hold large models; it's slow with a large evaluation dataset; and it's also slow at loading large checkpoints. It's difficult to use the evaluation results for early stopping on the training side, and it relies heavily on checkpoints: it's a waste of resources if the chief does not checkpoint often, and it may miss checkpoints if
the chief checkpoints too often. The evaluator may also need some special casing for deployment, which has brought us some trouble in the past. So in TensorFlow 2, we use a different architecture to do parameter server training.

Let's first talk about the multi-client setup in TensorFlow 1. In TensorFlow 1, we use a multi-client setup where each worker runs a copy of the user program and coordinates its own training. Each worker also creates variables, and there's some magic happening on parameter servers to map the variables created by different workers to the same underlying storage.

There are several problems with this multi-client setup. First, it's difficult to coordinate the workers. As we have seen, it's complicated to implement synchronous parameter server training and difficult to make it correct, and it's also difficult to do early stopping for training based on evaluation results. Next, it's difficult to achieve consensus among workers because there is no single source of truth. Let's take TF1 parameter server training as an example. If two workers create the same variable on different parameter servers, that's a problem: what we want is to create the same variable on the same parameter server so that all workers can share it. With this issue, there are probably no errors reported to users, but users may notice that their model cannot converge to good accuracy, and it's very difficult to track the issue down to the inconsistency between workers. The multi-client setup also has a less intuitive programming model, and beginners may have a hard time understanding this MPI-like programming model. It also makes deployment and testing a bit less flexible: we cannot run it in Colab, and there is not even testing for distributed Estimator in TensorFlow 1.
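To make the consensus problem above concrete, here is a minimal Python sketch. It is a toy model rather than TensorFlow code, and it assumes (this is not stated in the talk) that each client places variables on parameter servers round-robin, in the order its own program happens to create them. The `place_variables` helper and the variable names are hypothetical.

```python
# Toy model of multi-client variable placement (not TensorFlow code).
# Assumption: every client assigns variables to parameter servers round-robin,
# following the order in which its own copy of the program creates them.

def place_variables(creation_order, num_ps):
    """Return {variable_name: ps_index} for one client's creation order."""
    return {name: i % num_ps for i, name in enumerate(creation_order)}

# Two workers run "the same" program but create two variables in a different order.
worker0 = place_variables(["embedding", "dense_kernel", "dense_bias"], num_ps=2)
worker1 = place_variables(["dense_kernel", "embedding", "dense_bias"], num_ps=2)

# Variables that the two workers believe live on different parameter servers.
mismatched = sorted(name for name in worker0 if worker0[name] != worker1[name])
print(mismatched)  # ['dense_kernel', 'embedding']
```

Neither client raises an error in this sketch, yet each one would read and update its own copy of two variables, which matches the silent loss of accuracy described above.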
If you have experience with MultiWorkerMirroredStrategy, which is multi-client as well, you have probably run into hanging issues in the past. Most of those issues are related to inconsistency between workers. For example, different workers may create collective ops in different orders, leading the cluster to hang.

In TensorFlow 2, we recommend a single-client setup, where workers and parameter servers run TensorFlow servers. They just sit there and wait for requests from the coordinator. The coordinator runs a copy of the user program. It creates resources on parameter servers and workers, dispatches training work to those workers, coordinates the training, and deals with task failures of workers and parameter servers.

There are several benefits of the single-client setup. First, it is easy to coordinate workers. As we have seen, it's complicated to implement synchronous parameter server training in TensorFlow 1, but we can implement an approximately synchronous parameter server training algorithm with the APIs that I will introduce later. We can also easily implement early stopping with inline evaluation, which I will introduce as well. It is easy to have worker consensus since there's only one source of truth, and with that we can avoid a lot of bugs caused by inconsistencies between workers. The single-client setup also allows a more intuitive programming model, which I will talk about in detail in a moment. It's also easy and flexible to deploy a single-client cluster, because we can create a bunch of TensorFlow servers in advance on Borg and connect to them from a laptop or from Colab. We can also run the whole cluster within a single process, which makes testing very easy.

There are also problems with the single-client setup, which I will not avoid talking about. The coordinator may become a bottleneck when scaling up the number of workers, but this is probably not a big deal for parameter server training, because we usually see that when we
scale up the number of workers, parameter servers become a bottleneck first. If you think about it, a variable is also required by all workers in a single step, which is a very similar pattern to the coordinator. There are also multiple ways to mitigate it; for example, you can have multiple in-flight requests to mitigate the latency issue. The coordinator will become a single point of failure, but training cannot continue with a parameter server going down either, and it's not impossible to mitigate. Another limitation is that TensorFlow currently only supports remote tf.function: it cannot ship an arbitrary Python function to remote workers.

So let's take a look at the APIs for the single-client architecture. The most important APIs are the schedule/join primitives. They provide a thread-pool-like programming model. The schedule method schedules a function to one available worker in a pool. It actually just puts the function in a queue and returns a future-like object immediately. The join method blocks until all scheduled functions are done. The schedule/join APIs make it very easy to implement load balancing between workers, fault tolerance, and even dynamic scaling in the future.

Let's take fault tolerance as an example. If one worker gets preempted, the coordinator will notice it by seeing an interrupted function. The coordinator will then put this function back in the queue and retry it on another available worker. The coordinator will wait for the recovery of this worker in a separate thread, and once the worker is back, the coordinator will rebuild the resources on this worker and continue dispatching functions to it. Because of this fault tolerance mechanism, function execution has an at-least-once guarantee, meaning that a scheduled function will be executed at least once.

The schedule/join APIs are contained in the ClusterCoordinator class. It has to work in conjunction with a tf.distribute strategy at this moment. The
tf.distribute strategy defines the training loop, and it tells the ClusterCoordinator what the requirements are for scheduling and for fault tolerance. The ClusterCoordinator also has a method to create resources for each worker: the create_per_worker_dataset method creates one dataset per worker. What is more important is that the coordinator will rebuild the resources on workers when workers recover from preemption. The coordinator just invokes the dataset function again to recreate the datasets and recover the workers.

One thing to note here is that, with this kind of implementation, the coordinator doesn't provide visitation guarantees on datasets. This is because the schedule method assumes that the workers are equivalent, so that it can schedule to any available worker, and as a result each worker may run an indeterminate number of steps. Actually, the create_per_worker_dataset method creates the same dataset on different workers, except that they may be shuffled differently on different workers. Estimator doesn't provide visitation guarantees for training datasets either, so in the short term this is probably not a huge problem.

Let's see how the APIs are used. Here's a short example of a custom training loop with MirroredStrategy. We first create a strategy, create all of the model objects under the strategy scope, create a dataset using the strategy method, and create a step function, which spans multiple replicas. Then we invoke the step function multiple times.

Now here is the training loop with ParameterServerStrategy and a ClusterCoordinator. We tried to keep the changes minimal for users when they switch between strategies. We still use a distribution strategy to define the training loop, and we use a ClusterCoordinator to dispatch the training work to remote workers. As highlighted here, we first create a strategy, then create a ClusterCoordinator and pass in the strategy object. Then we create one dataset per worker using the create_per_worker_dataset method. We define the training loop in the same way as with other strategies. After that, we use the coordinator to dispatch the step function to remote workers. We call join at epoch boundaries so that we can checkpoint, write summaries, or print metric values.

There are limitations with the current implementation of the API. There's no visitation guarantee for the training dataset, as I mentioned. Another issue is that a parameter server preemption will lead to an unavailable error on the coordinator. The coordinator will catch this error and exit with a special error code to signal Borg to restart the coordinator without counting it toward task failures. After the restart, the coordinator will connect to the workers and the parameter servers again, create resources, and continue dispatching functions. There may also be performance issues. Since our initial implementation is Python-based and we create a lot of threads in Python, there may be threading overhead and GIL overhead, so we are trying to reduce this overhead by moving the implementation into the runtime. Furthermore, if the latency between the coordinator and the workers is high, we need to mitigate it by issuing multiple steps in a single RPC call. In later slides, Haoyu will introduce ways to achieve optimal performance even with this limitation. The last limitation is that it can only dispatch tf.function, which we may or may not relax in the future.

Let's look at evaluation for TensorFlow 2 parameter server training. We can still use sidecar evaluation, which we have seen in TensorFlow 1 parameter server training, where we use a dedicated machine to do evaluation. Inline evaluation is made possible by our single-client setup. Inline evaluation refers to running evaluation on the coordinator or on the same worker pool, so we can alternate between training and evaluation. It's very consistent with the Keras APIs. There are several benefits of inline evaluation: we don't need a special evaluator job, we are able to
use evaluation results for early stopping of training, we can evaluate large models because we have multiple machines and multiple parameter servers, and it's faster to evaluate with large datasets, also because we have multiple machines.

There are some limitations with inline evaluation if we use schedule/join to implement multi-worker evaluation, but this is not a problem if we just run the evaluation on the coordinator. The problem with using schedule/join for multi-worker inline evaluation is that there's no visitation guarantee, as I mentioned before, but a visitation guarantee may be very desirable for evaluation. The possible solution here is to either use tf.data service, or create many virtual shards from the evaluation dataset and schedule an entire virtual shard to a remote worker. Another limitation is that schedule/join only provides an at-least-once guarantee, but we would like to have an exactly-once guarantee for evaluation: we would like to evaluate each evaluation example only once. The possible solution is to gather the evaluation results from the return values of functions, and to only schedule functions that don't have any side effects.

Last but not least is variable sharding. Variable sharding refers to splitting a variable into multiple shards and putting them on different parameter servers. Variable sharding is important for load balancing between workers, to achieve a larger effective bandwidth for the training cluster. Variables that are sharded become ShardedVariables, which is just a container holding all shards of a variable. A ShardedVariable can work efficiently with tf.nn.embedding_lookup: it does the embedding lookup on parameter servers in parallel, and it only needs to transfer embedding slices back to workers. The gradient update can also happen in parallel, and only relevant slices are touched. A ShardedVariable can be converted to a tensor by concatenation, so that it can work with most of
TensorFlow libraries seamlessly. There are also other forms of model parallelism related to variable sharding that don't require concatenating all of the shards. For example, we can do the matrix multiplication on parameter servers and then concatenate the results on workers, as opposed to concatenating all the shards on workers first and then doing the matrix multiplication.

Here's some ongoing and future work; the list is not exhaustive. The first item is to integrate parameter server training with Keras compile/fit. We'd like to support handling parameter server failure without restarting the coordinator. We would like to have a packed representation for sharded variables. We would like to improve the performance of the coordinator. We'd like to provide a canonical workflow for inline evaluation if this becomes a popular feature request. We would also like to support synchronous parameter server training and multiple GPUs per worker. What is more important is that we want your help and feedback if you are interested in any of these features. Next, I'm going to hand it over to Haoyu, who is going to talk about how parameter server training is implemented from the runtime's perspective, its performance, and its scalability.

All right, thank you, Yuefeng. I'm Haoyu from the TensorFlow runtime team. Next, I'll be talking about the PS strategy's implementation inside the TensorFlow eager runtime: basically, how it partitions the model across the workers and parameter servers and uses distributed multi-device functions to drive execution. I will then show some experimental results on performance in large-scale training compared with the TensorFlow v1 Estimator. We will also see some of the problems we've encountered, and I will briefly talk about the solutions to those problems. One major advantage of parameter server training is support for fault tolerance, so that jobs can run on preemptible resources at large scale, so I will show some experimental results for
running the jobs on preemptible clusters. Finally, I will briefly talk about the multi-worker testbed we set up to run performance, convergence, and fault tolerance tests, and its integration with the ML Compass framework.

First, let's look at how PS training interacts with the eager runtime. As Yuefeng previously mentioned, we designed and implemented single-client parameter server training. Here we have one client that coordinates a cluster of worker and parameter server instances to perform the training. User code is deployed on the coordinator to drive model execution, just like the code shown here in the bottom-left corner. All the worker and parameter server instances just run the standard TensorFlow server binaries. Users first define their model as a tf.function and then call the schedule API to dispatch the model functions to the workers. As previously mentioned, the schedule API is asynchronous, which means that the functions are only queued up on the coordinator side and the schedule call returns control to the user's code immediately.

Here we already have several functions queued up on the coordinator side, and a background thread pool dispatches those functions to workers. Each function is actually a multi-device, or multi-worker, function that spans one worker and multiple parameter server instances to read and update the model variables. The function gets instantiated and partitioned on the workers and then executed on both the workers and the parameter servers. The function's return status and return values are then sent back to the coordinator. This way, PS training achieves both model and data parallelism, with the model partitioned across multiple parameter server tasks and the training performed on multiple data batches on multiple worker tasks concurrently.

Let's zoom further into these distributed functions and see how they get partitioned and executed on different tasks. Here I'm showing an example of two workers
and two parameter servers. We have a very simple model with only two variables: v0 placed on ps0 and v1 placed on ps1. The coordinator launches a model computation function replicated on these two workers. In each training step, they both read the latest version of the variables, perform the model computation to get the gradients, and finally apply the gradients to those variables. Since these variables are actually located on remote tasks on different hosts, the function needs to be instantiated and partitioned as distributed functions, as shown in this graph. Here it shows that the function on worker 0 gets partitioned into three component functions: one is located locally on worker 0, and the others are located remotely on ps0 and ps1 respectively. The local sub-function on worker 0 is mainly responsible for model computation, and the ones on the parameter servers are for sending variables and aggregating the gradients. The nodes in different functions are connected by send and receive ops. The arrows in this graph indicate data dependencies, and I'm using dashed arrows for remote data dependencies. The same partitioning logic also applies to the remote function running on worker 1, which is executed independently from the function executing on worker 0.
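The pull, compute, and push pattern of one training step can be sketched in plain Python. This is a single-process toy, not the actual partitioned tf.function: the dictionaries standing in for ps0 and ps1, the tiny linear model, the learning rate, and the helper names are all assumptions made for illustration.

```python
# Toy, single-process model of asynchronous training steps (not TensorFlow code).
# Two "parameter servers" each own one variable; a worker pulls, computes, pushes.

parameter_servers = {"ps0": {"v0": 1.0}, "ps1": {"v1": 2.0}}
placement = {"v0": "ps0", "v1": "ps1"}
LEARNING_RATE = 0.1  # hypothetical value

def pull(name):
    """Remote read: fetch the latest value from the owning parameter server."""
    return parameter_servers[placement[name]][name]

def push_gradient(name, grad):
    """Remote update: the owning parameter server applies the gradient."""
    parameter_servers[placement[name]][name] -= LEARNING_RATE * grad

def worker_step(x, y):
    """Local part: forward and backward pass for loss = (v0 * x + v1 - y) ** 2."""
    v0, v1 = pull("v0"), pull("v1")
    error = v0 * x + v1 - y
    push_gradient("v0", 2 * error * x)  # d(loss)/d(v0)
    push_gradient("v1", 2 * error)      # d(loss)/d(v1)
    return error ** 2

loss_worker0 = worker_step(x=1.0, y=2.0)  # step run by "worker 0"
loss_worker1 = worker_step(x=2.0, y=3.0)  # step run by "worker 1", sees the update
```

In the real system the two worker steps would run concurrently on different hosts; running them one after the other here only keeps the toy deterministic.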
Note that the function graphs here are just for illustrative purposes; they're not the actual definitions of the function graphs. The real graphs involve several variable read and write ops and some remote copies, which we omitted here for simplicity.

Next, I will talk more about the performance measurement of the PS strategy. Our performance is measured by training a large embedding model with parameter servers. This kind of model represents a very common production use case for parameter servers: because the embedding model typically has a very large vocabulary size and an embedding layer that cannot fit on one device, it needs to be partitioned among multiple parameter servers to achieve model parallelism. The embedding model we tested here has two embedding layers, a user embedding and an item embedding, followed by three fully connected dense layers. There are more than one billion trainable parameters in total in this model. We measure its performance with different cluster sizes and compare it with the Estimator.

Here is a preliminary result. We have the number of workers growing from 10 to 320 on the x-axis, and the measured throughput in terms of global steps per second on the y-axis. Note that the x-axis is actually in log scale, and we also increase the number of parameter server instances proportionally to keep the ratio of workers to PS at five to one. According to the TFX team, this ratio is the default production practice. We first look at the plot with only two series of numbers, measured from running the Estimator and the vanilla PS strategy. We can see that the vanilla PS strategy deployment does not perform as well as the Estimator once the cluster scales to more than 80 workers, and the performance gap increases further with larger cluster sizes. This is mainly due to two reasons. First of all, the Estimator runs the client code locally on each worker, but the PS strategy
actually needs to dispatch a function from a remote coordinator, so the network round-trip time is exposed as part of the step time. Second, since the current coordinator implementation creates one Python thread per worker, there is more and more Python GIL overhead in larger clusters due to context switching on the coordinator, and this GIL overhead can be quite large with more and more workers.

What we can do is use a technique we call multi-step packing to mitigate this problem. In the user's custom training loop code, we can pack multiple training steps into each function, as opposed to running just one step per function. We can see that if we pack just five steps into each function, the PS strategy achieves the same performance as the Estimator for this specific model. To be more concrete, the code here on the right-hand side implements multi-step packing with an inner loop that packs N model training steps inside each tf.function. As we saw on the previous slide, using N equals 5 is enough to achieve the same performance as the Estimator for this specific workload. Note that we're not the first to introduce this worker training loop concept; it is also in the TPU model performance guide as a performance knob that users can tune in their models.

We can look further into the performance impact of multi-step packing by inspecting the profiles we grabbed from the workers. The one on the top is training without multi-step packing, and the one on the bottom is training with multi-step packing. Each small block here can be roughly treated as the time it takes to run one function on the worker. There are actually many other small ops in the thread pool below, but we're not showing them here due to space constraints. We can see that the gap time between steps is roughly the same for both profiles: 5 to 10 milliseconds in this case. This includes the
network round-trip time between the coordinator and the workers, as well as the coordinator's Python overhead waiting for the GIL. With multi-step packing, the amortized gap time per training step is much smaller, so the training throughput can be improved a lot.

To demonstrate the improvement even further, in the next set of experiments we slightly change the model to make it smaller and less computation-intensive. We do so by reducing the number of dense layers from three to only one. This way, each model step can be finished in a much shorter time on the worker side. Since we made the model smaller, we also need to change the worker-to-PS ratio from the original five to one down to two to one, so that the parameter servers have more computation capacity to accommodate the workers. In the performance experiment here, we vary the number of workers from 8 to 256 on the x-axis and measure the performance on the y-axis. We can see that with 10 or more steps packed inside each function, the performance is consistently higher than the Estimator's. Note that this model is intentionally designed to have a much larger throughput, at thousands of steps per second, just to amplify the performance improvement from multi-step packing. In production models we typically see only a few hundred steps per second, which is much less than the performance here, so packing 5 to 10 steps into each function should be more than enough.

While multi-step packing can normally improve performance, there are also some subtle issues users need to be aware of. First of all, it increases the cost of worker failure recovery. The basic unit for the coordinator to retry is the function. Recall that the user schedules a lot of functions in the coordinator asynchronously, and each function can carry only one return status or return value. So if the function fails, the coordinator will think that all steps in that function have
already failed, so it needs to rerun all the steps in that function on a different worker, even if some of the steps might have already finished successfully. Second, the coordinator has less fine-grained control over the training steps and over getting the return values and metrics. Finally, with the custom training loop API, multi-step packing requires user code changes, so it's not transparent to users. On the positive side, multi-step packing effectively reduces the coordinator overhead, and we have yet to see a use case where multi-step packing negatively affects performance. In the future, when the PS strategy has native Keras support with the compile/fit API, multi-step packing can also be hidden inside the library. One thing to reiterate, as we mentioned before, is that multi-step packing is a solution to overcome the current limitations of our implementation, and we do plan to keep improving the efficiency of the coordinator, potentially by moving some of the user-space threads into the runtime. When we achieve that, users will no longer need to worry about thread contention or tuning the multi-step packing knob.

Next, I will briefly talk about the results from large-scale fault tolerance testing. From this we can see that nearly half of the Brain jobs actually failed due to preemptions, and this is even more likely to happen when users are running jobs on a cluster of hundreds or thousands of preemptible machines. Without fault tolerance support, worker failures can cause the job either to hang forever or to restart frequently from a checkpoint, with a huge waste of idle resources during that period. For TF2 parameter server training to be ready for production use cases, we need to achieve the following. First of all, failure handling must be reliable and fast: failed workers should not prevent other healthy workers in the cluster from making
continued training progress. The second requirement is that failure recovery must be efficient: when new resources or workers join the pool, the strategy and the runtime should work together to help the user's code automatically rebuild the resources and continue to run the model. The final requirement is that the model should achieve at least the same level of scalability and performance as the TensorFlow v1 Estimator-based implementation.

In TensorFlow 2, PS training and the eager runtime work together to handle worker failures. We create multiple threads dispatching functions to remote workers. If a function returns with an unavailable error, indicating that the remote worker has failed, it signals a background thread to update the server def, which updates the cluster view on all the participating tasks with the new cluster membership. Other workers can meanwhile keep handling scheduled function execution requests from the coordinator without being interrupted. If the failure happens to a parameter server, which is not shown on the slide, the coordinator will reinitialize the cluster and reload the model from the latest checkpoint. This way, we can handle both worker and parameter server failures.

We have tested parameter server training extensively with three goals in mind. The first is large scale. According to the TFX team, 200 workers is about the average size of production jobs, and 1,000 workers is above the 99th percentile of deployed production jobs, so we need to make sure that jobs can run smoothly at both of these scales. The second goal is low-priority resources: production jobs using batch resources often encounter, and must tolerate, both PS and worker failures. The third goal is long duration: we run jobs that last for days to rule out any potential out-of-memory issues that might be due to the accumulation of state. In this set of experiments, we run these workflows at different priorities,
at different scales, and with different numbers of workers. During execution, these jobs encountered up to thousands of worker preemptions and more than 100 PS preemptions, and the cluster and the job were able to make continuous training progress during the entire process without hanging, failures, or any other issues.

Making the training process fault tolerant allows us to run large-scale jobs on preemptible resources. According to Google's resource pricing on Borg, using preemptible resources, especially at batch priority, can reduce resource cost by 50 to 85 percent compared to using dedicated, highly available resources, so supporting preemptible resources is crucial for large-scale production jobs. Here we take the TFX flux canary job as an example. This table shows the resource cost reduction in this production deployment. The columns in the middle show the amount and priority of resources and their normalized cost when using dedicated resources, and the column on the right shows the resources and normalized cost when using preemptible resources. For this specific job, which typically lasts about four hours, using preemptible resources can save Google over 1,000 GCU hours per run, an over 60 percent resource cost reduction.

Finally, we will talk about the multi-worker testing framework that we set up. The motivation for this work is that TensorFlow previously did not have an automated multi-worker testing environment on Borg. In the v1 Estimator world, we relied heavily on users or other teams such as TFX to report regressions or breakages, and then we jumped directly into the users' models to debug those problems. This was not a very productive setup. The goal of this framework is to continuously monitor the fault tolerance and performance of the PS strategy on a set of standardized models and serve as the first line of defense against any potential breakages or regressions. The architecture of
the testing framework is like this. We first set up a standard Guitar workflow, which is triggered automatically to run on a schedule, such as every few hours. The Guitar job then starts a separate sandbox cluster on Borg, where we actually launch our TensorFlow server instances. We use a job inside Guitar as the coordinator, and it schedules and coordinates different model functions to run in the cluster. We also run a job continuously to inject failures, which simulates Borg preemptions, so that we can ensure the fault tolerance mechanism is functioning properly. Finally, the coordinator reports the benchmark results to the ML Compass dashboard for us to visualize and monitor.

Currently this multi-worker testing environment runs the MNIST, ResNet, and recommendation models continuously on small- to medium-scale clusters. Here is a sample plot of some recent statistics of job completion time with and without injecting these failures. We will also keep working on expanding the test scope to include more machine learning models running on large-scale clusters and low-priority resources.

All right, so we went through all the content. To summarize, in this talk we introduced the parameter server strategy, which was recently made publicly available in the TensorFlow 2.4 release. We discussed the design decision of using a single-client approach, explained the programming model with the non-blocking schedule/join APIs, and talked about the process of training and validating the model and using sharded variables with the PS strategy. We then explained how the TF2 eager runtime backs the PS strategy with distributed multi-device function execution. We discussed the performance issues we encountered and demonstrated that we can match or exceed Estimator's performance with the new PS strategy. We also did a lot of scalability and fault tolerance stress testing to make sure that the PS strategy
can run at large scale with low-priority resources, and we also developed this multi-worker testing framework to ensure the health of our product. With that, thank you for your attention, and we're happy to take any questions.
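
The worker fault tolerance described in the talk (a coordinator schedules functions, a failed worker returns an unavailable error, and the work is retried while other workers carry on) can be illustrated with a small pure-Python simulation. This is only a sketch under stated assumptions and not TensorFlow's implementation: `WorkerUnavailableError`, `SimulatedWorker`, and `ToyCoordinator` are hypothetical names, the dispatch loop is single-threaded rather than multi-threaded, and preemptions are faked with a seeded random number generator.

```python
import collections
import random


class WorkerUnavailableError(Exception):
    """Hypothetical stand-in for the 'unavailable' error from a failed worker."""


class SimulatedWorker:
    """A fake worker that is 'preempted' with some probability on each call."""

    def __init__(self, name, failure_rate, rng):
        self.name = name
        self.failure_rate = failure_rate
        self.rng = rng

    def run(self, fn, *args):
        if self.rng.random() < self.failure_rate:
            raise WorkerUnavailableError(self.name)
        return fn(*args)


class ToyCoordinator:
    """Single-client coordinator: schedule() queues work, join() drains it."""

    def __init__(self, workers):
        self.workers = list(workers)
        self.queue = collections.deque()
        self.results = []
        self.retries = 0

    def schedule(self, fn, *args):
        # Non-blocking: only records the request for later dispatch.
        self.queue.append((fn, args))

    def join(self):
        # Blocks until every scheduled function has completed on some worker.
        turn = 0
        while self.queue:
            fn, args = self.queue.popleft()
            worker = self.workers[turn % len(self.workers)]
            turn += 1
            try:
                self.results.append(worker.run(fn, *args))
            except WorkerUnavailableError:
                # Put the function back so a later worker can retry it.
                self.retries += 1
                self.queue.append((fn, args))
        return self.results


rng = random.Random(0)  # seeded so the simulated preemptions are repeatable
workers = [SimulatedWorker(f"worker-{i}", 0.3, rng) for i in range(4)]
coordinator = ToyCoordinator(workers)
for step in range(20):
    coordinator.schedule(lambda x: x * x, step)
results = coordinator.join()
print("all steps completed:", sorted(results) == [s * s for s in range(20)])
print("retries after simulated preemptions:", coordinator.retries)
```

Because failed functions are re-queued, results may complete out of order, which matches the asynchronous nature of parameter server training. A parameter server failure is a different case in the talk (re-initialize the cluster and reload from the latest checkpoint) and is not modeled here.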

Original Description

In this episode of Inside TensorFlow, Software Engineers Yuefeng Zhou and Haoyu Zhang demonstrate parameter server training. Parameter server training is a common data-parallel method to scale up model training on multiple machines. Parameter server training tutorial → http://goo.gle/2LtWQ98 Watch more Inside TensorFlow → https://goo.gle/Inside-TensorFlow Subscribe to the TensorFlow channel → https://goo.gle/TensorFlow #InsideTensorFlow

This video tutorial demonstrates how to use parameter server training in TensorFlow to scale up model training on multiple machines. It covers the basics of data-parallel methods and how to implement them using TensorFlow.

Key Takeaways

Set up a TensorFlow environment
Define a model architecture
Configure parameter server training
Train a model using parameter server training
Evaluate model performance

💡 Parameter server training is an effective way to scale up model training on multiple machines, allowing for faster training times and larger model sizes.
