Yuefeng Zhou (yuefengz), Google, Mountain View, CA

sql-machine-learning/elasticdl 475

Kubernetes-native Deep Learning Framework

yuefengz/benchmarks 1

Benchmark code

alalei/NightyBird_Server 0

Server program for NightyBird App

yuefengz/community 0

Stores documents used by the TensorFlow developer community

yuefengz/ecosystem 0

Integration of TensorFlow with other open-source frameworks

yuefengz/models 0

Models and examples built with TensorFlow

yuefengz/tensorflow 0

An Open Source Machine Learning Framework for Everyone

pull request comment tensorflow/tensorflow

Try to make _global_policy thread local

I am thinking about trying not to expose new user interfaces, but I don't know the complete flow of these internals in depth. Is there a way to identify that we are inside a MirroredStrategy thread?

If tf.distribute.get_replica_context() is None, then we are inside a MirroredStrategy thread.
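For reference, a minimal probe of the context APIs mentioned above (a sketch assuming TF 2.x; it only prints what tf.distribute reports in each place, per the tf.distribute documentation, rather than asserting which check is the right one for this PR):

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()

def replica_fn():
  # Inside strategy.run(), each MirroredStrategy replica runs in its own
  # thread with its own replica context.
  ctx = tf.distribute.get_replica_context()
  tf.print("in replica_fn: replica context is None:", ctx is None,
           "| cross-replica context:", tf.distribute.in_cross_replica_context())

with strategy.scope():
  # Directly under strategy.scope() we are in cross-replica context.
  print("under scope: replica context is None:",
        tf.distribute.get_replica_context() is None)
  strategy.run(replica_fn)
```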

bhack

comment created time in 13 days

pull request comment tensorflow/tensorflow

Try to make _global_policy thread local

tf.distribute.register_thread_local(thread_local) would introduce some global state to tf.distribute, which might be fine given that mixed precision is global as well.

What is the problem with setting the thread local in a replica_fn? Could adding an argument to run be an option?
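To make that alternative concrete, a minimal sketch of setting the thread-local state inside a replica_fn (the threading.local here is a stand-in for Keras' _global_policy, and register_thread_local discussed above is a proposed, not an existing, API):

```python
import threading
import tensorflow as tf

_policy = threading.local()  # stand-in for the Keras global policy state

def set_policy(name):
  _policy.value = name

strategy = tf.distribute.MirroredStrategy()

def replica_fn():
  # Each MirroredStrategy replica thread sets its own copy of the
  # thread-local state before doing any work that reads it.
  set_policy("mixed_float16")
  tf.print("policy in replica:", _policy.value)

with strategy.scope():
  strategy.run(replica_fn)
```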

bhack

comment created time in 19 days

pull request comment tensorflow/tensorflow

Try to make _global_policy thread local

The main issue with making this thread local is that MirroredStrategy spawns multiple threads. If the policy is thread local and the policy is updated, the new threads won't see the new values of the policy. If you run the test //tensorflow/python/keras/mixed_precision/experimental:keras_test, it should fail due to this issue.

TensorFlow has lots of thread local variables. Normally to get around this, this class copies over the values of the thread local variables from the old thread to the new thread. Unfortunately, we cannot have it copy the value of the Keras policy, since we cannot introduce a new dependency from TensorFlow to Keras.

@guptapriya, @yuefengz, can we introduce an API to allow new thread-local variables to be copied by distribution strategy? Alternatively, perhaps the policy should be kept as a global variable instead of a thread local variable.

What API are you thinking of? If this would be a user-visible change anyway, is it acceptable to set the thread_local object in a replica_fn?
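As a sketch of the "copy thread-local values into new threads" idea described above (a generic pattern only, not the actual MirroredStrategy internals; the snapshot/restore helpers are hypothetical names):

```python
import threading

_state = threading.local()  # e.g. a thread-local policy

def snapshot_thread_locals():
  # Capture the values we want child threads to inherit from the parent.
  return {"policy": getattr(_state, "policy", None)}

def run_in_child(snapshot, fn):
  def wrapper():
    # Restore the parent's thread-local values in the child thread
    # before running the user function.
    _state.policy = snapshot["policy"]
    fn()
  t = threading.Thread(target=wrapper)
  t.start()
  t.join()

_state.policy = "mixed_float16"
run_in_child(snapshot_thread_locals(),
             lambda: print("child sees policy:", _state.policy))
```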

bhack

comment created time in 19 days

Pull request review comment tensorflow/community

[RFC] Elasticity and Fault tolerance in Multi-WorkerMirroredStrategy

[Quoted RFC diff: "Elasticity and Fault tolerance in MultiWorkerMirroredStrategy (MWMS)" by Samuel Oshin, covering the Objective, Motivation, and Design Proposal sections. The excerpt ends at the Update Cluster section, which notes that collective algorithms will need to become dynamic and that the group_size and group_key node attributes will need to become runtime inputs.]

Anyway, changing the collective ops is acceptable if rebuilding the graph turns out to be really expensive. My point was that you can start building a quick prototype on top of the existing MWMS. This quick prototype won't require changing collective ops if you rebuild the graphs. It would allow us to better understand the potential problems and necessary changes, and it would also let potential users try it out earlier.
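A rough sketch of such a prototype, under the assumption that on every membership change the strategy, variables, dataset, and traced step function are simply rebuilt from the new TF_CONFIG (rebuild_training and make_dataset are illustrative helpers, not an existing API; restoring the previous variable values into the rebuilt model, e.g. from a checkpoint or a broadcast from the chief, is omitted):

```python
import json
import os
import tensorflow as tf

def make_dataset(global_batch_size):
  # Toy in-memory dataset; a real job would read (and reshard) files here.
  features = tf.random.uniform([1024, 10])
  labels = tf.random.uniform([1024, 1])
  return tf.data.Dataset.from_tensor_slices(
      (features, labels)).batch(global_batch_size).repeat()

def rebuild_training(cluster_spec, task_type, task_id, global_batch_size=64):
  # Recreate strategy, variables, dataset, and traced step for a new cluster.
  os.environ["TF_CONFIG"] = json.dumps(
      {"cluster": cluster_spec, "task": {"type": task_type, "index": task_id}})
  strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()
  with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
    optimizer = tf.keras.optimizers.SGD(0.01)
  dist_dataset = strategy.experimental_distribute_dataset(
      make_dataset(global_batch_size))

  @tf.function
  def train_step(inputs):
    def step_fn(batch):
      x, y = batch
      with tf.GradientTape() as tape:
        loss = tf.reduce_mean(tf.square(model(x) - y))
      grads = tape.gradient(loss, model.trainable_variables)
      optimizer.apply_gradients(zip(grads, model.trainable_variables))
      return loss
    return strategy.run(step_fn, args=(inputs,))

  return strategy, model, dist_dataset, train_step
```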

sboshin

comment created time in a month

Pull request review comment tensorflow/community

[RFC] Elasticity and Fault tolerance in Multi-WorkerMirroredStrategy

[Same quoted RFC excerpt as above, again ending at the Update Cluster section on group_size and group_key.]

Whether to rebuild your variables depends on how you implement your solution, I think. But I think it is simpler to rebuild variables. From how you instrumented it, it seems to me that your step 0 contains dataset creation/initialization time as well, which could be quite expensive. Rebuilding the dataset in case of a worker pool change is necessary if you want to reshard your dataset.

Here we are talking about the necessity of adding group_size and group_key as inputs to the collective ops. In my opinion, simply recreating the graph should be good enough, so we can just see how expensive it is to recreate the graph.
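For the resharding point above, a minimal sketch of what rebuilding the dataset looks like when the worker pool changes (file names and worker counts are illustrative):

```python
import tensorflow as tf

def build_sharded_dataset(filenames, num_workers, worker_index, batch_size=32):
  # Shard at the file level so each worker reads a disjoint slice of the data;
  # after a membership change this pipeline is rebuilt with the new
  # num_workers / worker_index.
  ds = tf.data.Dataset.from_tensor_slices(filenames)
  ds = ds.shard(num_workers, worker_index)
  ds = ds.interleave(tf.data.TFRecordDataset, cycle_length=4)
  return ds.batch(batch_size).prefetch(tf.data.experimental.AUTOTUNE)

# Cluster shrinks from 4 workers to 3: rebuild with the new shard layout.
files = ["gs://bucket/train-%05d.tfrecord" % i for i in range(64)]
ds_before = build_sharded_dataset(files, num_workers=4, worker_index=1)
ds_after = build_sharded_dataset(files, num_workers=3, worker_index=1)
```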

sboshin

comment created time in a month

Pull request review comment tensorflow/community

[RFC] Elasticity and Fault tolerance in Multi-WorkerMirroredStrategy

[Same quoted RFC excerpt as above, again ending at the Update Cluster section on group_size and group_key.]

Does step 1 here include GPU initialization, variable creation, and dataset creation time? I think you can just measure the function tracing time to estimate the cost of rebuilding graphs.
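A minimal way to measure just the tracing cost mentioned above (the model and input shapes are placeholders):

```python
import time
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(1)])

@tf.function
def train_step(x):
  return model(x)

start = time.perf_counter()
# get_concrete_function triggers tracing (graph construction) without running
# the step; the first trace also creates the layer variables, which is part of
# what a rebuild would pay for.
train_step.get_concrete_function(tf.TensorSpec([None, 10], tf.float32))
print("tracing took", time.perf_counter() - start, "seconds")
```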

sboshin

comment created time in a month

Pull request review comment tensorflow/community

[RFC] Elasticity and Fault tolerance in Multi-WorkerMirroredStrategy

[Same quoted RFC excerpt as above, here ending at the Membership Change Process section: workers ping the chief with a timeout and, if the chief does not respond, ping the next chief candidate.]

Re the consensus algorithm, I am not sure whether it is a good idea to require users to configure these services. Another option is similar to the single-client PS design, which uses a single coordinator for the whole training cluster; but that assumes the client won't fail, otherwise distributed consensus is still needed. These elastic resources (e.g. the elastic dataset) may be a better fit in the client layer (we'll separate the client and the strategy class).

sboshin

comment created time in a month

Pull request review comment tensorflow/community

[RFC] Elasticity and Fault tolerance in Multi-WorkerMirroredStrategy

+### _Membership Change Process_
+
+The membership change process happens when the cluster needs to be modified. This will take place when new nodes are being added to the cluster, or there is a NetworkException raised. The membership change process will attempt to figure out failed nodes, new nodes, and send out cluster spec updates to each worker.
+
+* Everyone ping chief
+    * timeout for chief to respond
+    * If chief doesn’t respond ping next chief candidate

I think most of the changes can live in a separate repository (we've been thinking of creating a new repo). It is reasonable if some changes are needed in the TF core, but you can start with a prototype or a demo using monkey-patching, linking in new kernels, etc.
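
A minimal sketch of what such an out-of-tree prototype could look like, assuming only public TF 2.x APIs. `query_cluster_spec` and `build_step_fn` are hypothetical user-supplied hooks, and retrying on `UnavailableError` is just one possible recovery policy rather than the RFC's design:

```python
import json
import os

import tensorflow as tf


def make_strategy(cluster_spec, task_type, task_id):
  """Builds a fresh MultiWorkerMirroredStrategy from a (possibly new) cluster."""
  os.environ["TF_CONFIG"] = json.dumps(
      {"cluster": cluster_spec, "task": {"type": task_type, "index": task_id}})
  return tf.distribute.experimental.MultiWorkerMirroredStrategy()


def run_elastically(build_step_fn, query_cluster_spec, task_type, task_id,
                    num_steps):
  """Runs training steps, rebuilding the strategy whenever a peer drops out."""
  strategy = make_strategy(query_cluster_spec(), task_type, task_id)
  step_fn = build_step_fn(strategy)  # creates variables and the tf.function under strategy.scope()
  step = 0
  while step < num_steps:
    try:
      strategy.run(step_fn)  # experimental_run_v2 in older 2.x releases
      step += 1
    except tf.errors.UnavailableError:
      # A worker became unreachable: rediscover the cluster and rebuild.
      strategy = make_strategy(query_cluster_spec(), task_type, task_id)
      step_fn = build_step_fn(strategy)
```

Everything above lives outside the TF core, which is what makes it a reasonable starting point for a prototype.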

sboshin

comment created time in a month

Pull request review comment tensorflow/community

[RFC] Elasticity and Fault tolerance in Multi-WorkerMirroredStrategy

+### _Update Cluster_
+
+The update cluster process happens when a node receives a new cluster spec. This cluster spec needs to modify the worker node to discover or blacklist nodes. Currently TensorFlow doesn’t allow for dynamic cluster sizes. There are several identified things that need to be changed in order to allow for dynamic sizes.
+
+* Collective algorithms will need to be dynamic
+    * There are a node def attributes that need to become inputs that can be set at runtime
+        * **group_size** and **group_key** will need to be Inputs

We've optimized graph building time and it should be pretty fast these days.
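
As a rough illustration of that point, the one-time cost of building (tracing) a graph can be measured directly on a toy `tf.function`; this single-machine example is only a stand-in for the multi-worker case:

```python
import time

import tensorflow as tf

dense = tf.keras.layers.Dense(1024)


@tf.function
def step(x):
  return dense(x)


x = tf.ones([32, 1024])

start = time.perf_counter()
step(x)  # first call traces and builds the graph
print("first call (includes tracing):", time.perf_counter() - start)

start = time.perf_counter()
step(x)  # later calls reuse the cached graph
print("cached call:", time.perf_counter() - start)
```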

sboshin

comment created time in a month

Pull request review comment tensorflow/community

[RFC] Elasticity and Fault tolerance in Multi-WorkerMirroredStrategy


You can just rebuild the graph/function without needing to pass new group sizes or keys.
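
A hedged sketch of that approach: after `TF_CONFIG` is updated to the new cluster, a fresh strategy plus a freshly traced `tf.function` recreate the collective ops for the new group size, and a checkpoint carries the model state across the rebuild. `build_model` and `new_tf_config` are assumed to come from the user and from the membership change process; this is an illustration, not the RFC's mechanism:

```python
import json
import os

import tensorflow as tf


def rebuild_after_membership_change(new_tf_config, build_model, ckpt_dir):
  """Recreates the strategy and step function for the new cluster size."""
  os.environ["TF_CONFIG"] = json.dumps(new_tf_config)
  strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()

  with strategy.scope():
    model, optimizer = build_model()  # variables are recreated under the new strategy

  ckpt = tf.train.Checkpoint(model=model, optimizer=optimizer)
  ckpt.restore(tf.train.latest_checkpoint(ckpt_dir))  # carry state across the rebuild

  @tf.function
  def train_step(features, labels):
    def replica_fn(x, y):
      with tf.GradientTape() as tape:
        loss = tf.reduce_mean(tf.keras.losses.mean_squared_error(y, model(x)))
      grads = tape.gradient(loss, model.trainable_variables)
      optimizer.apply_gradients(zip(grads, model.trainable_variables))
      return loss
    # Tracing this function again creates new collective ops sized for the
    # rebuilt cluster, so no group_size/group_key has to be passed at runtime.
    return strategy.run(replica_fn, args=(features, labels))

  return strategy, train_step
```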

sboshin

comment created time in a month

Pull request review comment tensorflow/community

[RFC] Elasticity and Fault tolerance in Multi-WorkerMirroredStrategy


It seems difficult for all workers to agree on the failure state of the chief without any coordination.
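
For illustration only, one low-tech coordination point that makes the chief's status well defined: every worker heartbeats to shared storage, and the chief is deterministically the lowest-indexed worker with a fresh heartbeat, so all workers compute the same answer without pinging each other. The directory path and timeout below are assumptions, not part of the proposal:

```python
import os
import time

HEARTBEAT_DIR = "/shared/heartbeats"  # assumed to be visible to every worker
TIMEOUT_S = 30


def heartbeat(task_id):
  """Called periodically by each worker to record that it is alive."""
  with open(os.path.join(HEARTBEAT_DIR, "worker-%d" % task_id), "w") as f:
    f.write(str(time.time()))


def elect_chief(num_workers):
  """Returns the lowest-indexed worker with a recent heartbeat, or None."""
  now = time.time()
  for task_id in range(num_workers):
    path = os.path.join(HEARTBEAT_DIR, "worker-%d" % task_id)
    try:
      with open(path) as f:
        last = float(f.read().strip())
    except (OSError, ValueError):
      continue
    if now - last < TIMEOUT_S:
      return task_id
  return None
```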

sboshin

comment created time in a month

issue comment tensorflow/tensorflow

[Feature Request] Dynamic RPC Address Resolution

@byronyi Launching jobs on multiple remote GPU workers is not supported. When we do support it, probably via the single-client architecture, checkpoints can be made periodically by the client. If any worker fails, there is already some mechanism to update the cluster configuration, clear the eager context, rebuild resources, load from checkpoint, etc., but other components are still missing to make remote task dispatching work properly. We will consider supporting that. cc @crccw
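
A rough sketch of that client-side recovery loop using only public checkpoint APIs; `rebuild_cluster` and `build_model_and_data` are hypothetical placeholders for the cluster-reconfiguration and setup steps that are not yet supported:

```python
import tensorflow as tf


def train_with_recovery(build_model_and_data, rebuild_cluster, ckpt_dir, num_steps):
  """Periodically checkpoints and resumes from the last checkpoint on failure."""

  def setup():
    model, optimizer, step_fn = build_model_and_data()
    ckpt = tf.train.Checkpoint(model=model, optimizer=optimizer,
                               step=tf.Variable(0, dtype=tf.int64))
    manager = tf.train.CheckpointManager(ckpt, ckpt_dir, max_to_keep=3)
    manager.restore_or_initialize()
    return step_fn, ckpt, manager

  step_fn, ckpt, manager = setup()
  while int(ckpt.step) < num_steps:
    try:
      step_fn()
      ckpt.step.assign_add(1)
      if int(ckpt.step) % 100 == 0:
        manager.save()  # periodic checkpoint made by the client
    except tf.errors.UnavailableError:
      rebuild_cluster()  # update the cluster configuration, reset the context
      step_fn, ckpt, manager = setup()  # rebuild resources, load from checkpoint
```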

make1980

comment created time in a month

Pull request review comment tensorflow/community

RFC: Sparse Domain Isolation for Supporting large-scale Sparse Weights Training.

+# Sparse Domain Isolation for supporting large-scale Recommender Systems.++| Status        | Proposed                                             |+:-------------- |:---------------------------------------------------- |+| **Author(s)** | Haidong Rong (hudsonrong@tencent.com) Yafei Zhang(kimmyzhang@tencent.com) Jiandong Wang(adnywang@tencent.com) |+| **Sponsor**   | Yuefeng Zhou (yuefengz@google.com)                   |+| **Updated**   | 2020-05-23                                           |++## Background++In recent years, many deep learning components (such as FC, RNN, etc.) and online-learning policy were introduced to the recommender systems that result in the special framework for recommender system (such as DiFacto, etc.) gradually unable to meet the demand. More and more algorithm engineers turn to use open-source general frameworks(such as Tensorflow, etc.)  to improve production efficiency.+In the industrial recommendation scenarios, the parameters are high-dimensional sparse with the following characteristics:+1. The theoretical total number of dynamic features is 2^64, the size of sparse weights is more than 100GB or even several TB which make reserving memory at the beginning too expensive.+2. The online-learning tracks user interests in real time and requires eliminating and adding weights on running-time without frequently restart training.++![Models Weights Compare](20200424-sparse-domain-isolation/models-weights-compare.png)++## Objective++TensorFlow supports large-scale dynamic sparse weights training.++## Problem & Motivation++TensorFlow is well designed, but not support creating and training the large-scale dynamic sparse weights. The main obstacles are:+1. The `tf.Variable` is fixed size on running time and saved in continuous memory and unable to add or remove weights dynamically.+2. The typical training mode of full updating in TF is very inefficient when training xTB models.+3. Usually, the sparse weights are saved in hash table form (K-V pairs for sparse feature ID and sparse weights) but the optimizers of TF cann't train directly sparse weights and accumulators(slots) saved in hash tables.++For the above reasons, the official version of TensorFlow can only be used for offline model hyperparameter tuning with fixed size features, and the model can not exceed 100GB generally. The application in the industrial scene is very limited. At present, We customized TensorFlow and successfully applied it to large-scale sparse features training of industrial recommendation scene. 
We hope to contribute this work of general components to the community.++## Design Proposal++### The ideal semantics:++```python+import tensorflow.dynamic_embedding as dye++# sparse weights defination:+w = dye.get_variable(name="dynamic_embeddings",+                     devices=["/job:ps/replica:0/task:0/CPU:0",],+                     dim=8)+z = dye.embedding_lookup(params=w,+                         ids=x, # x is got from samples.+                         name="wide-sparse-weights")+                         # z should be a trainable and variable-size instance.+# forward graph++pred = tf.XXX(...)++# backward graph+loss = tf.XXX(pred, label, ...)+opt = tf.train.AdamOptimizer(learning_rate=0.1)+update_op = opt.minimize(loss)+```++## Sparse Domain Isolation++### Overview of Design+![Overview Flow Chart](20200424-sparse-domain-isolation/overview-flow-chart.png)++### Design Considerations++![Architecture of SDI](20200424-sparse-domain-isolation/architecture.png)+*   Minimize changing on Tensorflow core and try to use native components.++![Expression Weights Hierarchically](20200424-sparse-domain-isolation/expression-weights-hierarchically.png)+*   The `tf.Tensor` is still suitable for holding iteration weights, because only a small part of weights will be updated in each iterations.++#### Hash Table++Refer to `MutableHashTable` implemented in `tf.lookup`. The reason is as follows:++*   The `MutableHashTable` meets the basic requirements of sparse weights.+*   The `tf.lookup` is well designed and easy to maintain.+*   The save/restore from/to checkpoint is already supported.+*   Compatible with all kind of **distributed strategy**(including parameters server.)+++### Detailed Design and Implementation++The trainable warp class `resource_variable_ops.TrainableWrap` inherted from `ResourceVariable` will be introduced, and help us reuse all optimizers.++![detail of SDI](20200424-sparse-domain-isolation/optimizers-reuse-scheme.png)++### APIs Overview++* Name Space(**To Be Discussed**)+    *  USER API: `tf.dynamic_embedding`+    *  Tensorflow Core: `from tensorflow.xx.xx import dynamic_embedding_ops`  +    +    +* Creating Weights+    * `tf.dynamic_embedding.Variable`+    * `tf.dynamic_embedding.get_variable`++* Trainable Wrap+    * `tf.resource_variable_ops.TrainableWrap`+    * `tf.dynamic_embedding.embedding_lookup`+    * `tf.dynamic_embedding.embedding_lookup_sparse`++#### Name Space++All of the APIs are implemented in `tf.dynamic_embedding` package.+Maybe the `tf.nn` or `tf.keras` would be better choices.++```python+import tensorflow as tf # core version+import tensorflow_recsys as tfrs # SIG version++weights = tf.dynamic_embedding.get_variable(*args, **kwargs)+```++#### Creating Sparse Weights++We design the `dynamic_embedding.Variable` based on the `tf.lookup.MutableHashTable` and make it support distributed scenario, there is no need to develop any new operators for the hash table. 
+The `dynamic_embedding.Variable` backed up by a group of hashtables is responsible for actually holding the memory resource of sparse weights.++* `tf.dynamic_embedding.Variable`+    * `export`+    * `remove`+    * `upsert` (update if exist else insert) +    * `size`+* `tf.dynamic_embedding.get_variable`++```python+@tf_export("dynamic_embedding.Variable")+class Variable(object):+  """+  A Distributed version of HashTable(reference from lookup_ops.MutableHashTable).+  It is designed to dynamically store the weights of sparse models such as DLRM.+  """++  def __init__(self,+               key_dtype=dtypes.int64,+               value_dtype=dtypes.float32,+               dim=1,+               devices=["/CPU:0", ],+               partitioner=default_partition_fn,+               shared_name=None,+               name="TrainableWrapForDynamicEmbedding",+               initial_value=0.0,+               initial_mode="constant",  # use_default=False, == "constant"+               trainable=True,+               checkpoint=True):++    """Creates an empty `Variable` object.++    Creates a group of tables placed on devices,+    the type of its keys and values are specified by key_dtype+    and value_dtype, respectively.++    Args:+      key_dtype: the type of the key tensors.+      value_dtype: the type of the value tensors.+      dim: the length of the value array for each key.+      devices: the list of devices holding the tables.+        One table will be created on each device.+      partitioner: partition function of keys,+        return the partition index for each key.++      Example partition func:+      ```python+      def default_partition_fn(keys, shard_num):+        return tf.cast(keys % shard_num, dtype=tf.int32)+      ```++      shared_name: If non-empty, the SparseVariable will be shared under+        the given name across multiple sessions.+      name: A name for the operation (optional).+      initial_value: The value to use if a key is missing in the hash table.+      initial_mode: define the behavior when some keys were missing.+        'random': lookup will return the random values+          with normal distribution(mean=0.0, stddev=0.1)+        'constant': lookup will return the valuse+          filled by `initial_value`.+      trainable: True, will be treated as a trainable Variable, and add to+        to the list of variables collected in the graph under the key+        `GraphKeys.TRAINABLE_VARIABLES`.+      checkpoint: if True, the contents of the Variable are+        saved to and restored from checkpoints.+        If `shared_name` is empty for a checkpointed table,+        it is shared using the table node name.++    Returns:+      A `dynamic_embedding.Variable` object.+    """+    self.key_dtype = key_dtype+    self.value_dtype = value_dtype+    self.default_value = initial_value+    self.dim = dim+    self.devices = devices if isinstance(devices, list) else [devices, ]+    self.partition_fn = partitioner+    self.name = name+    self.shared_name = shared_name or "shared_name.{}".format(name)+    self.initial_value = initial_value+    self.initial_mode = initial_mode+    self.use_default = self.initial_mode == 'constant'+    self.trainable = trainable+    self.checkpoint = checkpoint++    self._tables = []+    self.size_ops = []+    self.shard_num = len(self.devices)++    assert initial_mode in ["random", "constant"], \+      "initial mode should be 'constant' or 'random' vs " + initial_mode++    _default_value = _convert_anything_to_init(self.initial_value, self.dim)++    for 
idx in range(len(self.devices)):+      with ops.device(self.devices[idx]):+        _mht = lookup_ops.MutableHashTable(key_dtype=self.key_dtype,+                                           value_dtype=self.value_dtype,+                                           default_value=_default_value,+                                           use_default=self.use_default,+                                           shared_name="{}-{}of{}".format(+                                             self.shared_name,+                                             idx + 1,+                                             self.shard_num),+                                           name="{}-{}of{}".format(self.name,+                                                                   idx + 1,+                                                                   self.shard_num),+                                           checkpoint=self.checkpoint)++        self._tables.append(_mht)+        self.size_ops.append(self._tables[idx].size())++  def upsert(self, keys, values, name=None):+    """Insert or Update `keys` with `values`.++    If key exists already, value will be updated.++    Args:+      keys: Keys to insert. Can be a tensor of any shape. Must match the table's+        key type.+      values: Values to be associated with keys. Must be a tensor of the same+        shape as `keys` and match the table's value type.+      name: A name for the operation (optional).++    Returns:+      The created Operation.+    """+    pass++  def remove(self, keys, name=None):+    """Removes `keys` and its associated values from the variable.++    If a key is not present in the table, it is silently ignored.++    Args:+      keys: Keys to remove. Can be a tensor of any shape. Must match the table's+        key type.+      name: A name for the operation (optional).++    Returns:+      The created Operation.+    """+    pass++  def lookup(self, keys, name=None):+    """Looks up `keys` in a Variable, outputs the corresponding values.++    The `default_value` is used for keys not present in the table.++    Args:+      keys: Keys to look up. Can be a tensor of any shape. 
Must match the+        table's key_dtype.+      name: A name for the operation (optional).++    Returns:+      A tensor containing the values in the same shape as `keys` using the+        table's value type.+    """+    pass++  def export(self, name=None):+    """Returns tensors of all keys and values in the table.++    Args:+      name: A name for the operation (optional).++    Returns:+      A pair of tensors with the first tensor containing all keys and the+        second tensors containing all values in the table.+    """+    pass++  def size(self, index=None, name=None):+    """Compute the number of elements in the index-th table of this Variable.++    If index is none, the total size of the Variable wil be return.++    Args:+      index:The index of table (optional)+      name: A name for the operation (optional).++    Returns:+      A scalar tensor containing the number of elements in this Variable.+    """+    pass+```++#### Embedding Lookup++We design two user APIs `embedding_lookup` and `embedding_lookup_sparse` to create and return a trainable wrap.++* `tf.dynamic_embedding.embedding_lookup`+* `tf.dynamic_embedding.embedding_lookup_sparse`++which are similar to `tf.nn.embedding_lookup` and `tf.nn.embedding_lookup_sparse` in funcion and input arguments.++```python+@tf_export("dynamic_embedding.embedding_lookup")+def embedding_lookup(params,+                     ids,+                     name='embedding_lookup',+                     max_norm=None):+  """Provides a dynamic version of embedding_lookup+    similar to tf.nn.embedding_lookup.++  Ids are flattened to a 1d tensor before being passed to embedding_lookup+  then, they are unflattend to match the original ids shape plus an extra+  leading dimension of the size of the embeddings.++  Args:+    params: A dynamic_embedding.Variable instance.+    ids: a tensor with any shape as same dtype of params.key_dtype.+    name: A name for the operation (optional).+    max_norm: If not `None`, each embedding is clipped if its l2-norm is larger+      than this value.+  Returns:+    A `resource_variable_ops.TrainableWrap` which hold the handls of `params` and `ids`.+  """+  assert isinstance(params, Variable), "params should be a Variable instance."+  assert params.key_dtype == ids.dtype, \+    "params.value_dtype should be same with ids.dtype: {} vs. {}". 
\+      format(params.key_dtype, ids.dtype)+  vals = None+  warp_vals = None+  with ops.name_scope(name):+    init = constant_op.constant(0.0, shape=(1,))+    warp_vals = TrainableWrap(params,+                              ids,+                              initial_value=vals,+                              dtype=params.value_dtype,+                              trainable=params.trainable)+    if max_norm is not None:+      warp_vals = _clip(params.lookup(ids), ids, max_norm)+  return warp_vals++###++@tf_export("dynamic_embedding.embedding_lookup_sparse")+def embedding_lookup_sparse(params,+                            sp_ids,+                            sp_weights,+                            name=None,+                            combiner=None,+                            max_norm=None):+  """Provides a dynamic version of embedding_lookup_sparse+    wich is similar to tf.nn.embedding_lookup_sparse in function.++  Args:    +     All args ars similar to `tf.nn.embedding_lookup_sparse`.+  Returns:+    A `resource_variable_ops.TrainableWrap` which hold the handles of `params` and `ids`.+  """+  pass+```++### Trainable Wrap++In the early scheme, the `dynamic_embedding.Variable` will directly be trained by optimizers and **we had to extand all optimizers one-by-one**.++Now we propose a new scheme which no longer require to extand all optimizers one by one: +We design a warp class `resource_variable_ops.TrainableWrap` which inherts from `resource_variable_ops.ResourceVariable`, and responsible for:++- Maintain the relationship between `params` and `ids` created by `embedding_lookup{, _sparse}`.+- Look up the newest values from `dynamic_embedding.Variable` and update them to memory held by `TrainableWrap->tensor()` before each iteration start.+++#### python/ops/resource_variable_ops.py:++```python++class TrainableWrap(ResourceVariable):

My understanding is that TrainableWrap is a container that holds (dynamic_embedding, ids) but pretends to be a tf.Variable. I am wondering whether it has to be a ResourceVariable?
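
To check my understanding, I am imagining something like the sketch below, where the pair is just a plain container that only materializes a tensor on demand (the names here are made up, and this deliberately ignores the optimizer-reuse part, which I understand is the motivation for subclassing):

```python
import tensorflow as tf


class LookupResult(object):
  """Plain container for a dynamic_embedding.Variable and the ids to look up."""

  def __init__(self, params, ids):
    self.params = params
    self.ids = ids

  def read_value(self):
    # Pull the current values for `ids` out of the hash tables.
    return self.params.lookup(self.ids)


def _to_tensor(value, dtype=None, name=None, as_ref=False):
  """Allows a LookupResult to be used directly in the forward graph."""
  del as_ref  # Not a reference type.
  tensor = value.read_value()
  if dtype is not None and not dtype.is_compatible_with(tensor.dtype):
    raise ValueError("Incompatible dtype: %s vs %s" % (dtype, tensor.dtype))
  return tf.identity(tensor, name=name)


tf.register_tensor_conversion_function(LookupResult, _to_tensor)
```

Not a counter-proposal, just trying to confirm that subclassing ResourceVariable is mainly about getting the existing optimizers for free.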

rhdong

comment created time in a month

Pull request review comment tensorflow/community

RFC: Sparse Domain Isolation for Supporting large-scale Sparse Weights Training.

+# Sparse Domain Isolation for supporting large-scale Recommender Systems.++| Status        | Proposed                                             |+:-------------- |:---------------------------------------------------- |+| **Author(s)** | Haidong Rong (hudsonrong@tencent.com) Yafei Zhang(kimmyzhang@tencent.com) Jiandong Wang(adnywang@tencent.com) |+| **Sponsor**   | Yuefeng Zhou (yuefengz@google.com)                   |+| **Updated**   | 2020-05-23                                           |++## Background++In recent years, many deep learning components (such as FC, RNN, etc.) and online-learning policy were introduced to the recommender systems that result in the special framework for recommender system (such as DiFacto, etc.) gradually unable to meet the demand. More and more algorithm engineers turn to use open-source general frameworks(such as Tensorflow, etc.)  to improve production efficiency.+In the industrial recommendation scenarios, the parameters are high-dimensional sparse with the following characteristics:+1. The theoretical total number of dynamic features is 2^64, the size of sparse weights is more than 100GB or even several TB which make reserving memory at the beginning too expensive.+2. The online-learning tracks user interests in real time and requires eliminating and adding weights on running-time without frequently restart training.++![Models Weights Compare](20200424-sparse-domain-isolation/models-weights-compare.png)++## Objective++TensorFlow supports large-scale dynamic sparse weights training.++## Problem & Motivation++TensorFlow is well designed, but not support creating and training the large-scale dynamic sparse weights. The main obstacles are:+1. The `tf.Variable` is fixed size on running time and saved in continuous memory and unable to add or remove weights dynamically.+2. The typical training mode of full updating in TF is very inefficient when training xTB models.+3. Usually, the sparse weights are saved in hash table form (K-V pairs for sparse feature ID and sparse weights) but the optimizers of TF cann't train directly sparse weights and accumulators(slots) saved in hash tables.++For the above reasons, the official version of TensorFlow can only be used for offline model hyperparameter tuning with fixed size features, and the model can not exceed 100GB generally. The application in the industrial scene is very limited. At present, We customized TensorFlow and successfully applied it to large-scale sparse features training of industrial recommendation scene. We hope to contribute this work of general components to the community.++## Design Proposal++### The ideal semantics:++```python+import tensorflow.dynamic_embedding as dye++# sparse weights defination:+w = dye.get_variable(name="dynamic_embeddings",+                     devices=["/job:ps/replica:0/task:0/CPU:0",],+                     dim=8)+z = dye.embedding_lookup(params=w,+                         ids=x, # x is got from samples.+                         name="wide-sparse-weights")+                         # z should be a trainable and variable-size instance.+# forward graph++pred = tf.XXX(...)++# backward graph+loss = tf.XXX(pred, label, ...)+opt = tf.train.AdamOptimizer(learning_rate=0.1)+update_op = opt.minimize(loss)+```++## Sparse Domain Isolation++### Overview of Design

What does the gradient look like? Do you need to define a custom gradient function for embedding_lookup?
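
To make the first question concrete: with the static tf.nn.embedding_lookup, the gradient with respect to the parameters arrives as tf.IndexedSlices, e.g. (TF2 eager, purely for illustration):

```python
import tensorflow as tf

params = tf.Variable(tf.random.normal([10, 4]))
ids = tf.constant([1, 3, 3])

with tf.GradientTape() as tape:
  emb = tf.nn.embedding_lookup(params, ids)
  loss = tf.reduce_sum(emb)

grad = tape.gradient(loss, params)
print(type(grad))    # an IndexedSlices instance
print(grad.indices)  # the looked-up ids: [1 3 3]
```

Is the idea that, because the lookup result is wrapped in a dense ResourceVariable, the gradient with respect to the wrap is an ordinary dense tensor, so no gradient needs to be registered for the hash-table lookup itself?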

rhdong

comment created time in a month

Pull request review comment tensorflow/community

RFC: Sparse Domain Isolation for Supporting large-scale Sparse Weights Training.

+# Sparse Domain Isolation for supporting large-scale Recommender Systems.++| Status        | Proposed                                             |+:-------------- |:---------------------------------------------------- |+| **Author(s)** | Haidong Rong (hudsonrong@tencent.com) Yafei Zhang(kimmyzhang@tencent.com) Jiandong Wang(adnywang@tencent.com) |+| **Sponsor**   | Yuefeng Zhou (yuefengz@google.com)                   |+| **Updated**   | 2020-05-23                                           |++## Background++In recent years, many deep learning components (such as FC, RNN, etc.) and online-learning policy were introduced to the recommender systems that result in the special framework for recommender system (such as DiFacto, etc.) gradually unable to meet the demand. More and more algorithm engineers turn to use open-source general frameworks(such as Tensorflow, etc.)  to improve production efficiency.+In the industrial recommendation scenarios, the parameters are high-dimensional sparse with the following characteristics:+1. The theoretical total number of dynamic features is 2^64, the size of sparse weights is more than 100GB or even several TB which make reserving memory at the beginning too expensive.+2. The online-learning tracks user interests in real time and requires eliminating and adding weights on running-time without frequently restart training.++![Models Weights Compare](20200424-sparse-domain-isolation/models-weights-compare.png)++## Objective++TensorFlow supports large-scale dynamic sparse weights training.++## Problem & Motivation++TensorFlow is well designed, but not support creating and training the large-scale dynamic sparse weights. The main obstacles are:+1. The `tf.Variable` is fixed size on running time and saved in continuous memory and unable to add or remove weights dynamically.+2. The typical training mode of full updating in TF is very inefficient when training xTB models.+3. Usually, the sparse weights are saved in hash table form (K-V pairs for sparse feature ID and sparse weights) but the optimizers of TF cann't train directly sparse weights and accumulators(slots) saved in hash tables.++For the above reasons, the official version of TensorFlow can only be used for offline model hyperparameter tuning with fixed size features, and the model can not exceed 100GB generally. The application in the industrial scene is very limited. At present, We customized TensorFlow and successfully applied it to large-scale sparse features training of industrial recommendation scene. 
We hope to contribute this work of general components to the community.++## Design Proposal++### The ideal semantics:++```python+import tensorflow.dynamic_embedding as dye++# sparse weights defination:+w = dye.get_variable(name="dynamic_embeddings",+                     devices=["/job:ps/replica:0/task:0/CPU:0",],+                     dim=8)+z = dye.embedding_lookup(params=w,+                         ids=x, # x is got from samples.+                         name="wide-sparse-weights")+                         # z should be a trainable and variable-size instance.+# forward graph++pred = tf.XXX(...)++# backward graph+loss = tf.XXX(pred, label, ...)+opt = tf.train.AdamOptimizer(learning_rate=0.1)+update_op = opt.minimize(loss)+```++## Sparse Domain Isolation++### Overview of Design+![Overview Flow Chart](20200424-sparse-domain-isolation/overview-flow-chart.png)++### Design Considerations++![Architecture of SDI](20200424-sparse-domain-isolation/architecture.png)+*   Minimize changing on Tensorflow core and try to use native components.++![Expression Weights Hierarchically](20200424-sparse-domain-isolation/expression-weights-hierarchically.png)+*   The `tf.Tensor` is still suitable for holding iteration weights, because only a small part of weights will be updated in each iterations.++#### Hash Table++Refer to `MutableHashTable` implemented in `tf.lookup`. The reason is as follows:++*   The `MutableHashTable` meets the basic requirements of sparse weights.+*   The `tf.lookup` is well designed and easy to maintain.+*   The save/restore from/to checkpoint is already supported.+*   Compatible with all kind of **distributed strategy**(including parameters server.)+++### Detailed Design and Implementation++The trainable warp class `resource_variable_ops.TrainableWrap` inherted from `ResourceVariable` will be introduced, and help us reuse all optimizers.++![detail of SDI](20200424-sparse-domain-isolation/optimizers-reuse-scheme.png)++### APIs Overview++* Name Space(**To Be Discussed**)+    *  USER API: `tf.dynamic_embedding`+    *  Tensorflow Core: `from tensorflow.xx.xx import dynamic_embedding_ops`  +    +    +* Creating Weights+    * `tf.dynamic_embedding.Variable`+    * `tf.dynamic_embedding.get_variable`++* Trainable Wrap+    * `tf.resource_variable_ops.TrainableWrap`+    * `tf.dynamic_embedding.embedding_lookup`+    * `tf.dynamic_embedding.embedding_lookup_sparse`++#### Name Space++All of the APIs are implemented in `tf.dynamic_embedding` package.+Maybe the `tf.nn` or `tf.keras` would be better choices.++```python+import tensorflow as tf # core version+import tensorflow_recsys as tfrs # SIG version++weights = tf.dynamic_embedding.get_variable(*args, **kwargs)+```++#### Creating Sparse Weights++We design the `dynamic_embedding.Variable` based on the `tf.lookup.MutableHashTable` and make it support distributed scenario, there is no need to develop any new operators for the hash table. 
+The `dynamic_embedding.Variable` backed up by a group of hashtables is responsible for actually holding the memory resource of sparse weights.++* `tf.dynamic_embedding.Variable`+    * `export`+    * `remove`+    * `upsert` (update if exist else insert) +    * `size`+* `tf.dynamic_embedding.get_variable`++```python+@tf_export("dynamic_embedding.Variable")+class Variable(object):+  """+  A Distributed version of HashTable(reference from lookup_ops.MutableHashTable).+  It is designed to dynamically store the weights of sparse models such as DLRM.+  """++  def __init__(self,+               key_dtype=dtypes.int64,+               value_dtype=dtypes.float32,+               dim=1,+               devices=["/CPU:0", ],+               partitioner=default_partition_fn,+               shared_name=None,+               name="TrainableWrapForDynamicEmbedding",+               initial_value=0.0,+               initial_mode="constant",  # use_default=False, == "constant"+               trainable=True,+               checkpoint=True):++    """Creates an empty `Variable` object.++    Creates a group of tables placed on devices,+    the type of its keys and values are specified by key_dtype+    and value_dtype, respectively.++    Args:+      key_dtype: the type of the key tensors.+      value_dtype: the type of the value tensors.+      dim: the length of the value array for each key.+      devices: the list of devices holding the tables.+        One table will be created on each device.+      partitioner: partition function of keys,+        return the partition index for each key.++      Example partition func:+      ```python+      def default_partition_fn(keys, shard_num):+        return tf.cast(keys % shard_num, dtype=tf.int32)+      ```++      shared_name: If non-empty, the SparseVariable will be shared under+        the given name across multiple sessions.+      name: A name for the operation (optional).+      initial_value: The value to use if a key is missing in the hash table.+      initial_mode: define the behavior when some keys were missing.+        'random': lookup will return the random values+          with normal distribution(mean=0.0, stddev=0.1)+        'constant': lookup will return the valuse+          filled by `initial_value`.+      trainable: True, will be treated as a trainable Variable, and add to+        to the list of variables collected in the graph under the key+        `GraphKeys.TRAINABLE_VARIABLES`.+      checkpoint: if True, the contents of the Variable are+        saved to and restored from checkpoints.+        If `shared_name` is empty for a checkpointed table,+        it is shared using the table node name.++    Returns:+      A `dynamic_embedding.Variable` object.+    """+    self.key_dtype = key_dtype+    self.value_dtype = value_dtype+    self.default_value = initial_value+    self.dim = dim+    self.devices = devices if isinstance(devices, list) else [devices, ]+    self.partition_fn = partitioner+    self.name = name+    self.shared_name = shared_name or "shared_name.{}".format(name)+    self.initial_value = initial_value+    self.initial_mode = initial_mode+    self.use_default = self.initial_mode == 'constant'+    self.trainable = trainable+    self.checkpoint = checkpoint++    self._tables = []+    self.size_ops = []+    self.shard_num = len(self.devices)++    assert initial_mode in ["random", "constant"], \+      "initial mode should be 'constant' or 'random' vs " + initial_mode++    _default_value = _convert_anything_to_init(self.initial_value, self.dim)++    for 
idx in range(len(self.devices)):+      with ops.device(self.devices[idx]):+        _mht = lookup_ops.MutableHashTable(key_dtype=self.key_dtype,+                                           value_dtype=self.value_dtype,+                                           default_value=_default_value,+                                           use_default=self.use_default,+                                           shared_name="{}-{}of{}".format(+                                             self.shared_name,+                                             idx + 1,+                                             self.shard_num),+                                           name="{}-{}of{}".format(self.name,+                                                                   idx + 1,+                                                                   self.shard_num),+                                           checkpoint=self.checkpoint)++        self._tables.append(_mht)+        self.size_ops.append(self._tables[idx].size())++  def upsert(self, keys, values, name=None):+    """Insert or Update `keys` with `values`.++    If key exists already, value will be updated.++    Args:+      keys: Keys to insert. Can be a tensor of any shape. Must match the table's+        key type.+      values: Values to be associated with keys. Must be a tensor of the same+        shape as `keys` and match the table's value type.+      name: A name for the operation (optional).++    Returns:+      The created Operation.+    """+    pass++  def remove(self, keys, name=None):

Could you add some details about when remove would be triggered?
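
For example, is the intended usage something like staleness-based eviction in online learning? A rough sketch of what I have in mind, using only the proposed Variable API (the second table that tracks last-update steps, and all the names, are my own invention):

```python
import tensorflow as tf


def evict_stale_keys(weights, last_update_step, global_step, max_age):
  """Removes keys whose last update is more than `max_age` steps old.

  Both `weights` and `last_update_step` are assumed to be
  dynamic_embedding.Variable instances; `last_update_step` stores, with
  dim=1, the step at which each key was last updated.
  """
  keys, steps = last_update_step.export()
  age = tf.cast(global_step, steps.dtype) - tf.reshape(steps, [-1])
  stale_keys = tf.boolean_mask(keys, age > tf.cast(max_age, age.dtype))
  return tf.group(weights.remove(stale_keys),
                  last_update_step.remove(stale_keys))
```

If eviction like this is the plan, it would also be good to spell out how it interacts with the slot tables created by the optimizers.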

rhdong

comment created time in a month

Pull request review comment tensorflow/community

RFC: Sparse Domain Isolation for Supporting large-scale Sparse Weights Training.

+# Sparse Domain Isolation for supporting large-scale Recommender Systems.++| Status        | Proposed                                             |+:-------------- |:---------------------------------------------------- |+| **Author(s)** | Haidong Rong (hudsonrong@tencent.com) Yafei Zhang(kimmyzhang@tencent.com) Jiandong Wang(adnywang@tencent.com) |+| **Sponsor**   | Yuefeng Zhou (yuefengz@google.com)                   |+| **Updated**   | 2020-05-23                                           |++## Background++In recent years, many deep learning components (such as FC, RNN, etc.) and online-learning policy were introduced to the recommender systems that result in the special framework for recommender system (such as DiFacto, etc.) gradually unable to meet the demand. More and more algorithm engineers turn to use open-source general frameworks(such as Tensorflow, etc.)  to improve production efficiency.+In the industrial recommendation scenarios, the parameters are high-dimensional sparse with the following characteristics:+1. The theoretical total number of dynamic features is 2^64, the size of sparse weights is more than 100GB or even several TB which make reserving memory at the beginning too expensive.+2. The online-learning tracks user interests in real time and requires eliminating and adding weights on running-time without frequently restart training.++![Models Weights Compare](20200424-sparse-domain-isolation/models-weights-compare.png)++## Objective++TensorFlow supports large-scale dynamic sparse weights training.++## Problem & Motivation++TensorFlow is well designed, but not support creating and training the large-scale dynamic sparse weights. The main obstacles are:+1. The `tf.Variable` is fixed size on running time and saved in continuous memory and unable to add or remove weights dynamically.+2. The typical training mode of full updating in TF is very inefficient when training xTB models.+3. Usually, the sparse weights are saved in hash table form (K-V pairs for sparse feature ID and sparse weights) but the optimizers of TF cann't train directly sparse weights and accumulators(slots) saved in hash tables.++For the above reasons, the official version of TensorFlow can only be used for offline model hyperparameter tuning with fixed size features, and the model can not exceed 100GB generally. The application in the industrial scene is very limited. At present, We customized TensorFlow and successfully applied it to large-scale sparse features training of industrial recommendation scene. 
We hope to contribute this work of general components to the community.++## Design Proposal++### The ideal semantics:++```python+import tensorflow.dynamic_embedding as dye++# sparse weights defination:+w = dye.get_variable(name="dynamic_embeddings",+                     devices=["/job:ps/replica:0/task:0/CPU:0",],+                     dim=8)+z = dye.embedding_lookup(params=w,+                         ids=x, # x is got from samples.+                         name="wide-sparse-weights")+                         # z should be a trainable and variable-size instance.+# forward graph++pred = tf.XXX(...)++# backward graph+loss = tf.XXX(pred, label, ...)+opt = tf.train.AdamOptimizer(learning_rate=0.1)+update_op = opt.minimize(loss)+```++## Sparse Domain Isolation++### Overview of Design+![Overview Flow Chart](20200424-sparse-domain-isolation/overview-flow-chart.png)++### Design Considerations++![Architecture of SDI](20200424-sparse-domain-isolation/architecture.png)+*   Minimize changing on Tensorflow core and try to use native components.++![Expression Weights Hierarchically](20200424-sparse-domain-isolation/expression-weights-hierarchically.png)+*   The `tf.Tensor` is still suitable for holding iteration weights, because only a small part of weights will be updated in each iterations.++#### Hash Table++Refer to `MutableHashTable` implemented in `tf.lookup`. The reason is as follows:++*   The `MutableHashTable` meets the basic requirements of sparse weights.+*   The `tf.lookup` is well designed and easy to maintain.+*   The save/restore from/to checkpoint is already supported.+*   Compatible with all kind of **distributed strategy**(including parameters server.)+++### Detailed Design and Implementation++The trainable warp class `resource_variable_ops.TrainableWrap` inherted from `ResourceVariable` will be introduced, and help us reuse all optimizers.++![detail of SDI](20200424-sparse-domain-isolation/optimizers-reuse-scheme.png)++### APIs Overview++* Name Space(**To Be Discussed**)+    *  USER API: `tf.dynamic_embedding`+    *  Tensorflow Core: `from tensorflow.xx.xx import dynamic_embedding_ops`  +    +    +* Creating Weights+    * `tf.dynamic_embedding.Variable`+    * `tf.dynamic_embedding.get_variable`++* Trainable Wrap+    * `tf.resource_variable_ops.TrainableWrap`+    * `tf.dynamic_embedding.embedding_lookup`+    * `tf.dynamic_embedding.embedding_lookup_sparse`++#### Name Space++All of the APIs are implemented in `tf.dynamic_embedding` package.+Maybe the `tf.nn` or `tf.keras` would be better choices.++```python+import tensorflow as tf # core version+import tensorflow_recsys as tfrs # SIG version++weights = tf.dynamic_embedding.get_variable(*args, **kwargs)+```++#### Creating Sparse Weights++We design the `dynamic_embedding.Variable` based on the `tf.lookup.MutableHashTable` and make it support distributed scenario, there is no need to develop any new operators for the hash table. 
+The `dynamic_embedding.Variable` backed up by a group of hashtables is responsible for actually holding the memory resource of sparse weights.++* `tf.dynamic_embedding.Variable`+    * `export`+    * `remove`+    * `upsert` (update if exist else insert) +    * `size`+* `tf.dynamic_embedding.get_variable`++```python+@tf_export("dynamic_embedding.Variable")+class Variable(object):+  """+  A Distributed version of HashTable(reference from lookup_ops.MutableHashTable).+  It is designed to dynamically store the weights of sparse models such as DLRM.+  """++  def __init__(self,+               key_dtype=dtypes.int64,+               value_dtype=dtypes.float32,+               dim=1,+               devices=["/CPU:0", ],+               partitioner=default_partition_fn,+               shared_name=None,+               name="TrainableWrapForDynamicEmbedding",+               initial_value=0.0,+               initial_mode="constant",  # use_default=False, == "constant"+               trainable=True,+               checkpoint=True):++    """Creates an empty `Variable` object.++    Creates a group of tables placed on devices,+    the type of its keys and values are specified by key_dtype+    and value_dtype, respectively.++    Args:+      key_dtype: the type of the key tensors.+      value_dtype: the type of the value tensors.+      dim: the length of the value array for each key.+      devices: the list of devices holding the tables.+        One table will be created on each device.+      partitioner: partition function of keys,+        return the partition index for each key.++      Example partition func:+      ```python+      def default_partition_fn(keys, shard_num):+        return tf.cast(keys % shard_num, dtype=tf.int32)+      ```++      shared_name: If non-empty, the SparseVariable will be shared under+        the given name across multiple sessions.+      name: A name for the operation (optional).+      initial_value: The value to use if a key is missing in the hash table.+      initial_mode: define the behavior when some keys were missing.+        'random': lookup will return the random values+          with normal distribution(mean=0.0, stddev=0.1)+        'constant': lookup will return the valuse+          filled by `initial_value`.+      trainable: True, will be treated as a trainable Variable, and add to+        to the list of variables collected in the graph under the key+        `GraphKeys.TRAINABLE_VARIABLES`.+      checkpoint: if True, the contents of the Variable are+        saved to and restored from checkpoints.+        If `shared_name` is empty for a checkpointed table,+        it is shared using the table node name.++    Returns:+      A `dynamic_embedding.Variable` object.+    """+    self.key_dtype = key_dtype+    self.value_dtype = value_dtype+    self.default_value = initial_value+    self.dim = dim+    self.devices = devices if isinstance(devices, list) else [devices, ]+    self.partition_fn = partitioner+    self.name = name+    self.shared_name = shared_name or "shared_name.{}".format(name)+    self.initial_value = initial_value+    self.initial_mode = initial_mode+    self.use_default = self.initial_mode == 'constant'+    self.trainable = trainable+    self.checkpoint = checkpoint++    self._tables = []+    self.size_ops = []+    self.shard_num = len(self.devices)++    assert initial_mode in ["random", "constant"], \+      "initial mode should be 'constant' or 'random' vs " + initial_mode++    _default_value = _convert_anything_to_init(self.initial_value, self.dim)++    for 
idx in range(len(self.devices)):+      with ops.device(self.devices[idx]):+        _mht = lookup_ops.MutableHashTable(key_dtype=self.key_dtype,+                                           value_dtype=self.value_dtype,+                                           default_value=_default_value,+                                           use_default=self.use_default,+                                           shared_name="{}-{}of{}".format(+                                             self.shared_name,+                                             idx + 1,+                                             self.shard_num),+                                           name="{}-{}of{}".format(self.name,+                                                                   idx + 1,+                                                                   self.shard_num),+                                           checkpoint=self.checkpoint)++        self._tables.append(_mht)+        self.size_ops.append(self._tables[idx].size())++  def upsert(self, keys, values, name=None):+    """Insert or Update `keys` with `values`.++    If key exists already, value will be updated.++    Args:+      keys: Keys to insert. Can be a tensor of any shape. Must match the table's+        key type.+      values: Values to be associated with keys. Must be a tensor of the same+        shape as `keys` and match the table's value type.+      name: A name for the operation (optional).++    Returns:+      The created Operation.+    """+    pass++  def remove(self, keys, name=None):+    """Removes `keys` and its associated values from the variable.++    If a key is not present in the table, it is silently ignored.++    Args:+      keys: Keys to remove. Can be a tensor of any shape. Must match the table's+        key type.+      name: A name for the operation (optional).++    Returns:+      The created Operation.+    """+    pass++  def lookup(self, keys, name=None):+    """Looks up `keys` in a Variable, outputs the corresponding values.++    The `default_value` is used for keys not present in the table.++    Args:+      keys: Keys to look up. Can be a tensor of any shape. 
Must match the+        table's key_dtype.+      name: A name for the operation (optional).++    Returns:+      A tensor containing the values in the same shape as `keys` using the+        table's value type.+    """+    pass++  def export(self, name=None):+    """Returns tensors of all keys and values in the table.++    Args:+      name: A name for the operation (optional).++    Returns:+      A pair of tensors with the first tensor containing all keys and the+        second tensors containing all values in the table.+    """+    pass++  def size(self, index=None, name=None):+    """Compute the number of elements in the index-th table of this Variable.++    If index is none, the total size of the Variable wil be return.++    Args:+      index:The index of table (optional)+      name: A name for the operation (optional).++    Returns:+      A scalar tensor containing the number of elements in this Variable.+    """+    pass+```++#### Embedding Lookup++We design two user APIs `embedding_lookup` and `embedding_lookup_sparse` to create and return a trainable wrap.++* `tf.dynamic_embedding.embedding_lookup`+* `tf.dynamic_embedding.embedding_lookup_sparse`++which are similar to `tf.nn.embedding_lookup` and `tf.nn.embedding_lookup_sparse` in funcion and input arguments.++```python+@tf_export("dynamic_embedding.embedding_lookup")+def embedding_lookup(params,+                     ids,+                     name='embedding_lookup',+                     max_norm=None):+  """Provides a dynamic version of embedding_lookup+    similar to tf.nn.embedding_lookup.++  Ids are flattened to a 1d tensor before being passed to embedding_lookup+  then, they are unflattend to match the original ids shape plus an extra+  leading dimension of the size of the embeddings.++  Args:+    params: A dynamic_embedding.Variable instance.+    ids: a tensor with any shape as same dtype of params.key_dtype.+    name: A name for the operation (optional).+    max_norm: If not `None`, each embedding is clipped if its l2-norm is larger+      than this value.+  Returns:+    A `resource_variable_ops.TrainableWrap` which hold the handls of `params` and `ids`.+  """+  assert isinstance(params, Variable), "params should be a Variable instance."+  assert params.key_dtype == ids.dtype, \+    "params.value_dtype should be same with ids.dtype: {} vs. {}". 
\+      format(params.key_dtype, ids.dtype)+  vals = None+  warp_vals = None+  with ops.name_scope(name):+    init = constant_op.constant(0.0, shape=(1,))+    warp_vals = TrainableWrap(params,+                              ids,+                              initial_value=vals,+                              dtype=params.value_dtype,+                              trainable=params.trainable)+    if max_norm is not None:+      warp_vals = _clip(params.lookup(ids), ids, max_norm)+  return warp_vals++###++@tf_export("dynamic_embedding.embedding_lookup_sparse")+def embedding_lookup_sparse(params,+                            sp_ids,+                            sp_weights,+                            name=None,+                            combiner=None,+                            max_norm=None):+  """Provides a dynamic version of embedding_lookup_sparse+    wich is similar to tf.nn.embedding_lookup_sparse in function.++  Args:    +     All args ars similar to `tf.nn.embedding_lookup_sparse`.+  Returns:+    A `resource_variable_ops.TrainableWrap` which hold the handles of `params` and `ids`.+  """+  pass+```++### Trainable Wrap++In the early scheme, the `dynamic_embedding.Variable` will directly be trained by optimizers and **we had to extand all optimizers one-by-one**.++Now we propose a new scheme which no longer require to extand all optimizers one by one: +We design a warp class `resource_variable_ops.TrainableWrap` which inherts from `resource_variable_ops.ResourceVariable`, and responsible for:++- Maintain the relationship between `params` and `ids` created by `embedding_lookup{, _sparse}`.+- Look up the newest values from `dynamic_embedding.Variable` and update them to memory held by `TrainableWrap->tensor()` before each iteration start.+++#### python/ops/resource_variable_ops.py:++```python++class TrainableWrap(ResourceVariable):+  """+  This class is a trainable warp of dynamic embedding,+  and the key role is recording the map relationship of params and ids.+  inheriting from the ResourceVariable make it trainable.+  """+  def __init__(self, params, ids, *args, **kwargs):+    """Creates an empty `TrainableWrap` object.++    Creates a group of tables placed on devices,+    the type of its keys and values are specified by key_dtype+    and value_dtype, respectively.++    Args:+      params: A dynamic_embedding.Variable instance.+      ids: a tensor with any shape as same dtype of params.key_dtype.+      ```other parameters is similar to tf.nn.embedding_lookup.```+    Returns:+      A `TrainableWrap` object which is a subclass of ResourceVariable.+    """+    self.params = params+    self.ids = ids+    self.prefetch_values = self.params.lookup(self.ids)++    super(TrainableWrap, self).__init__(*args, **kwargs)++  """A python variable from an existing handle."""+  def _read_variable_op(self):++    # The `assign_variable_op` will be called to get before read opertion.+    # `prefetch_values` is a lookup-like operation of `TrainableWrap`.++    if hasattr(self, "prefetch_values"):+      with ops.control_dependencies([+          gen_resource_variable_ops.assign_variable_op(+            self._handle, self.prefetch_values,  +          name="AssignBeforeReadVariable")]):+        result = gen_resource_variable_ops.read_variable_op(self._handle,+                                                            self._dtype)+    else:+      result = gen_resource_variable_ops.read_variable_op(self._handle,+                                                          self._dtype)+    
_maybe_set_handle_data(self._dtype, self._handle, result)+++  def update_op(self):+    return self.params.upsert(self.ids, self._read_variable_op())+```++The scheme will **reuse** the `Optimizer._resource_apply_{dense,sparse_duplicate_indices}}` API to train `resource_variable_ops.TrainableWrap` naturally after changing a little common code in `training/optimizer.py`:++- Create the slots of `TrainableWrap` when primary `Variable` is an instance of `TrainableWrap`.+- Write back the trained values of primary and slots to hash tables(`dynamic_embedding.Variable`) when `Optimizer._resource_apply_{dense,sparse_duplicate_indices}` finished.++#### The pseudo-code is below: ++```python+@tf_export("dynamic_embedding.create_slots")+def create_slots(primary,+                 init,+                 slot_name,+                 op_name):+  """Helper function for creating a slot variable for statefull optimizers."""+  _params_var, _params_ids = _IDS_OF_RESOURCEVARIABLE_MAPPER_.get(primary)+  with variable_scope.variable_scope(None, _params_var.name + "/" + op_name):+    _slot_variable = Variable(key_dtype=_params_var.key_dtype,+                              value_dtype=_params_var.value_dtype,+                              dim=_params_var.dim,+                              devices=_params_var.devices,+                              partitioner=_params_var.partition_fn,+                              shared_name="shared-" + slot_name,+                              name=slot_name,+                              initial_value=init,+                              initial_mode="constant",+                              trainable=False,+                              checkpoint=_params_var.checkpoint)+    slot = None+    slot = embedding_lookup(params=_slot_variable,+                            ids=_params_ids)+    if _SLOT_MAP_.get(primary) is None:+      _SLOT_MAP_[primary] = [slot]+    else:+      _SLOT_MAP_[primary].append(slot)++  return slot+++@tf_export("dynamic_embedding.get_slots")+def get_slots(primary):+  return _SLOT_MAP_.get(primary) or []++```+++```python++class _DenseDynamicEmbeddingTrainableProcessor(_OptimizableVariable):

Does this class need to be different for different optimizers?
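
Asking because, from the pseudo-code, it looks like one processor could serve every optimizer by delegating to the existing _resource_apply_dense and then writing back, roughly like this (the body is my guess, not taken from the proposal):

```python
import tensorflow as tf


class _DenseDynamicEmbeddingTrainableProcessor(object):
  """A single processor shared by all optimizers (sketch)."""

  def __init__(self, v):
    self._v = v  # a TrainableWrap

  def update_op(self, optimizer, grad):
    # Reuse the optimizer-specific dense apply, then push the trained
    # values back into the hash tables (slot write-back omitted here).
    apply_op = optimizer._resource_apply_dense(grad, self._v)
    with tf.control_dependencies([apply_op]):
      return self._v.update_op()
```

If that is the case, it might be worth stating explicitly that no per-optimizer changes are required.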

rhdong

comment created time in a month

Pull request review comment tensorflow/community

RFC: Sparse Domain Isolation for Supporting large-scale Sparse Weights Training.

+# Sparse Domain Isolation for supporting large-scale Recommender Systems.++| Status        | Proposed                                             |+:-------------- |:---------------------------------------------------- |+| **Author(s)** | Haidong Rong (hudsonrong@tencent.com) Yafei Zhang(kimmyzhang@tencent.com) Jiandong Wang(adnywang@tencent.com) |+| **Sponsor**   | Yuefeng Zhou (yuefengz@google.com)                   |+| **Updated**   | 2020-05-23                                           |++## Background++In recent years, many deep learning components (such as FC, RNN, etc.) and online-learning policy were introduced to the recommender systems that result in the special framework for recommender system (such as DiFacto, etc.) gradually unable to meet the demand. More and more algorithm engineers turn to use open-source general frameworks(such as Tensorflow, etc.)  to improve production efficiency.+In the industrial recommendation scenarios, the parameters are high-dimensional sparse with the following characteristics:+1. The theoretical total number of dynamic features is 2^64, the size of sparse weights is more than 100GB or even several TB which make reserving memory at the beginning too expensive.+2. The online-learning tracks user interests in real time and requires eliminating and adding weights on running-time without frequently restart training.++![Models Weights Compare](20200424-sparse-domain-isolation/models-weights-compare.png)++## Objective++TensorFlow supports large-scale dynamic sparse weights training.++## Problem & Motivation++TensorFlow is well designed, but not support creating and training the large-scale dynamic sparse weights. The main obstacles are:+1. The `tf.Variable` is fixed size on running time and saved in continuous memory and unable to add or remove weights dynamically.+2. The typical training mode of full updating in TF is very inefficient when training xTB models.+3. Usually, the sparse weights are saved in hash table form (K-V pairs for sparse feature ID and sparse weights) but the optimizers of TF cann't train directly sparse weights and accumulators(slots) saved in hash tables.++For the above reasons, the official version of TensorFlow can only be used for offline model hyperparameter tuning with fixed size features, and the model can not exceed 100GB generally. The application in the industrial scene is very limited. At present, We customized TensorFlow and successfully applied it to large-scale sparse features training of industrial recommendation scene. 
We hope to contribute this work of general components to the community.++## Design Proposal++### The ideal semantics:++```python+import tensorflow.dynamic_embedding as dye++# sparse weights defination:+w = dye.get_variable(name="dynamic_embeddings",+                     devices=["/job:ps/replica:0/task:0/CPU:0",],+                     dim=8)+z = dye.embedding_lookup(params=w,+                         ids=x, # x is got from samples.+                         name="wide-sparse-weights")+                         # z should be a trainable and variable-size instance.+# forward graph++pred = tf.XXX(...)++# backward graph+loss = tf.XXX(pred, label, ...)+opt = tf.train.AdamOptimizer(learning_rate=0.1)+update_op = opt.minimize(loss)+```++## Sparse Domain Isolation++### Overview of Design+![Overview Flow Chart](20200424-sparse-domain-isolation/overview-flow-chart.png)++### Design Considerations++![Architecture of SDI](20200424-sparse-domain-isolation/architecture.png)+*   Minimize changing on Tensorflow core and try to use native components.++![Expression Weights Hierarchically](20200424-sparse-domain-isolation/expression-weights-hierarchically.png)+*   The `tf.Tensor` is still suitable for holding iteration weights, because only a small part of weights will be updated in each iterations.++#### Hash Table++Refer to `MutableHashTable` implemented in `tf.lookup`. The reason is as follows:++*   The `MutableHashTable` meets the basic requirements of sparse weights.+*   The `tf.lookup` is well designed and easy to maintain.+*   The save/restore from/to checkpoint is already supported.+*   Compatible with all kind of **distributed strategy**(including parameters server.)+++### Detailed Design and Implementation++The trainable warp class `resource_variable_ops.TrainableWrap` inherted from `ResourceVariable` will be introduced, and help us reuse all optimizers.++![detail of SDI](20200424-sparse-domain-isolation/optimizers-reuse-scheme.png)++### APIs Overview++* Name Space(**To Be Discussed**)+    *  USER API: `tf.dynamic_embedding`+    *  Tensorflow Core: `from tensorflow.xx.xx import dynamic_embedding_ops`  +    +    +* Creating Weights+    * `tf.dynamic_embedding.Variable`+    * `tf.dynamic_embedding.get_variable`++* Trainable Wrap+    * `tf.resource_variable_ops.TrainableWrap`+    * `tf.dynamic_embedding.embedding_lookup`+    * `tf.dynamic_embedding.embedding_lookup_sparse`++#### Name Space++All of the APIs are implemented in `tf.dynamic_embedding` package.+Maybe the `tf.nn` or `tf.keras` would be better choices.++```python+import tensorflow as tf # core version+import tensorflow_recsys as tfrs # SIG version++weights = tf.dynamic_embedding.get_variable(*args, **kwargs)+```++#### Creating Sparse Weights++We design the `dynamic_embedding.Variable` based on the `tf.lookup.MutableHashTable` and make it support distributed scenario, there is no need to develop any new operators for the hash table. +The `dynamic_embedding.Variable` backed up by a group of hashtables is responsible for actually holding the memory resource of sparse weights.++* `tf.dynamic_embedding.Variable`+    * `export`+    * `remove`+    * `upsert` (update if exist else insert) +    * `size`+* `tf.dynamic_embedding.get_variable`++```python+@tf_export("dynamic_embedding.Variable")+class Variable(object):

We could probably use a different name instead of Variable, to avoid confusion with tf.Variable?

rhdong

comment created time in a month

Pull request review comment tensorflow/community

RFC: Sparse Domain Isolation for Supporting large-scale Sparse Weights Training.

+# Sparse Domain Isolation for supporting large-scale Recommender Systems.++| Status        | Proposed                                             |+:-------------- |:---------------------------------------------------- |+| **Author(s)** | Haidong Rong (hudsonrong@tencent.com) Yafei Zhang(kimmyzhang@tencent.com) Jiandong Wang(adnywang@tencent.com) |+| **Sponsor**   | Yuefeng Zhou (yuefengz@google.com)                   |+| **Updated**   | 2020-05-23                                           |++## Background++In recent years, many deep learning components (such as FC, RNN, etc.) and online-learning policy were introduced to the recommender systems that result in the special framework for recommender system (such as DiFacto, etc.) gradually unable to meet the demand. More and more algorithm engineers turn to use open-source general frameworks(such as Tensorflow, etc.)  to improve production efficiency.+In the industrial recommendation scenarios, the parameters are high-dimensional sparse with the following characteristics:+1. The theoretical total number of dynamic features is 2^64, the size of sparse weights is more than 100GB or even several TB which make reserving memory at the beginning too expensive;+2. The online-learning tracks user interests in real time and requires eliminating and adding weights on running-time without frequently restart training;++![Models Weights Compare](20200424-sparse-domain-isolation/models-weights-compare.png)++## Objective++TensorFlow supports large-scale dynamic sparse weights training.++## Problem & Motivation++TensorFlow is well designed, but not support creating and training the large-scale dynamic sparse weights. The main obstacles are:+1. The `tf.Variable` is fixed size on running time and saved in continuous memory and unable to add or remove weights dynamically.+2. The typical training mode of full updating in TF is very inefficient when training xTB models.+3. Usually, the sparse weights are saved in hash table form (K-V pairs for sparse feature ID and sparse weights) but the optimizers of TF cann't train directly sparse weights and accumulators(slots) saved in hash tables.++For the above reasons, the official version of TensorFlow can only be used for offline model hyperparameter tuning with fixed size features, and the model can not exceed 100GB generally. The application in the industrial scene is very limited. At present, We customized TensorFlow and successfully applied it to large-scale sparse features training of industrial recommendation scene. 
We hope to contribute this work of general components to the community.++## Design Proposal++### The ideal semantics:++```python+import tensorflow.dynamic_embedding as dye++# sparse weights defination:+w = dye.get_variable(name="dynamic_embeddings",+                     devices=["/job:ps/replica:0/task:0/CPU:0",],+                     dim=8)+z = dye.embedding_lookup(params=w,+                         ids=x, # x is got from samples.+                         name="wide-sparse-weights")+                         # z should be a trainable and variable-size instance.+# forward graph++loss = tf.XXX(...)++# backward graph+opt = tf.train.AdamOptimizer(learning_rate=0.1)+update_op = opt.minimize(loss)+```++## Sparse Domain Isolation++### Overview of Design+![Overview Flow Chart](20200424-sparse-domain-isolation/overview-flow-chart.png)++### Design Considerations++![Architecture of SDI](20200424-sparse-domain-isolation/architecture.png)+*   The `tf.Tensor` is still suitable for holding iteration weights, because only a small part of weights will be updated in each iterations.++![Expression Weights Hierarchically](20200424-sparse-domain-isolation/expression-weights-hierarchically.png)+*   Minimize changing on Tensorflow core and try to use native components.++#### Hash Table++Refer to `MutableHashTable` implemented in `tf.lookup`. The reason is as follows:++*   The `MutableHashTable` meets the basic requirements of sparse weights.+*   The `tf.lookup` is well designed and easy to maintain.+*   The save/restore from/to checkpoint is already supported.+*   Compatible with all kind of distributed strategy(parameters server.)+++### Detailed Design and Implementation++The trainable warp class `dynamic_embdding.TrainableWrap` inherted from `ResourceVariable` will be introduced, and help us reuse all optimizers.++![detail of SDI](20200424-sparse-domain-isolation/optimizers-reuse-scheme.png)++### APIs Overview++* Name Space+    * `tf.dynamic_embedding` # core version+    +* Creating Weights+    * `tf.dynamic_embedding.Variable`+    * `tf.dynamic_embedding.get_variable`++* Trainable Wrap+    * `tf.dynamic_embedding.TrainableWrap`+    * `tf.dynamic_embedding.embedding_lookup`+    * `tf.dynamic_embedding.embedding_lookup_sparse`++* High Level APIs(Optional) +    * `tf.dynamic_embedding.csr_matmul`+    * `tf.dynamic_embedding.coo_matmul`+    ++#### Name Space++All of the APIs are implemented in `tf.dynamic_embedding` package.++```python+import tensorflow as tf # core version+import tensorflow_recsys as tfrs # SIG version++weights = tf.dynamic_embedding.get_variable(*args, **kwargs)+```++#### Creating Sparse Weights++We design the `dynamic_embedding.Variable` based on the `tf.lookup.MutableHashTable` and make it support distributed scenario, there is no need to develop any new operators for the hash table. 
+The `dynamic_embedding.Variable` backed up by a group of hashtables is responsible for actually holding the memory resource of sparse weights.++* `tf.dynamic_embedding.Variable`+    * `export`+    * `remove`+    * `upsert` (update if exist else insert) +    * `size`+* `tf.dynamic_embedding.get_variable`++```python+@tf_export(v1=["dynamic_embedding.Variable"])+class Variable(object):+  """+  A Distributed version of HashTable(reference from lookup_ops.MutableHashTable).+  It is designed to dynamically store the Sparse Weights(Parameters) of DLRMs.+  """++  def __init__(self,+               key_dtype=dtypes.int64,+               value_dtype=dtypes.float32,+               dim=1,+               devices=["/CPU:0", ],+               partitioner=default_partition_fn,+               shared_name=None,+               name="TrainableWrapForDynamicEmbedding",+               initial_value=0.0,+               initial_mode="constant",  # use_default=False, == "constant"+               trainable=True,+               checkpoint=True):++    """Creates an empty `Variable` object.++    Creates a group of tables placed on devices,+    the type of its keys and values are specified by key_dtype+    and value_dtype, respectively.++    Args:+      key_dtype: the type of the key tensors.+      value_dtype: the type of the value tensors.+      dim: the length of the value array for each key.+      devices: the list of devices holding the tables.+        One table will be created on each device.+      partitioner: partition function of keys,+        return the partition index for each key.++      Example partition func:+      ```python+      def default_partition_fn(keys, shard_num):+        return tf.cast(keys % shard_num, dtype=tf.int32)+      ```++      shared_name: If non-empty, the SparseVariable will be shared under+        the given name across multiple sessions.+      name: A name for the operation (optional).+      initial_value: The value to use if a key is missing in the hash table.+      initial_mode: define the behavior when some keys were missing.+        'random': lookup will return the random values+          with normal distribution(mean=0.0, stddev=0.1)+        'constant': lookup will return the valuse+          filled by `initial_value`.+      trainable: True, will be treated as a trainable Variable, and add to+        to the list of variables collected in the graph under the key+        `GraphKeys.TRAINABLE_VARIABLES`.+      checkpoint: if True, the contents of the Variable are+        saved to and restored from checkpoints.+        If `shared_name` is empty for a checkpointed table,+        it is shared using the table node name.++    Returns:+      A `dynamic_embedding.Variable` object.+    """+    self.key_dtype = key_dtype+    self.value_dtype = value_dtype+    self.default_value = initial_value+    self.dim = dim+    self.devices = devices if isinstance(devices, list) else [devices, ]+    self.partition_fn = partitioner+    self.name = name+    self.shared_name = shared_name or "shared_name.{}".format(name)+    self.initial_value = initial_value+    self.initial_mode = initial_mode+    self.use_default = self.initial_mode == 'constant'+    self.trainable = trainable+    self.checkpoint = checkpoint++    self._tables = []+    self.size_ops = []+    self.shard_num = len(self.devices)++    assert initial_mode in ["random", "constant"], \+      "initial mode should be 'constant' or 'random' vs " + initial_mode++    _default_value = _convert_anything_to_init(self.initial_value, self.dim)++    for 
[Quoted diff context truncated: the remainder of the RFC's proposed dynamic_embedding.Variable, embedding_lookup / embedding_lookup_sparse, and TrainableWrap APIs. The comment below refers to this part of the proposed _read_variable_op:]

  def _read_variable_op(self):
    # The `assign_variable_op` runs before the read operation;
    # `prefetch_values` is a lookup-like operation of `TrainableWrap`.
    if hasattr(self, "prefetch_values"):
      ...

Would this if-branch make it non-thread-safe?

rhdong

comment created time in a month

Pull request review commenttensorflow/community

RFC: Sparse Domain Isolation for Supporting large-scale Sparse Weights Training.

[Quoted diff context truncated: the same RFC sections as above (dynamic_embedding.Variable, get_variable, embedding_lookup, embedding_lookup_sparse, and the TrainableWrap scheme for reusing optimizers). The comment below refers to this slot-creation pseudo-code:]

@tf_export("dynamic_embedding.create_slots")
def create_slots(primary, init, slot_name, op_name):
  """Helper function for creating a slot variable for stateful optimizers."""
  _params_var, _params_ids = _IDS_OF_RESOURCEVARIABLE_MAPPER_.get(primary)

What is _IDS_OF_RESOURCEVARIABLE_MAPPER_?

It looks like your slot variable is created at every step since it depends on the ids (while there is no such requirement in the original optimizers). Is that correct?

rhdong

comment created time in a month

Pull request review commenttensorflow/community

RFC: Sparse Domain Isolation for Supporting large-scale Sparse Weights Training.

[Quoted diff context truncated: another copy of the RFC's proposed tf.dynamic_embedding APIs (Variable, get_variable, embedding_lookup, embedding_lookup_sparse), quoted at the embedding_lookup_sparse definition.]
I'd prefer to put this in a separate repo first. We can allow it to graduate into TF core if it has a large number of users.

rhdong

comment created time in a month

Pull request review commenttensorflow/community

RFC: Sparse Domain Isolation for Supporting large-scale Sparse Weights Training.

[Quoted diff context truncated: the RFC's embedding_lookup implementation, quoted once more. The comment below refers to this branch:]

    warp_vals = TrainableWrap(params,
                              ids,
                              initial_value=vals,
                              dtype=params.value_dtype,
                              trainable=params.trainable)
    if max_norm is not None:
      warp_vals = _clip(params.lookup(ids), ids, max_norm)
  return warp_vals
Why isn't TrainableWrap used in this branch?

rhdong

comment created time in a month

issue commenttensorflow/tensorflow

Distributed Training on multiple nodes - process gets stuck after initializing grpc channel

Did you do something with your cluster_resolver? Maybe you triggered something that initialized some global state. Note you should never call experimental_connect_to_cluster for MultiWorkerMirroredStrategy.

Could you show us the code for how you construct your cluster_resolver and how you pass it to MultiWorkerMirroredStrategy?

hroetsc

comment created time in 2 months

issue commenttensorflow/tensorflow

Distributed Training on multiple nodes - process gets stuck after initializing grpc channel

MultiWorkerMirroredStrategy also accepts a ClusterResolver as input. You can pass your resolver into it instead of using experimental_connect_to_cluster. Note that the ClusterResolver passed in must have task_type and task_index set.
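For example, a minimal sketch (assuming a TF version whose MultiWorkerMirroredStrategy constructor takes a cluster_resolver argument, as described above; the hostnames and ports are placeholders):

import tensorflow as tf

cluster_spec = tf.train.ClusterSpec(
    {"worker": ["worker0.example.com:2222", "worker1.example.com:2222"]})
resolver = tf.distribute.cluster_resolver.SimpleClusterResolver(
    cluster_spec, task_type="worker", task_id=0)  # task_id is the task index; it differs per worker
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy(
    cluster_resolver=resolver)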

hroetsc

comment created time in 2 months

issue commenttensorflow/tensorflow

tensorflowrun for distributed training (MultiWorkerMirroredStrategy & ParameterServerStrategy)

It is a simple utility which I guess users can create by themselves.

sboshin

comment created time in 2 months

issue commenttensorflow/tensorflow

Distributed Training on multiple nodes - process gets stuck after initializing grpc channel

experimental_connect_to_cluster is not necessary for MultiWorkerMirroredStrategy. What are the issues with setting TF_CONFIG?
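For reference, a minimal sketch of setting TF_CONFIG before creating the strategy (hostnames and ports are placeholders; each worker sets its own index):

import json
import os

os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": ["worker0.example.com:2222", "worker1.example.com:2222"]},
    "task": {"type": "worker", "index": 0},  # use index 1 on the second worker
})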

hroetsc

comment created time in 2 months

issue closedtensorflow/tensorflow

Is it normal that every worker prints the same loss and metric values when using the multi-worker distributed strategy?

Descriptions

I've created a Keras model and trained it with the multi-worker distributed strategy using TF 2.2.0.

The training data I used is a dataset generated from a NumPy array. According to the shard policy, the dataset will be auto-sharded with the DATA policy. That is to say, every worker will handle a different portion of the whole dataset.

After training, I noticed that all the workers print the same loss and metric values, just as if they had all trained on identical data. But to my knowledge, if every worker trains on different data, their loss and metric values should be different. Is it because the shard policy didn't take effect?

I also know that multi-worker training uses all-reduce communication to keep variables in sync. So, are the printed values calculated after the all-reduce? That could explain the identical values.

Questions:

  1. Is this normal? And why?
  2. How can I check (or make sure) that different workers use different data portions for training (i.e., that the shard policy takes effect)?
  3. Are the printed values aggregated across all workers after the all-reduce operation, or are they just for that single worker?
  4. When does the all-reduce execute: at the end of every batch or at the end of every epoch?
  5. When I use just a NumPy array for training, does the shard policy work, or will every worker use the whole dataset for its training?
  6. In general, the dataset passed to every worker's fit method is the same. When the datasets are different, training still succeeds; does the shard policy work in that case?

Code snippets:

import json
import os

import numpy as np
import tensorflow as tf
from absl import app
from absl import flags
from tensorflow import keras
from tensorflow.keras import layers

# Assumed flag definition so that FLAGS.logs below resolves; the default path is illustrative.
flags.DEFINE_string('logs', './logs', 'Directory for TensorBoard logs and saved models.')
FLAGS = flags.FLAGS


class ThreeLayerMLP(keras.Model):
    def __init__(self, name=None):
        super().__init__(name=name)
        self.dense_1 = layers.Dense(64, activation='relu', name='dense_1')
        self.dense_2 = layers.Dense(64, activation='relu', name='dense_2')
        self.pred_layer = layers.Dense(10, name='predictions')

    def call(self, inputs):
        x = self.dense_1(inputs)
        x = self.dense_2(x)
        return self.pred_layer(x)


def main(argv):
    del argv  # Unused args
    strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()
    BATCH_SIZE_PER_REPLICA = 64
    BATCH_SIZE = BATCH_SIZE_PER_REPLICA * strategy.num_replicas_in_sync
    print('Number of devices: %d' % strategy.num_replicas_in_sync)

    with strategy.scope():
        model = ThreeLayerMLP(name='3_layer_mlp')
        model.compile(
            loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
            optimizer=keras.optimizers.RMSprop())

    log_dir = FLAGS.logs
    tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir,
                                                          histogram_freq=1,
                                                          update_freq='batch')

    np.random.seed(0)
    x_train, y_train = (np.random.random(
        (60000, 784)), np.random.randint(10, size=(60000, 1)))
    x_test, y_test = (np.random.random(
        (10000, 784)), np.random.randint(10, size=(10000, 1)))

    # options = tf.data.Options()
    # options.experimental_distribute.auto_shard_policy = tf.data.experimental.AutoShardPolicy.DATA
    train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train))
    train_dataset = train_dataset.shuffle(1024).batch(BATCH_SIZE)


    model.fit(
        train_dataset,
        epochs=5,
        steps_per_epoch=10,
        callbacks=tensorboard_callback)

    # Derive this worker's index from TF_CONFIG so each worker saves to a distinct path.
    task_config = json.loads(os.environ.get('TF_CONFIG', '{}')).get('task', {})
    task_index = task_config.get('index', 0)
    model_dir = FLAGS.logs + '/models/' + str(task_index)
    model.save(model_dir)


if __name__ == '__main__':
    app.run(main)

I've been confused by these questions for a long time. I would appreciate your help very much.

Thanks~

closed time in 2 months

AlexanderJLiu

issue commenttensorflow/tensorflow

Is it normal that every worker prints the same loss and metric values when using the multi-worker distributed strategy?

This is expected. The metric values are all-reduced in the fit method. By default, MultiWorkerMirroredStrategy will shard your data across workers. You can use auto_shard_policy to turn off that behavior.
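For example, something along these lines should turn off auto-sharding (a minimal sketch using the tf.data options API; train_dataset is the dataset from your snippet):

import tensorflow as tf

options = tf.data.Options()
options.experimental_distribute.auto_shard_policy = (
    tf.data.experimental.AutoShardPolicy.OFF)
train_dataset = train_dataset.with_options(options)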

AlexanderJLiu

comment created time in 2 months

pull request commenttensorflow/community

RFC: Sparse Domain Isolation for Supporting large-scale Sparse Weights Training.

I think TensorFlow can provide a way to extend optimizers so that you can adapt existing optimizers to handle your sparse weights.
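As a purely illustrative sketch of that idea (not an existing TF extension point; params, ids, lookup, and upsert are attributes assumed from the proposal above), one could wrap an existing optimizer and route hash-table-backed variables through a custom update:

import tensorflow as tf

class SparseAwareOptimizer(object):
  """Wraps an optimizer; sends hash-table-backed variables through upsert."""

  def __init__(self, inner_optimizer, learning_rate=0.1):
    self._inner = inner_optimizer
    self._lr = learning_rate

  def apply_gradients(self, grads_and_vars):
    dense, sparse = [], []
    for grad, var in grads_and_vars:
      # Variables exposing the assumed `params` attribute take the sparse path.
      (sparse if hasattr(var, "params") else dense).append((grad, var))
    ops = []
    if dense:
      ops.append(self._inner.apply_gradients(dense))
    for grad, var in sparse:
      # Simplified SGD-style update written back into the hash table.
      new_values = var.params.lookup(var.ids) - self._lr * grad
      ops.append(var.params.upsert(var.ids, new_values))
    return tf.group(*ops)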

rhdong

comment created time in 2 months

pull request commenttensorflow/community

RFC: Sparse Domain Isolation for Supporting large-scale Sparse Weights Training.

@byronyi If we are going to contribute to the addons repo first, do we need an RFC here?

rhdong

comment created time in 2 months

Pull request review commenttensorflow/ecosystem

Contribute spark-tensorflow-distributor to the ecosystem

+import logging

Some suggestions, but up to you: it would be nicer to break lines at column 80 (http://google.github.io/styleguide/pyguide.html#32-line-length), and the docstrings of function arguments should follow this format: http://google.github.io/styleguide/pyguide.html#doc-function-args
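
For illustration only (not the file's actual docstring), the argument format in that guide looks roughly like this:

def run(self, train_fn, **kwargs):
    """Runs the training function on the Spark cluster.

    Args:
        train_fn: Function that contains the TensorFlow training code.
        **kwargs: Keyword arguments forwarded to `train_fn` at invocation time.

    Returns:
        The return value of `train_fn` from the chief worker (partition ID 0).
    """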

sarthfrey

comment created time in 2 months

Pull request review commenttensorflow/ecosystem

Contribute spark-tensorflow-distributor to the ecosystem

+"""+Copyright 2016 The TensorFlow Authors. All Rights Reserved.++Licensed under the Apache License, Version 2.0 (the "License");+you may not use this file except in compliance with the License.+You may obtain a copy of the License at++    http://www.apache.org/licenses/LICENSE-2.0++Unless required by applicable law or agreed to in writing, software+distributed under the License is distributed on an "AS IS" BASIS,+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.+See the License for the specific language governing permissions and+limitations under the License.+"""++import logging+import math+import os+import random+import re+import sys++from pyspark.sql import SparkSession+++def _get_logger(name):+    """+    Gets a logger by name, or creates and configures it for the first time.+    """+    logger = logging.getLogger(name)+    logger.setLevel(logging.INFO)+    # If the logger is configured, skip the configure+    if not logger.handlers and not logging.getLogger().handlers:+        handler = logging.StreamHandler(sys.stderr)+        logger.addHandler(handler)+    return logger+++class MirroredStrategyRunner:+    """+    MirroredStrategyRunner runs TensorFlow deep learning training jobs on Spark clusters.+    It trains synchronously by mirroring the model's variables among all the workers and+    shares the gradient updates in a decentralized manner. MirroredStrategyRunner can take+    a regular TensorFlow program with no special distributed training code and launch it+    as a distributed training program.++    .. note:: See more at https://www.tensorflow.org/guide/distributed_training+    """++    def __init__(self,+                 *,+                 num_slots,+                 local_mode=False,+                 use_gpu=True,+                 gpu_resource_name='gpu',+                 use_custom_strategy=False):+        """+        :param num_slots: Total number of GPUs or CPU only Spark tasks that participate in+            distributed training. For example, if num_slots = 16 we train on the Spark+            cluster with 16 GPUs if doing GPU training, or with 16 Spark tasks if doing+            CPU training. num_slots cannot be less than or equal to 0. Note that when doing+            CPU training, Spark will still be subject to any GPU-aware scheduling confs set+            in the Spark configuration. Note also that for GPU training, num_slots will+            limit the number of GPUs used for training even if more are available, so that+            exactly num_slots GPUs are used in total. Spark does not restrict CPU cores for+            tasks and so for CPU training, num_slots rarely needs to be greater than the+            number of workers and in local mode set num_slots=1.+        :param local_mode: If True, the training function will be run locally on the+            driver. If False training is distributed among the workers.+        :param use_gpu: If True, training is done with GPUs using Spark resource scheduling+            with the gpu_resource_name parameter as the resource name. If False, do CPU only+            training.+        :param gpu_resource_name: The name of the Spark resource scheduling GPU resource. It+            may be set under `spark.executor.resource.{gpu_resource_name}`,+            `spark.task.resource.{gpu_resource_name}`,+            `spark.driver.resource.{gpu_resource_name}`, and+            `spark.worker.resource.{gpu_resource_name}` in the Spark conf. Contact the cluster+            administrator to set these configurations. 
The resource should be configured with+            a discovery script that is formatted according to the Spark configuration docs.+            Make sure `spark.driver.resource.{gpu_resource_name}.discoveryScript` and+            `spark.driver.resource.{gpu_resource_name}.discoveryScript` are also set in the+            Spark conf. In particular, the GPU addresses should be zero indexed. For example,+            the output of the discovery script for 3 GPUs with gpu_resource_name='gpu' would be+            `{"name": "gpu", "addresses":["0","1","2"]}`. See an example discovery script:+            `github.com/apache/spark/blob/master/examples/src/main/scripts/getGpusResources.sh`.+        :param use_custom_strategy: When true, the training function passed to the+            MirroredStrategyRunner.run method must construct and use its own+            tensorflow.distribute.Strategy() object. When false, MirroredStrategyRunner+            constructs one for the user and wraps the training function in the strategy+            context, allowing the user to provide non-distributed TensorFlow code that is+            executed as distributed code.++            Example with use_custom_strategy=True:++                def train_fn():+                    import tensorflow as tf+                    strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()+                    with strategy.scope():+                        # training code++            Example with use_custom_strategy=False:++                def train_fn():+                    import tensorflow as tf+                    # training code+        """+        self._logger = _get_logger(self.__class__.__name__)+        self._num_slots = num_slots+        if num_slots <= 0:+            raise ValueError(+                f'num_slots is set to {num_slots} but cannot be less than or equal to 0.'+            )+        self._local_mode = local_mode+        self._use_gpu = use_gpu+        self._gpu_resource_name = gpu_resource_name+        self._use_custom_strategy = use_custom_strategy+        if self._use_gpu:+            self._logger.info('Doing GPU training...')+        else:+            self._logger.info('Doing CPU training...')+        spark = SparkSession.builder.getOrCreate()+        self.sc = spark.sparkContext+        if self._local_mode is True:+            self._logger.warning(+                'MirroredStrategyRunner will run in local mode on the driver node. '+                'There would be resource contention if the driver also runs other workloads.'+            )+            self._num_tasks = None+        else:+            self._num_tasks = self.get_num_tasks()+            self._logger.info(f'Will run with {self._num_tasks} Spark tasks.')++    def get_num_tasks(self):+        """+        :return: The number of Spark tasks to use for distributed training+        """+        if self._use_gpu:+            key = f'spark.task.resource.{self._gpu_resource_name}.amount'+            if not self.sc.getConf().contains(key):+                raise Exception(+                    'Your cluster might not have Spark GPU-aware scheduling enabled, '+                    'please contact your cluster administrator.'+                    f'The conf `{key}` was not found in the Spark configuration.'+                )+            task_gpu_amount = int(self.sc.getConf().get(key))

Maybe you can call _get_gpus_owned here? Looks like there are two different ways to get the number of GPUs. Are they different? If not, could you consolidate them?

sarthfrey

comment created time in 2 months

Pull request review commenttensorflow/ecosystem

Contribute spark-tensorflow-distributor to the ecosystem

+"""+Copyright 2016 The TensorFlow Authors. All Rights Reserved.++Licensed under the Apache License, Version 2.0 (the "License");+you may not use this file except in compliance with the License.+You may obtain a copy of the License at++    http://www.apache.org/licenses/LICENSE-2.0++Unless required by applicable law or agreed to in writing, software+distributed under the License is distributed on an "AS IS" BASIS,+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.+See the License for the specific language governing permissions and+limitations under the License.+"""++import logging+import math+import os+import random+import re+import sys++from pyspark.sql import SparkSession+++def _get_logger(name):+    """+    Gets a logger by name, or creates and configures it for the first time.+    """+    logger = logging.getLogger(name)+    logger.setLevel(logging.INFO)+    # If the logger is configured, skip the configure+    if not logger.handlers and not logging.getLogger().handlers:+        handler = logging.StreamHandler(sys.stderr)+        logger.addHandler(handler)+    return logger+++class MirroredStrategyRunner:+    """+    MirroredStrategyRunner runs TensorFlow deep learning training jobs on Spark clusters.+    It trains synchronously by mirroring the model's variables among all the workers and+    shares the gradient updates in a decentralized manner. MirroredStrategyRunner can take+    a regular TensorFlow program with no special distributed training code and launch it+    as a distributed training program.++    .. note:: See more at https://www.tensorflow.org/guide/distributed_training+    """++    def __init__(self,+                 *,+                 num_slots,+                 local_mode=False,+                 use_gpu=True,+                 gpu_resource_name='gpu',+                 use_custom_strategy=False):+        """+        :param num_slots: Total number of GPUs or CPU only Spark tasks that participate in+            distributed training. For example, if num_slots = 16 we train on the Spark+            cluster with 16 GPUs if doing GPU training, or with 16 Spark tasks if doing+            CPU training. num_slots cannot be less than or equal to 0. Note that when doing+            CPU training, Spark will still be subject to any GPU-aware scheduling confs set+            in the Spark configuration. Note also that for GPU training, num_slots will+            limit the number of GPUs used for training even if more are available, so that+            exactly num_slots GPUs are used in total. Spark does not restrict CPU cores for+            tasks and so for CPU training, num_slots rarely needs to be greater than the+            number of workers and in local mode set num_slots=1.+        :param local_mode: If True, the training function will be run locally on the+            driver. If False training is distributed among the workers.+        :param use_gpu: If True, training is done with GPUs using Spark resource scheduling+            with the gpu_resource_name parameter as the resource name. If False, do CPU only+            training.+        :param gpu_resource_name: The name of the Spark resource scheduling GPU resource. It+            may be set under `spark.executor.resource.{gpu_resource_name}`,+            `spark.task.resource.{gpu_resource_name}`,+            `spark.driver.resource.{gpu_resource_name}`, and+            `spark.worker.resource.{gpu_resource_name}` in the Spark conf. Contact the cluster+            administrator to set these configurations. 
The resource should be configured with+            a discovery script that is formatted according to the Spark configuration docs.+            Make sure `spark.driver.resource.{gpu_resource_name}.discoveryScript` and+            `spark.driver.resource.{gpu_resource_name}.discoveryScript` are also set in the+            Spark conf. In particular, the GPU addresses should be zero indexed. For example,+            the output of the discovery script for 3 GPUs with gpu_resource_name='gpu' would be+            `{"name": "gpu", "addresses":["0","1","2"]}`. See an example discovery script:+            `github.com/apache/spark/blob/master/examples/src/main/scripts/getGpusResources.sh`.+        :param use_custom_strategy: When true, the training function passed to the+            MirroredStrategyRunner.run method must construct and use its own+            tensorflow.distribute.Strategy() object. When false, MirroredStrategyRunner+            constructs one for the user and wraps the training function in the strategy+            context, allowing the user to provide non-distributed TensorFlow code that is+            executed as distributed code.++            Example with use_custom_strategy=True:++                def train_fn():+                    import tensorflow as tf+                    strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()+                    with strategy.scope():+                        # training code++            Example with use_custom_strategy=False:++                def train_fn():+                    import tensorflow as tf+                    # training code+        """+        self._logger = _get_logger(self.__class__.__name__)+        self._num_slots = num_slots+        if num_slots <= 0:+            raise ValueError(+                f'num_slots is set to {num_slots} but cannot be less than or equal to 0.'+            )+        self._local_mode = local_mode+        self._use_gpu = use_gpu+        self._gpu_resource_name = gpu_resource_name+        self._use_custom_strategy = use_custom_strategy+        if self._use_gpu:+            self._logger.info('Doing GPU training...')+        else:+            self._logger.info('Doing CPU training...')+        spark = SparkSession.builder.getOrCreate()+        self.sc = spark.sparkContext+        if self._local_mode is True:+            self._logger.warning(+                'MirroredStrategyRunner will run in local mode on the driver node. 
'+                'There would be resource contention if the driver also runs other workloads.'+            )+            self._num_tasks = None+        else:+            self._num_tasks = self.get_num_tasks()+            self._logger.info(f'Will run with {self._num_tasks} Spark tasks.')++    def get_num_tasks(self):+        """+        :return: The number of Spark tasks to use for distributed training+        """+        if self._use_gpu:+            key = f'spark.task.resource.{self._gpu_resource_name}.amount'+            if not self.sc.getConf().contains(key):+                raise Exception(+                    'Your cluster might not have Spark GPU-aware scheduling enabled, '+                    'please contact your cluster administrator.'+                    f'The conf `{key}` was not found in the Spark configuration.'+                )+            task_gpu_amount = int(self.sc.getConf().get(key))+            if task_gpu_amount < 1:+                raise ValueError(+                    f'The Spark conf `{key}` has a value of {task_gpu_amount} but it '+                    'should not have a value less than 1.')+            return math.ceil(self._num_slots / task_gpu_amount)+        return self._num_slots++    def run(self, train_fn, **kwargs):+        """+        :param train_fn: Function that contains TensorFlow training code. If it constructs its own+            tensorflow.distribute.Strategy object, then construct MirroredStrategyRunner with+            use_custom_strategy set to `True`.+        :param kwargs: keyword arguments passed to the training function at invocation time. When+            train_fn is called, it will be called with train_fn(**kwargs).+        :return: return value of the training function from the chief training worker+            (partition ID 0) in distributed mode, or the direct return value of train_fn in+            local mode.+        """+        spark_task_program = self._get_spark_task_program(train_fn, **kwargs)++        # Run in local mode+        if self._local_mode:+            old_cuda_visible_devices = os.environ.get('CUDA_VISIBLE_DEVICES',+                                                      '')+            cuda_state_was_set = 'CUDA_VISIBLE_DEVICES' in os.environ+            try:+                if self._use_gpu:+                    # TODO: handle the case that driver gpu resources is not properly configured+                    gpus_owned = MirroredStrategyRunner._get_gpus_owned(+                        self.sc.resources, self._gpu_resource_name)+                    num_gpus_owned = len(gpus_owned)+                    if self._num_slots > num_gpus_owned:+                        raise ValueError(+                            f'{self._num_slots} slots were requested for local training with '+                            f'GPU training but only {num_gpus_owned} GPUs '+                            'were available.')+                    # TODO: Check GPU utilization to avoid resource contention+                    os.environ['CUDA_VISIBLE_DEVICES'] = ','.join(+                        str(e)+                        for e in random.sample(gpus_owned, self._num_slots))+                else:+                    if self._num_slots > 1:+                        raise ValueError(+                            f'Cannot run with more than 1 CPU machine in local mode. 
'+                            'Try setting num_slots to -1.')+                    os.environ['CUDA_VISIBLE_DEVICES'] = ''+                result = MirroredStrategyRunner._run_tensorflow_program(+                    train_fn, self._use_custom_strategy, **kwargs)+            finally:+                if cuda_state_was_set:+                    os.environ[+                        'CUDA_VISIBLE_DEVICES'] = old_cuda_visible_devices+                else:+                    del os.environ['CUDA_VISIBLE_DEVICES']+            return result++        # Run in distributed mode+        self._check_encryption()+        self._logger.info('Distributed training in progress...')+        self._logger.info(+            'View Spark executor stderr logs to inspect training...')+        result = self.sc.parallelize(range(self._num_tasks), self._num_tasks) \+            .barrier() \+            .mapPartitions(spark_task_program) \+            .collect()[0]+        self._logger.info(f'Training with {self._num_slots} slots is complete!')+        return result++    @staticmethod+    def _get_gpus_owned(resources, gpu_resource_name):+        """+        Gets the number of GPUs that Spark scheduled to the calling task

The docstring says this method gets "the number of GPUs", but it looks like it actually returns the GPU addresses?

sarthfrey

comment created time in 2 months

Pull request review commenttensorflow/ecosystem

Contribute spark-tensorflow-distributor to the ecosystem

+import argparse

Curious what this file does? Would you mind adding some description at the beginning of the file?

sarthfrey

comment created time in 2 months

Pull request review commenttensorflow/ecosystem

Contribute spark-tensorflow-distributor to the ecosystem

[review context collapsed: the comment below is on the use_custom_strategy docstring examples]

ah I think that is good enough.

sarthfrey

comment created time in 2 months

push eventyuefengz/community

Yuefeng Zhou

commit sha 6fa632ac9113ea9bc7cc2ee3f3e291860af708af

Change the status to accepted

view details

push time in 3 months

push eventyuefengz/community

Yuefeng Zhou

commit sha ab0f4334813801a39bad466d21fa401fb2b09bc0

Update 20200306-single-client-parameter-server.md

view details

push time in 3 months

Pull request review commenttensorflow/ecosystem

Contribute spark-tensorflow-distributor to the ecosystem

[review context collapsed: the comment below is on `gpu_addresses = [str(e) for e in random.sample(gpus_owned, my_num_gpus)]` in the set_gpus helper]

Can I assume that all of the GPUs on a machine will be taken, except on the last machine? The reason I ask is that randomly chosen GPUs may not have NVLink between them.
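
A minimal sketch of the alternative this question is hinting at (illustrative only, not the PR's implementation): take a sorted prefix of the GPU addresses Spark assigned to the task instead of a random sample, so that physically adjacent GPUs stay together.

import os

def set_visible_gpus(gpu_addresses_owned, num_needed):
    # Pin this task to a contiguous block of its Spark-assigned GPUs;
    # adjacent GPUs are more likely to be connected by NVLink.
    chosen = sorted(gpu_addresses_owned, key=int)[:num_needed]
    os.environ['CUDA_VISIBLE_DEVICES'] = ','.join(str(a) for a in chosen)

# Example: a task that owns GPUs 0-3 but only needs two of them.
set_visible_gpus(['0', '1', '2', '3'], 2)  # CUDA_VISIBLE_DEVICES='0,1'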

sarthfrey

comment created time in 3 months

Pull request review commenttensorflow/ecosystem

Contribute spark-tensorflow-distributor to the ecosystem

[review context collapsed: the comment below is on `self.logger = _get_logger(self.__class__.__name__)` in MirroredStrategyRunner.__init__]
For private fields, we usually hide them by prepending a "_" in their names.

sarthfrey

comment created time in 3 months

Pull request review commenttensorflow/ecosystem

Contribute spark-tensorflow-distributor to the ecosystem

+import logging+import math+import os+import random+import re+import sys+++from pyspark.sql import SparkSession+from pyspark.sql import SQLContext+++def _get_logger(name):+    """+    Gets a logger by name, or creates and configures it for the first time.+    """+    logger = logging.getLogger(name)+    logger.setLevel(logging.INFO)+    # If the logger is configured, skip the configure+    if not logger.handlers and not logging.getLogger().handlers:+        handler = logging.StreamHandler(sys.stderr)+        logger.addHandler(handler)+    return logger+++class MirroredStrategyRunner:+    """+    MirroredStrategyRunner runs TensorFlow deep learning training jobs on Spark clusters.+    It trains synchronously by mirroring the model's variables among all the workers and+    shares the gradient updates in a decentralized manner. MirroredStrategyRunner can take+    a regular TensorFlow program with no special distributed training code and launch it+    as a distributed training program.++    .. note:: See more at https://www.tensorflow.org/guide/distributed_training+    """+    def __init__(self, num_slots, gpu_resource_name='gpu', use_custom_strategy=False):+        """+        :param num_slots: Total number of GPUs or CPU only Spark tasks that participate in distributed training.+            When num_slots < 0, training is done in local mode on the Spark driver, and otherwise+            training is distributed among the workers on the Spark cluster. For example,+            num_slots = -4 means we train locally on 4 slots. If num_slots = 16 we train on the Spark cluster +            with 16 GPUs if doing GPU training, or with 16 Spark tasks if doing CPU training. num_slots cannot be 0.+            Note that when doing CPU training, Spark will still be subject to any GPU-aware scheduling confs set in+            the Spark configuration. Note also that for GPU training, num_slots will limit the number of GPUs used+            for training even if more are available, so that exactly num_slots GPUs are used in total. Spark does not+            restrict CPU cores for tasks and so for CPU training, num_slots rarely needs to be greater than the+            number of workers and for local mode set num_slots=-1.+        :param gpu_resource_name: If None, it will do CPU only training. Otherwise, it will train with GPUs+            where the parameter value is the name of the Spark resource scheduling GPU resource. It may be set+            under `spark.executor.resource.{gpu_resource_name}`, `spark.task.resource.{gpu_resource_name}`,+            `spark.driver.resource.{gpu_resource_name}`, and `spark.worker.resource.{gpu_resource_name}` in the+            Spark conf. Contact the cluster administrator to set these configurations. The resource+            should be configured with a discovery script that is formatted according to the Spark configuration docs.+            Make sure `spark.driver.resource.{gpu_resource_name}.discoveryScript` and +            `spark.driver.resource.{gpu_resource_name}.discoveryScript` are also set in the Spark conf.+            In particular, the GPU addresses should be zero indexed. For example, the output of the discovery script for 3 GPUs+            with gpu_resource_name='gpu' would be `{"name": "gpu", "addresses":["0","1","2"]}`. 
See an example discovery+            script: `github.com/apache/spark/blob/master/examples/src/main/scripts/getGpusResources.sh`.+        :param use_custom_strategy: When true, the training function passed to the MirroredStrategyRunner.run method must+            construct and use its own tensorflow.distribute.Strategy() object. When false, MirroredStrategyRunner constructs+            one for the user and wraps the training function in the strategy context, allowing the user to provide+            non-distributed TensorFlow code that is executed as distributed code.++            Example with use_custom_strategy=True:++                def train_fn():+                    import tensorflow as tf+                    strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()+                    with strategy.scope():+                        # training code++            Example with use_custom_strategy=False:++                def train_fn():+                    import tensorflow as tf+                    # training code+        """+        self.logger = _get_logger(self.__class__.__name__)+        self.num_slots = num_slots+        if num_slots == 0:+            raise ValueError(+                'num_slots cannot be 0.'+            )+        self.use_custom_strategy = use_custom_strategy+        self.gpu_resource_name = gpu_resource_name+        self.is_gpu = self.gpu_resource_name is not None+        if self.is_gpu:+            self.logger.info(+                'Doing GPU training...'+            )+        else:+            self.logger.info(+                'Doing CPU training...'+            )+        self.local_mode = self.num_slots < 0+        spark = SparkSession.builder.getOrCreate()+        self.sc = spark.sparkContext+        if self.local_mode is True:+            self.logger.warning(+                'MirroredStrategyRunner will run in local mode on the driver node. '+                'There would be resource contention if the driver also runs other workloads.'+            )+            self.num_slots = -self.num_slots+            self.num_tasks = None+            self.num_workers = None+        else:+            self.num_tasks = self.get_num_tasks()+            self.logger.info(f'Will run with {self.num_tasks} Spark tasks.')++    def get_num_tasks(self):+        if self.is_gpu:+            key = f'spark.task.resource.{self.gpu_resource_name}.amount'+            if not self.sc.getConf().contains(key):+                raise Exception(+                    'Your cluster might not have Spark GPU-aware scheduling enabled, '+                    'please contact your cluster administrator.'+                    f'The conf `{key}` was not found in the Spark configuration.'+                )+            task_gpu_amount = int(self.sc.getConf().get(key))

Is it always true that all workers in a Spark cluster have the same number of GPUs?

sarthfrey

comment created time in 3 months

Pull request review commenttensorflow/ecosystem

Contribute spark-tensorflow-distributor to the ecosystem

+import logging

Is it possible for you to follow the Google Python style guide? http://google.github.io/styleguide/pyguide.html

Generally speaking, that guide recommends using pylint, breaking lines at 80 characters, using a different form of docstrings, etc.

sarthfrey

comment created time in 3 months

Pull request review commenttensorflow/ecosystem

Contribute spark-tensorflow-distributor to the ecosystem

[review context collapsed: the comment below is on the num_slots docstring: "num_slots rarely needs to be greater than the number of workers and for local mode set num_slots=-1"]

"num_slots rarely needs to be greater than the number of workers": does that mean that in some cases the number of available workers can be less than num_slots?

sarthfrey

comment created time in 3 months

Pull request review commenttensorflow/ecosystem

Contribute spark-tensorflow-distributor to the ecosystem

[review context collapsed: the comment below is on `def __init__(self, num_slots, gpu_resource_name='gpu', use_custom_strategy=False):`]

Suggestion: would changing it to "use_gpu=True" make it clearer?

sarthfrey

comment created time in 3 months

Pull request review commenttensorflow/ecosystem

Contribute spark-tensorflow-distributor to the ecosystem

[review context collapsed: the comment below is on the use_custom_strategy parameter docstring]

Could we detect when users do it the wrong way (use_custom_strategy=False but they create a strategy anyway) and throw an exception?

sarthfrey

comment created time in 3 months

Pull request review commenttensorflow/ecosystem

Contribute spark-tensorflow-distributor to the ecosystem

[review context collapsed: the comment below is on `def __init__(self, num_slots, gpu_resource_name='gpu', use_custom_strategy=False):`]

Suggestion: would breaking it into two or more arguments make it clearer? For example, num_gpus and num_workers, or num_replicas and local. In tf.distribute, we use the term "replica" quite often.

sarthfrey

comment created time in 3 months
