
tensorflow/networking 59

Enhanced networking support for TensorFlow. Maintained by SIG-networking.

bytedance/tensorflow 7

An Open Source Machine Learning Framework for Everyone

byronyi/coflow 6

Application aware network QoS

liutongxuan/tensorflow 5

Open source software library for numerical computation using data flow graphs.

byronyi/docs 1

TensorFlow documentation

byronyi/models 1

Models and examples built with TensorFlow

byronyi/networking 1

Enhanced networking support for TensorFlow. Maintained by SIG-networking.

byronyi/tensorflow 1

An Open Source Machine Learning Framework for Everyone

byronyi/bazel-compilation-database 0

Tool to generate compile_commands.json from the Bazel build system

issue comment pytorch/pytorch

[RFC] Add tar-based IterableDataset implementation to PyTorch

The PyTorch DataLoader is a pain in the ass for any data that does not reside in a mounted file system. It encourages users to store a huge number of small files, which is effectively a DoS attack on the storage system itself.

The whole multi-worker pre-processing design wastes a huge amount of CPU cycles, and even GPU hours, because of frequent OOMs in the worker processes.

It's a nightmare for infrastructure SREs to keep the training platform functioning in the face of all these broken data pipelines written by PyTorch users.

Maybe the average PyTorch user only uses low- to medium-end GPUs (so there is no need to feed 10k images into an 8-V100 DGX-1) and their datasets fit on a local NVMe SSD.
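For illustration, a minimal sketch of the tar-streaming approach being proposed, assuming a single local or mounted shard file (the class name and shard filename are placeholders, not the RFC's API):

```python
import tarfile

from torch.utils.data import DataLoader, IterableDataset


class TarShardDataset(IterableDataset):
    """Stream (name, bytes) samples sequentially from one tar shard,
    instead of scattering millions of small files over the file system."""

    def __init__(self, tar_path):
        self.tar_path = tar_path

    def __iter__(self):
        # "r|*" opens the archive in streaming mode: purely sequential reads,
        # which is what object stores and network file systems prefer.
        with tarfile.open(self.tar_path, "r|*") as tar:
            for member in tar:
                if member.isfile():
                    yield member.name, tar.extractfile(member).read()


loader = DataLoader(TarShardDataset("shard-000000.tar"), batch_size=None)
```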

tmbdev

comment created time in 7 days

pull request comment pytorch/xla

Remove serialization patch.

Hi, will this and #2302 be cherry-picked to r1.6?

ailzhang

comment created time in 9 days

started PyTorchLightning/pytorch-lightning

started time in 10 days

pull request comment pytorch/pytorch

Make a separate cmake option for caffe2 tests

This has been reverted in d4edbbd39674740465ea303fc315a963b16a4516. Does anyone know why?

suo

comment created time in 18 days

issue comment tensorflow/tensorflow

tensorflow-io import error caused by "com_google_absl"

@hlopko @gunan @impjdi I just checked that f805153a25b00d12072bd728e91bb1621bfcf1b1, checked in yesterday, fixes my issue above. Thanks!

CuiYifeng

comment created time in 19 days

issue comment pytorch/xla

why xla_model.xla_device() so slow at GPU backend?

It usually means that the local TF (or libxla_computation_client.so) you use was built with mismatched CUDA compute capabilities, e.g. built for 3.5 and 6.0 but running on V100 (7.0) hardware. It won't be a hard failure, but the CUDA driver will JIT-compile all the PTX code, which can take up to an hour in some cases.
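A quick way to confirm the mismatch on the running machine (a sketch; the `built_for` set below is just the 3.5/6.0 example from above and must be replaced with whatever your build's TF_CUDA_COMPUTE_CAPABILITIES was):

```python
import torch

# Compute capability of the GPU that is actually present, e.g. (7, 0) on V100.
major, minor = torch.cuda.get_device_capability(0)
device_cc = f"{major}.{minor}"

# Capabilities the local TF / libxla_computation_client.so was compiled for
# (example values only; take them from your own build configuration).
built_for = {"3.5", "6.0"}

if device_cc not in built_for:
    print(f"No SASS for compute capability {device_cc} in the build; "
          "the CUDA driver will JIT-compile the PTX on first use.")
```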

qianzhang613

comment created time in 20 days

pull request comment tensorflow/community

RFC: tf.data Snapshot

@frankchn @aaudiber @rohan100jain I have a question regarding using snapshot in conjunction with the tf.data service. Would it be feasible to shard the snapshot across multiple tf.data service workers?
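For context, a rough sketch of the combination I have in mind, using the experimental APIs (paths, the dispatcher address, and the parse function are placeholders; whether the snapshot itself can be sharded across service workers is exactly the open question):

```python
import tensorflow as tf


def parse_fn(record):
    # Placeholder parsing; the real feature spec depends on the dataset.
    return tf.io.parse_single_example(
        record, {"x": tf.io.FixedLenFeature([], tf.string)})


files = tf.data.Dataset.list_files("/data/train-*.tfrecord")
dataset = tf.data.TFRecordDataset(files)
dataset = dataset.map(parse_fn, num_parallel_calls=tf.data.experimental.AUTOTUNE)

# Materialize the pre-processed data once (this RFC).
dataset = dataset.apply(tf.data.experimental.snapshot("/data/snapshot"))

# Hand further processing to tf.data service workers.
dataset = dataset.apply(tf.data.experimental.service.distribute(
    processing_mode="parallel_epochs",
    service="grpc://dispatcher:5000"))
```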

frankchn

comment created time in 22 days

issue comment tensorflow/tensorflow

No Debug Symbols when using per_file_copt option

`//tensorflow/c/..cc,//tensorflow/cc/./..cc,//tensorflow/core/./.*.cc` is not a valid path filter.

See https://docs.bazel.build/versions/master/user-manual.html#flag--per_file_copt for an example.

PS: you could try `//tensorflow/c/.*\.cc,//tensorflow/cc/.*\.cc,//tensorflow/core/.*\.cc@-g,-O0`

makn87

comment created time in 23 days

issue comment tensorflow/tensorflow

No Debug Symbols when using per_file_copt option

I would suggest using `-c opt` together with the per-file copt option.

makn87

comment created time in 23 days

issue comment tensorflow/tensorflow

No Debug Symbols when using per_file_copt option

You could take a look at https://github.com/tensorflow/tensorflow/issues/27495#issuecomment-481101932

makn87

comment created time in 25 days

commit comment

issue comment pytorch/pytorch

Python dataloader Improvements

Most parts of the improvement plan seem to have stalled. @SsnL has left FB and at best works on PyTorch part time.

Any updates from the team?

SsnL

comment created time in a month

issue comment pytorch/pytorch

[RFC] Add tar-based IterableDataset implementation to PyTorch

Another question: how does WebDataset plan to deal with fault tolerance? Do you plan to integrate HTTP range requests and store the offset up to which you have processed in the model checkpoint, or do users need to restart from the beginning of an epoch?
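To make the first option concrete, a generic illustration (not WebDataset's API) of resuming a remote shard from a byte offset stored in the checkpoint, via an HTTP Range request:

```python
import urllib.request


def open_shard_at(url, offset):
    """Return a file-like object positioned `offset` bytes into the remote shard.

    `url` and `offset` would be read back from the training checkpoint; the
    server (e.g. an object store) must support HTTP range requests.
    """
    req = urllib.request.Request(url, headers={"Range": f"bytes={offset}-"})
    return urllib.request.urlopen(req)


# stream = open_shard_at("https://example.com/shard-000000.tar", saved_offset)
```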

tmbdev

comment created time in a month

started NVlabs/tensorcom

started time in a month

started tensortrade-org/tensortrade

started time in a month

issue opened pytorch/xla

Roadmap for RPC Service

Currently torch/xla uses multiple XRT sessions to improve OpByOp and data uploading performance (correct me if I am wrong!).

I am wondering if there is a plan to switch to the native TF eager service, which has better support for mixing OpByOp execution and graph execution. AFAIK the TF eager runtime will play a major role in TF 2.x, and if it works well for torch/xla, we could have a unified XLA/MLIR runtime for TF and Torch (perhaps based on TFRT).

created time in a month

issue comment tensorflow/tensorflow

Add public API to get CUDA version and path, CUDNN version and path.

@gunan Just found out that it's been fixed by 67099d08529a00f48ee8a444f471067bedcc2f5b. So no need to open a new issue :)

gabrieldemarmiesse

comment created time in a month

issue comment tensorflow/tensorflow

Add public API to get CUDA version and path, CUDNN version and path.

@angerson It seems that building tf-nightly now requires opt_einsum in the host environment:

Mon Jun 8 13:08:20 UTC 2020 : === Building wheel
2020-06-08 13:08:21.249561: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
Traceback (most recent call last):
  File "setup.py", line 47, in <module>
    from tensorflow.python.platform import build_info
  File "/tmp/tmp.7oZofsqRJ1/tensorflow/__init__.py", line 41, in <module>
    from tensorflow.python.tools import module_util as _module_util
  File "/tmp/tmp.7oZofsqRJ1/tensorflow/python/__init__.py", line 47, in <module>
    from tensorflow.python import keras
  File "/tmp/tmp.7oZofsqRJ1/tensorflow/python/keras/__init__.py", line 27, in <module>
    from tensorflow.python.keras import models
  File "/tmp/tmp.7oZofsqRJ1/tensorflow/python/keras/models.py", line 23, in <module>
    from tensorflow.python.keras import backend as K
  File "/tmp/tmp.7oZofsqRJ1/tensorflow/python/keras/backend.py", line 37, in <module>
    from tensorflow.python.distribute import distribute_coordinator as dc
  File "/tmp/tmp.7oZofsqRJ1/tensorflow/python/distribute/distribute_coordinator.py", line 33, in <module>
    from tensorflow.python.training import monitored_session
  File "/tmp/tmp.7oZofsqRJ1/tensorflow/python/training/monitored_session.py", line 45, in <module>
    from tensorflow.python.training.tracking import graph_view
  File "/tmp/tmp.7oZofsqRJ1/tensorflow/python/training/tracking/graph_view.py", line 27, in <module>
    from tensorflow.python.training import optimizer as optimizer_v1
  File "/tmp/tmp.7oZofsqRJ1/tensorflow/python/training/optimizer.py", line 36, in <module>
    from tensorflow.python.ops import gradients
  File "/tmp/tmp.7oZofsqRJ1/tensorflow/python/ops/gradients.py", line 27, in <module>
    from tensorflow.python.ops.gradients_impl import gradients
  File "/tmp/tmp.7oZofsqRJ1/tensorflow/python/ops/gradients_impl.py", line 30, in <module>
    from tensorflow.python.ops import linalg_grad  # pylint: disable=unused-import
  File "/tmp/tmp.7oZofsqRJ1/tensorflow/python/ops/linalg_grad.py", line 47, in <module>
    from tensorflow.python.ops.linalg import linalg_impl as _linalg
  File "/tmp/tmp.7oZofsqRJ1/tensorflow/python/ops/linalg/linalg_impl.py", line 34, in <module>
    from tensorflow.python.ops import special_math_ops
  File "/tmp/tmp.7oZofsqRJ1/tensorflow/python/ops/special_math_ops.py", line 30, in <module>
    import opt_einsum
ImportError: No module named 'opt_einsum'

This seems to have broken tf-nightly on Linux since your CL landed; I'm not sure whether it will be fixed in TF or in the build environment.
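As a minimal probe (my own snippet, not part of the build scripts), this is what the host interpreter used by the wheel build is missing:

```python
import importlib.util

# setup.py now imports tensorflow.python, which in turn imports opt_einsum,
# so the module must be importable in the host environment.
if importlib.util.find_spec("opt_einsum") is None:
    print("opt_einsum is not installed in the host Python")
```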

gabrieldemarmiesse

comment created time in a month

pull request comment tensorflow/community

RFC: Kernel and Op Implementation and Registration API

@annarev Thanks for rolling things forward.

Btw, would you mind sharing with the community the current status of the "Voltron" project?

sjamesr

comment created time in a month

issue comment tensorflow/tensorflow

Distributed TensorFlow scaling efficiency with CUDA_VISIBLE_DEVICES

@nluehr Thanks for the update! Does this affect containers as well?

leesharma

comment created time in a month

started xuanyi-fu/xynet

started time in a month

started openai/gpt-3

started time in a month

Pull request review comment tensorflow/community

RFC: Sparse Domain Isolation for Supporting large-scale Sparse Weights Training.

+  def remove(self, keys, name=None):

Adding some docs would help.
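For example, a short docstring along these lines (a sketch in the style of the other methods, not the authors' wording) would already help:

```python
def remove(self, keys, name=None):
    """Removes `keys` and their associated values from the Variable.

    Args:
      keys: Keys to remove. Can be a tensor of any shape. Must match the
        table's key type.
      name: A name for the operation (optional).

    Returns:
      The created Operation.
    """
```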

rhdong

comment created time in a month

started tensor-compiler/taco

started time in a month

Pull request review comment tensorflow/community

[RFC] Elasticity and Fault tolerance in Multi-WorkerMirroredStrategy

+        * **group_size** and **group_key** will need to be Inputs

Does rebuilding the graph/function cost a lot if the group size changes? I could imagine that the ring-reduce algorithm might need a different packing scheme for a different number of workers.

I'm not sure if that's part of the 3-minute initialization mentioned above...

sboshin

comment created time in a month

Pull request review comment tensorflow/community

[RFC] Elasticity and Fault tolerance in Multi-WorkerMirroredStrategy

+    * If chief doesn’t respond ping next chief candidate

That's also my concern. We probably need a consensus algorithm, e.g. Paxos or Raft, if we want to do leader election in a proper way.

sboshin

comment created time in a month

started resume/resume.github.com

started time in a month

issue comment tensorflow/tensorflow

tensorflow-io import error caused by "com_google_absl"

It looks like the issue was root caused?

@gunan My issue was that my RBE builds with `-c opt` do not correctly set `-DNDEBUG` and hit an Abseil assertion during unit testing, which was introduced in the aforementioned patch. The current workaround is a manual `--copt=-DNDEBUG`, but I'd like to see why `-DNDEBUG` is not set in the RBE platform and/or toolchain definitions.

CuiYifeng

comment created time in a month

started cloudendpoints/esp

started time in a month

Pull request review comment tensorflow/community

[RFC] Elasticity and Fault tolerance in Multi-WorkerMirroredStrategy

+Currently in TF, keras has implemented a fault tolerance mechanism. When a node gets pre-empted and is able to restart, keras will wait for the node to come back online, and [everyone will restart from the last saved checkpoint](https://www.tensorflow.org/tutorials/distribute/multi_worker_with_keras#fault_tolerance).

I saw that being discussed in alternative design 1. Thanks!

sboshin

comment created time in a month

Pull request review comment tensorflow/community

[RFC] Elasticity and Fault tolerance in Multi-WorkerMirroredStrategy

…This design aims to augment the Multi-Worker Mirrored Strategy to add elasticity while avoiding start/initialization times. In the event of cluster shrinkage or expansion, we don’t want to restart the process/session, but want the session to adapt to the cluster changes, avoiding re-compilation, re-optimization, and other usual first-time costs.

It would be great if we could go deeper and analyze which parts of the initialization overhead are essential, and which parts could be delayed or optimized out. The analysis might help even if we decide to go with saving/restoring in-memory checkpoints.
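As a rough illustration only, here is a single-worker sketch of how one might break the first-run cost into strategy setup, variable creation, and the first traced step. It assumes TF 2.2+ (where `strategy.run` is available); the `timed` helper and the toy model are made up for this example and are not part of the RFC.

```python
import time
import tensorflow as tf

def timed(label, fn):
    start = time.perf_counter()
    result = fn()
    print(f"{label}: {time.perf_counter() - start:.3f}s")
    return result

# With no TF_CONFIG set this falls back to a single local worker,
# which is enough to see where the startup time goes.
strategy = timed("strategy init",
                 lambda: tf.distribute.experimental.MultiWorkerMirroredStrategy())

with strategy.scope():
    model = timed("model/variable creation", lambda: tf.keras.Sequential([
        tf.keras.layers.Dense(256, activation="relu", input_shape=(32,)),
        tf.keras.layers.Dense(1),
    ]))
    optimizer = tf.keras.optimizers.SGD(0.01)

x = tf.random.normal([64, 32])
y = tf.random.normal([64, 1])

@tf.function
def train_step(features, labels):
    def step_fn(f, l):
        # Loss scaling across replicas omitted for brevity.
        with tf.GradientTape() as tape:
            loss = tf.reduce_mean(tf.square(model(f) - l))
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
        return loss
    return strategy.run(step_fn, args=(features, labels))

timed("first step (trace + collective setup)", lambda: train_step(x, y))
timed("steady-state step", lambda: train_step(x, y))
```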

sboshin

comment created time in a month

Pull request review commenttensorflow/community

[RFC] Elasticity and Fault tolerance in Multi-WorkerMirroredStrategy

…

### CAVEAT: Model Determinism

Not all global batch sizes are divisible by the scaled number of workers. And for XLA-compiled kernels, adjusting the local batch size will trigger re-compilation as well. That being said, we might come up with some remedies for these problems.
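For the divisibility point, a small sketch of one possible remedy: keep the global batch size fixed and spread the remainder over the lowest-indexed workers. The helper below is hypothetical (not part of the RFC), and note that any worker whose local batch size changes would still re-trigger XLA compilation.

```python
def local_batch_size(global_batch_size, num_workers, worker_index):
    """Split a fixed global batch across workers; the first `remainder` workers get one extra sample."""
    base, remainder = divmod(global_batch_size, num_workers)
    return base + (1 if worker_index < remainder else 0)

# Example: a global batch of 512 over 6 workers -> [86, 86, 85, 85, 85, 85].
sizes = [local_batch_size(512, 6, i) for i in range(6)]
assert sum(sizes) == 512  # the global batch size is preserved exactly
print(sizes)
```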

sboshin

comment created time in a month

Pull request review commenttensorflow/community

[RFC] Elasticity and Fault tolerance in Multi-WorkerMirroredStrategy

…

### CAVEAT: Model Determinism

Then you might have layers such as BatchNorm dealing with different local batch sizes.
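One possible remedy, sketched below under the assumption that `tf.keras.layers.experimental.SyncBatchNormalization` (available in TF 2.2+) is acceptable for the model, is to all-reduce the batch statistics across replicas so they no longer depend on each replica's local batch size:

```python
import tensorflow as tf

strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(32, 32, 3)),
        # The sync variant aggregates mean/variance across all replicas, so the
        # normalization statistics stay consistent even when local batch sizes
        # diverge after an elastic resize.
        tf.keras.layers.experimental.SyncBatchNormalization(),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(10),
    ])
```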

sboshin

comment created time in a month

Pull request review commenttensorflow/community

[RFC] Elasticity and Fault tolerance in Multi-WorkerMirroredStrategy

…Currently in TF, Keras has implemented a fault tolerance mechanism. When a node gets pre-empted and is able to restart, Keras will wait for the node to come back online, and [everyone will restart from the last saved checkpoint](https://www.tensorflow.org/tutorials/distribute/multi_worker_with_keras#fault_tolerance).

I might be missing something here, but it seems to me that the proposed recovery mechanism is similar to an in-memory save/restore of checkpoints?
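For concreteness, a minimal sketch of what I mean by an in-memory save/restore: keep host-memory copies of the variables created under the strategy scope and roll back to them on failure. The helpers below are hypothetical and are not the API proposed in the RFC.

```python
import tensorflow as tf

def snapshot_variables(variables):
    # "In-memory checkpoint": copy the current values to host memory.
    return [v.numpy() for v in variables]

def restore_variables(variables, snapshot):
    # Roll every variable back to the last known-good state.
    for var, value in zip(variables, snapshot):
        var.assign(value)

strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()
with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(4, input_shape=(8,))])

backup = snapshot_variables(model.variables)   # taken before running the step function
# ... on a communication failure / membership change, discard the bad step and ...
restore_variables(model.variables, backup)     # continue training from the saved state
```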

sboshin

comment created time in a month

Pull request review commenttensorflow/community

[RFC] Elasticity and Fault tolerance in Multi-WorkerMirroredStrategy

…

### Performance Implications

* Speed implications
    * As variables are being updated, there will be a backup process to enable recovery. This process will have performance implications.
    * Additional network calls will impact communication time.
    * Time taken to modify the cluster size and initialize variables of new nodes will also impact performance.
* Memory implications
    * CPU memory will increase as we save the variables in CPU memory.
* Benchmarks comparing dedicated nodes vs. nodes with planned failures to measure the impact of elasticity.
* End-to-end tests will utilize the multiprocess runner and run small training jobs with planned failures. These tests will be added to collective_all_reduce_strategy_test.py.

### Tutorials and Examples

* The run script to specify min/max nodes will be a new CLI:
    * `tensorflowrun --min_nodes X --max_nodes Y --discovery_script discover.sh --hostfile hostfile python training_script.py`
* Alternative: `multiworker_strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy(min_nodes=X, max_nodes=Y)`
* A tutorial on creating discovery scripts (see the sketch below).

TF ClusterResolver already supports ClusterSpec update and propagation. Are there any other specific features we would need to support using a discovery script, and if so, could we extend the current TF ClusterResolver?
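For illustration, a minimal sketch of what I have in mind, assuming we simply subclass the existing resolver interface; the `DiscoveryScriptClusterResolver` name, its `discovery_script`/`min_nodes`/`max_nodes` parameters, and the JSON output format of the script are hypothetical, not existing TF APIs:

```python
import json
import subprocess

import tensorflow as tf


class DiscoveryScriptClusterResolver(tf.distribute.cluster_resolver.ClusterResolver):
  """Hypothetical resolver that re-runs a user discovery script on every resolve."""

  def __init__(self, discovery_script, min_nodes=1, max_nodes=None):
    self._discovery_script = discovery_script
    self._min_nodes = min_nodes
    self._max_nodes = max_nodes

  def cluster_spec(self):
    # Assumes the script prints a JSON list of "host:port" worker addresses.
    workers = json.loads(subprocess.check_output([self._discovery_script]))
    if self._max_nodes:
      workers = workers[:self._max_nodes]
    if len(workers) < self._min_nodes:
      raise RuntimeError("Discovered %d workers, need at least %d"
                         % (len(workers), self._min_nodes))
    return tf.train.ClusterSpec({"worker": workers})

  def master(self, task_type=None, task_id=None, rpc_layer=None):
    # A real implementation would return the session master for the given task.
    return ""
```

The elastic strategy could then re-query `cluster_spec()` between steps instead of relying on a one-shot `TF_CONFIG`.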

sboshin

comment created time in a month

Pull request review commenttensorflow/community

[RFC] Elasticity and Fault tolerance in Multi-WorkerMirroredStrategy

[Quoted RFC context, condensed: the proposal header, motivation, and MWMS background, ending with the Design Summary: “This design aims to augment the Multi-Worker Mirrored Strategy to add elasticity while avoiding start/initialization times. In the event of cluster shrinkage or expansion, we don’t want to restart the process/session, but want the session to adapt to the cluster changes, avoiding re-compilation, re-optimization, and other usual first-time costs.”]

Membership changes add overhead, too. I’m not convinced that the performance gain outweighs the extra design complexity.

It would be better to add some preliminary experiment results and/or a prototype of this design here.

sboshin

comment created time in a month

Pull request review commenttensorflow/community

[RFC] Elasticity and Fault tolerance in Multi-WorkerMirroredStrategy

[Quoted RFC context, condensed: the same header and background text as above, ending at the section “CAVEAT: Model Determinism”.]

The local/global batch size might change, and the model’s optimizer needs to take that into account. It’s a little more than just “model indeterminism”, and I suggest addressing it explicitly in this RFC.
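To make this concrete: with sum-aggregated gradients, shrinking from 4 workers to 3 changes the effective global batch size from 4·B to 3·B, and the learning rate (or loss scaling) should usually be adjusted on a membership change. A toy sketch of the linear scaling rule, illustrative only and not part of the RFC:

```python
base_lr = 0.1            # learning rate tuned for the reference cluster size
base_num_workers = 4
per_worker_batch = 64


def rescaled_hyperparams(num_workers):
  """Linear scaling: keep the learning rate proportional to the global batch."""
  global_batch = per_worker_batch * num_workers
  lr = base_lr * num_workers / base_num_workers
  return global_batch, lr


print(rescaled_hyperparams(4))  # (256, 0.1)
print(rescaled_hyperparams(3))  # (192, ~0.075) after losing one worker
```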

sboshin

comment created time in a month

pull request commenttensorflow/community

[RFC] Elasticity and Fault tolerance in Multi-WorkerMirroredStrategy

Thanks for the proposal! It would be a valuable feature for distributed model training.

However, I’m not sure the current RFC is sufficient for a feasible implementation of your design. Given the complexity it adds to the TF distributed runtime, some preliminary prototypes might be helpful.

Some earlier discussion: https://github.com/tensorflow/tensorflow/issues/18880#issuecomment-390705270

sboshin

comment created time in a month

Pull request review commenttensorflow/community

RFC: Sparse Domain Isolation for Supporting large-scale Sparse Weights Training.

[Quoted RFC context, condensed: the Sparse Domain Isolation proposal, i.e. a `dynamic_embedding.Variable` backed by a group of `MutableHashTable`s, `embedding_lookup`/`embedding_lookup_sparse` returning a `TrainableWrap` that subclasses `ResourceVariable`, and reuse of the `Optimizer._resource_apply_xxx` path. The pseudo-code under review patches `BaseResourceVariable._read_variable_op` in `ops/resource_variable_ops.py` with an `if hasattr(self, "prefetch_values"):` branch so the newest values are looked up before each read.]

Perhaps override `_read_variable_op` in the subclass instead of directly modifying the base class?
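Roughly what I mean, as a sketch only: it reuses the RFC’s own `params`/`ids`/`prefetch_values` attributes and has not been validated against the real `ResourceVariable` internals:

```python
from tensorflow.python.framework import ops
from tensorflow.python.ops import resource_variable_ops


class TrainableWrap(resource_variable_ops.ResourceVariable):

  def __init__(self, params, ids, *args, **kwargs):
    self.params = params
    self.ids = ids
    self.prefetch_values = self.params.lookup(self.ids)
    super(TrainableWrap, self).__init__(*args, **kwargs)

  def _read_variable_op(self):
    # Refresh the dense buffer from the hash table before every read, keeping
    # the hook local to this subclass instead of patching BaseResourceVariable
    # for every resource variable in the process.
    with ops.control_dependencies(
        [self.assign(self.prefetch_values, read_value=False)]):
      return super(TrainableWrap, self)._read_variable_op()
```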

rhdong

comment created time in 2 months

Pull request review commenttensorflow/community

RFC: Sparse Domain Isolation for Supporting large-scale Sparse Weights Training.

[Quoted RFC context, condensed: the same proposal as above; the line under review is the `dynamic_embedding.Variable.upsert(keys, values, name=None)` method (“update if exist else insert”).]

"Upsert" is not a good choice of wording. Try more idiomatic names such as InsertOrUpdate.

rhdong

comment created time in 2 months

Pull request review commenttensorflow/community

RFC: Sparse Domain Isolation for Supporting large-scale Sparse Weights Training.

[Quoted RFC context, condensed: the same proposal as above; the line under review is the constructor check `assert initial_mode in ["random", "constant"], ...` in `dynamic_embedding.Variable.__init__`.]

No assert please.
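
For illustration only, a minimal sketch of the kind of explicit check that could replace the assert (the helper name and message are hypothetical, not from the RFC):

```python
def _validate_initial_mode(initial_mode):
  # Hypothetical helper, not part of the RFC: asserts are stripped under
  # `python -O`, so argument validation should raise a regular exception.
  if initial_mode not in ("random", "constant"):
    raise ValueError(
        "initial_mode should be 'constant' or 'random', got: {!r}".format(
            initial_mode))
```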

rhdong

comment created time in 2 months

Pull request review commenttensorflow/community

RFC: Sparse Domain Isolation for Supporting large-scale Sparse Weights Training.

Review context: the docstring of the proposed `TrainableWrap` class, which reads "This class is a trainable warp of dynamic embedding, ...".

s/warp/wrap/

rhdong

comment created time in 2 months

Pull request review commenttensorflow/community

RFC: Sparse Domain Isolation for Supporting large-scale Sparse Weights Training.

Review context: the RFC's runtime-initialization pseudo-code for `MutableHashTableOfTensors::Find` in tensorflow/core/kernels/lookup_table_op.cc.

It's an implementation detail, but consider function names like FindOrInsert to avoid confusion.
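
A toy, dict-backed sketch (not TensorFlow code; all names are illustrative) of why an insert-on-miss lookup reads more clearly under a name like `find_or_insert`:

```python
import random


class ToyTable(object):
  """Toy stand-in for a mutable hash table of embeddings."""

  def __init__(self, dim, stddev=0.1):
    self._data = {}
    self._dim = dim
    self._stddev = stddev

  def find_or_insert(self, key):
    # The name makes the side effect explicit: a missing key is initialized
    # with random values and stored, unlike what a plain `find` suggests.
    if key not in self._data:
      self._data[key] = [random.gauss(0.0, self._stddev)
                         for _ in range(self._dim)]
    return self._data[key]


table = ToyTable(dim=8)
embedding = table.find_or_insert(42)  # inserted on first access
```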

rhdong

comment created time in 2 months

Pull request review commenttensorflow/community

RFC: Sparse Domain Isolation for Supporting large-scale Sparse Weights Training.

Review context: the RFC's "Runtime random initialization" section, which states that sparse weights cannot be initialized statically like we do on `tf.Variable`, because sparse training does not allocate their memory when the train loop starts.

AFAIK the value of a tf.Variable is not initialized statically. Perhaps you mean that the shape of the variable has to be static?
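To illustrate the distinction, a minimal TF2-style sketch (my own example, not from the RFC): the shape is fixed when the variable is constructed, while the values are only produced once the initializer runs.

```python
import tensorflow as tf

# The shape is fixed at construction time...
w = tf.Variable(tf.random.normal([4, 8]), name="dense_embeddings")
print(w.shape)      # static: (4, 8)

# ...but the values are only materialized when the initializer runs
# (eagerly here; via tf.compat.v1.global_variables_initializer() in graph mode).
print(w.numpy()[0])
```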

rhdong

comment created time in 2 months

Pull request review comment tensorflow/community

RFC: Sparse Domain Isolation for Supporting large-scale Sparse Weights Training.

#### High-level APIs of computing (Optional)

* `tf.dynamic_embedding.csr_matmul`
* `tf.dynamic_embedding.coo_matmul`
* `TBD`

My personal advice would be to move them to subsequent RFC(s). Otherwise, you should add some details to these APIs, e.g. how they would work with dynamic embedding variables.
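As a concrete example of the missing detail (a hedged sketch of mine, not part of the RFC): today's sparse matmul ops expect a dense right-hand operand, so it is unclear whether `csr_matmul` would first materialize rows via `params.lookup(ids)` or operate on the hash table directly.

```python
import tensorflow as tf

# tf.sparse.sparse_dense_matmul needs a dense right-hand operand, so a
# csr_matmul over a dynamic_embedding.Variable would presumably have to
# materialize the looked-up rows first -- exactly the detail worth spelling out.
sp = tf.sparse.SparseTensor(indices=[[0, 3], [1, 7]],
                            values=[1.0, 2.0],
                            dense_shape=[2, 16])
dense_rows = tf.random.normal([16, 8])    # stand-in for params.lookup(...)
out = tf.sparse.sparse_dense_matmul(sp, dense_rows)
print(out.shape)                          # (2, 8)
```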

rhdong

comment created time in 2 months

Pull request review comment tensorflow/community

RFC: Sparse Domain Isolation for Supporting large-scale Sparse Weights Training.

#### High-level APIs of computing (Optional)

"of computing" is redundant.

rhdong

comment created time in 2 months

Pull request review comment tensorflow/community

RFC: Sparse Domain Isolation for Supporting large-scale Sparse Weights Training.

Pseudo-code under review (from `training/optimizer.py`):

```python
# training/optimizer.py

class _DenseDynamicEmbeddingTrainableProcessor(_OptimizableVariable):
  """Processor for TrainableWrap."""

  def update_op(self, optimizer, g):
    # The `update_op` is an insert-like operation of `TrainableWrap` which
    # writes the trained values held temporarily in the primary and slot
    # `TrainableWrap`s back to the hash tables.
    _apply_op = optimizer._resource_apply_dense(g, self._v)
    with ops.control_dependencies([_apply_op]):
      _update = control_flow_ops.group([self._v.update_op()] +
                                       [_s.update_op()
                                        for _s in dynamic_embedding_ops.get_slots(self._v)])
    return _update
```

I cannot find dynamic_embedding_ops defined anywhere in this document or in the current TF codebase.
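For reference, the closest existing mechanism I'm aware of is the optimizer's own per-variable slot accessors; a minimal graph-mode sketch (my example, not from the RFC) of how slots are usually reached:

```python
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()

v = tf.get_variable("w", shape=[4, 8])
loss = tf.reduce_sum(tf.square(v))
opt = tf.train.AdamOptimizer(0.1)
train_op = opt.minimize(loss)   # creates the Adam slots for `v`

# Slots live on the optimizer, keyed by (variable, slot name); there is no
# module-level get_slots() helper in TF today.
print(opt.get_slot_names())     # ['m', 'v']
print(opt.get_slot(v, "m"))     # the Adam accumulator created for `v`
```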

rhdong

comment created time in 2 months

Pull request review comment tensorflow/community

RFC: Sparse Domain Isolation for Supporting large-scale Sparse Weights Training.

[Quoted RFC diff, anchored at the proposed `tf.dynamic_embedding.embedding_lookup_sparse` API]

Same here, how about tf.nn.dynamic_embedding_lookup_sparse?

rhdong

comment created time in 2 months

Pull request review comment tensorflow/community

RFC: Sparse Domain Isolation for Supporting large-scale Sparse Weights Training.

[Quoted RFC diff, anchored at the proposed `tf.dynamic_embedding.embedding_lookup` API]

I am not sure if this deserves another top-level package name. How about tf.nn.dynamic_embedding_lookup?

rhdong

comment created time in 2 months

Pull request review comment tensorflow/community

RFC: Sparse Domain Isolation for Supporting large-scale Sparse Weights Training.

[Quoted RFC diff, anchored at the `dynamic_embedding.Variable` docstring: "It is designed to dynamically store the Sparse Weights(Parameters) of DLRMs."]

Should be "weights of sparse models such as DLRM"

rhdong

comment created time in 2 months

Pull request review comment tensorflow/community

RFC: Sparse Domain Isolation for Supporting large-scale Sparse Weights Training.

[Quoted RFC diff, anchored at the `@tf_export(v1=["dynamic_embedding.Variable"])` decorator]

TF 2.x API should be first class citizen.

rhdong

comment created time in 2 months

issue comment tensorflow/tensorflow

bazel build failed

Try bazel clean --expunge.
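
For context, a minimal sketch of the usual recovery sequence when a stale Bazel cache is suspected; the configure step and the pip-package target below are only examples, so substitute whatever you normally build:

```sh
# Drop the whole output base, including fetched external repositories.
bazel clean --expunge

# Re-run TF's configure script and rebuild your usual target
# (the pip package target here is only an example).
./configure
bazel build //tensorflow/tools/pip_package:build_pip_package
```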

LLdleo

comment created time in 2 months

issue comment tensorflow/runtime

distributed runtime

It’s not limited to TCP/IP in the sense that the communication protocol could be made pluggable in the distributed runtime.

powderluv

comment created time in 2 months

issue comment tensorflow/tensorflow

tensorflow-io import error caused by "com_google_absl"

Don't replace the TF-specified abseil-cpp with any other release because it will not work. It took me a month of full-time work to make it happen.

That's truly remarkable. Thanks for the effort!

CuiYifeng

comment created time in 2 months

issue comment tensorflow/tensorflow

tensorflow-io import error caused by "com_google_absl"

TF doesn't use abseil-cpp lts_2020_02_25, but git commit df3ea785d8c30a9503321a3d35ee7d35808f190d (labeled "LTS 2020.02.25 Patch 1") with a patch around string_view. Don't replace the TF-specified abseil-cpp with any other release because it will not work. It took me a month of full-time work to make it happen.
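
If you want to verify which commit your checkout actually pins, something like this works on a TF source tree of roughly this vintage (the dependency is declared in tensorflow/workspace.bzl; the exact file layout may differ in other versions):

```sh
# Print the com_google_absl declaration, including the pinned archive commit.
grep -n -A 6 '"com_google_absl"' tensorflow/workspace.bzl
```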

Also,

$ bazel test :substr_op_test  # macOS Catalina v10.15.4
[ ... omitted 9001 compilation warnings ... ]
//tensorflow/python/kernel_tests:substr_op_test                          PASSED in 10.2s

so you're using an official variant of TF, is that correct?

I am using the RBE builds with environment here and toolchain definitions here. Manually adding --copt=-DNDEBUG solves my issue there, so I guess it's not really related to the absl update.
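
For the record, the workaround amounts to something like the following; the target and flag are taken from this thread, and any RBE config flags would go on top:

```sh
# NDEBUG compiles out assert(), which is what trips the
# absl::string_view "i < size()" check seen in the failing test.
bazel test --copt=-DNDEBUG //tensorflow/python/kernel_tests:substr_op_test
```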

CuiYifeng

comment created time in 2 months

started uber-go/ratelimit

started time in 2 months

Pull request review comment tensorflow/community

RFC: TensorFloat-32 support in TensorFlow

[Quoted RFC diff, anchored at the "Remote devices" section of the TensorFloat-32 RFC]

So it will not be part of the cluster_device_attributes, but a new field? (Also see my comment below on updating remote context)

reedwm

comment created time in 2 months

Pull request review comment tensorflow/community

RFC: TensorFloat-32 support in TensorFlow

[Quoted RFC diff, anchored at the description of `tf.config.allow_tensor_float_32_execution`]

Does the UpdateContext RPC support this particular use case? I imagine one could use the cluster_device_attributes field to pass on the TF32 mode toggle, but it looks like it will update the remote session when cluster_device_attributes is not empty in the current codebase. Not sure if setting TF32 is worth carrying out such a heavy(?) operation.

reedwm

comment created time in 2 months

issue comment tensorflow/tensorflow

tensorflow-io import error caused by "com_google_absl"

Also cc @yongtang per this issue

CuiYifeng

comment created time in 2 months

issue comment tensorflow/tensorflow

tensorflow-io import error caused by "com_google_absl"

Another issue caused by commit 6c7e338ae7f0b0f2e224319de7e2165141c148fb:

FAIL: //tensorflow/python/kernel_tests:substr_op_test (see /home/byronyi/.cache/bazel/_bazel_tiger/af81e00f6f55d49334af721d67cec08e/execroot/org_tensorflow/bazel-out/k8-opt/testlogs/tensorflow/python/kernel_tests/substr_op_test/test.log)

FAILED: //tensorflow/python/kernel_tests:substr_op_test (Summary)
      /home/byronyi/.cache/bazel/_bazel_tiger/af81e00f6f55d49334af721d67cec08e/execroot/org_tensorflow/bazel-out/k8-opt/testlogs/tensorflow/python/kernel_tests/substr_op_test/test.log
      /home/byronyi/.cache/bazel/_bazel_tiger/af81e00f6f55d49334af721d67cec08e/execroot/org_tensorflow/bazel-out/k8-opt/testlogs/tensorflow/python/kernel_tests/substr_op_test/test_attempts/attempt_1.log
      /home/byronyi/.cache/bazel/_bazel_tiger/af81e00f6f55d49334af721d67cec08e/execroot/org_tensorflow/bazel-out/k8-opt/testlogs/tensorflow/python/kernel_tests/substr_op_test/test_attempts/attempt_2.log
INFO: From Testing //tensorflow/python/kernel_tests:substr_op_test:
==================== Test output for //tensorflow/python/kernel_tests:substr_op_test:
Running tests under Python 3.5.9: /usr/local/bin/python3.5
[ RUN      ] SubstrOpTest.testBadBroadcast(<class 'numpy.int32'>, 'BYTE')
INFO:tensorflow:time(__main__.SubstrOpTest.testBadBroadcast(<class 'numpy.int32'>, 'BYTE')): 0.02s
I0521 18:41:29.769983 140606904841984 test_util.py:1970] time(__main__.SubstrOpTest.testBadBroadcast(<class 'numpy.int32'>, 'BYTE')): 0.02s
[       OK ] SubstrOpTest.testBadBroadcast(<class 'numpy.int32'>, 'BYTE')
[ RUN      ] SubstrOpTest.testBadBroadcast(<class 'numpy.int64'>, 'BYTE')
INFO:tensorflow:time(__main__.SubstrOpTest.testBadBroadcast(<class 'numpy.int64'>, 'BYTE')): 0.0s
I0521 18:41:29.771886 140606904841984 test_util.py:1970] time(__main__.SubstrOpTest.testBadBroadcast(<class 'numpy.int64'>, 'BYTE')): 0.0s
[       OK ] SubstrOpTest.testBadBroadcast(<class 'numpy.int64'>, 'BYTE')
[ RUN      ] SubstrOpTest.testBadBroadcast(<class 'numpy.int32'>, 'UTF8_CHAR')
INFO:tensorflow:time(__main__.SubstrOpTest.testBadBroadcast(<class 'numpy.int32'>, 'UTF8_CHAR')): 0.0s
I0521 18:41:29.773532 140606904841984 test_util.py:1970] time(__main__.SubstrOpTest.testBadBroadcast(<class 'numpy.int32'>, 'UTF8_CHAR')): 0.0s
[       OK ] SubstrOpTest.testBadBroadcast(<class 'numpy.int32'>, 'UTF8_CHAR')
[ RUN      ] SubstrOpTest.testBadBroadcast(<class 'numpy.int64'>, 'UTF8_CHAR')
INFO:tensorflow:time(__main__.SubstrOpTest.testBadBroadcast(<class 'numpy.int64'>, 'UTF8_CHAR')): 0.0s
I0521 18:41:29.775230 140606904841984 test_util.py:1970] time(__main__.SubstrOpTest.testBadBroadcast(<class 'numpy.int64'>, 'UTF8_CHAR')): 0.0s
[       OK ] SubstrOpTest.testBadBroadcast(<class 'numpy.int64'>, 'UTF8_CHAR')
[ RUN      ] SubstrOpTest.testBroadcast(<class 'numpy.int32'>, 'BYTE')
2020-05-21 18:41:29.775809: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with Intel(R) MKL-DNN to use the following CPU instructions in performance-critical operations:  AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2020-05-21 18:41:29.806852: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 2400000000 Hz
2020-05-21 18:41:29.812239: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x40c7660 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-05-21 18:41:29.812276: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
INFO:tensorflow:time(__main__.SubstrOpTest.testBroadcast(<class 'numpy.int32'>, 'BYTE')): 0.05s
I0521 18:41:29.820772 140606904841984 test_util.py:1970] time(__main__.SubstrOpTest.testBroadcast(<class 'numpy.int32'>, 'BYTE')): 0.05s
[       OK ] SubstrOpTest.testBroadcast(<class 'numpy.int32'>, 'BYTE')
[ RUN      ] SubstrOpTest.testBroadcast(<class 'numpy.int64'>, 'BYTE')
INFO:tensorflow:time(__main__.SubstrOpTest.testBroadcast(<class 'numpy.int64'>, 'BYTE')): 0.0s
I0521 18:41:29.822387 140606904841984 test_util.py:1970] time(__main__.SubstrOpTest.testBroadcast(<class 'numpy.int64'>, 'BYTE')): 0.0s
[       OK ] SubstrOpTest.testBroadcast(<class 'numpy.int64'>, 'BYTE')
[ RUN      ] SubstrOpTest.testBroadcast(<class 'numpy.int32'>, 'UTF8_CHAR')
python3.5: external/com_google_absl/absl/strings/string_view.h:292: absl::lts_2020_02_25::string_view::operator[](absl::lts_2020_02_25::string_view::size_type) const::<lambda()>: Assertion `false && "i < size()"' failed.
Fatal Python error: Aborted

cc @impjdi

CuiYifeng

comment created time in 2 months

Pull request review comment tensorflow/community

RFC: TensorFloat-32 support in TensorFlow

[Quoted RFC diff, anchored at the paragraph requesting feedback on models whose accuracy may be impacted by TF32]

+1 for the transparency.

reedwm

comment created time in 2 months

Pull request review commenttensorflow/community

RFC: TensorFloat-32 support in TensorFlow

[quoted RFC text, trimmed; the comment is anchored on the "Remote devices" section, which states that enabling TF32 also affects remote Ampere GPUs connected via tf.config.experimental_connect_to_host / experimental_connect_to_cluster]

Any specific additional efforts needed here to support remote devices?

reedwm

comment created time in 2 months

Pull request review commenttensorflow/community

RFC: TensorFloat-32 support in TensorFlow

[quoted RFC text, trimmed; the comment is anchored on the proposed API call `tf.config.allow_tensor_float_32_execution(True)`]

Why not use the Keras mixed precision policy API?
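
For context, this is roughly the API I have in mind: a sketch against the TF 2.x Keras mixed precision experimental namespace (the exact module path may differ by version).

```python
import tensorflow as tf

# Set a global policy: compute in float16, keep variables in float32.
tf.keras.mixed_precision.experimental.set_policy("mixed_float16")

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(32,)),
    # Keep the output layer in float32 for numeric stability of the loss.
    tf.keras.layers.Dense(10, dtype="float32"),
])
```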

reedwm

comment created time in 2 months

starteddionhaefner/pyhpc-benchmarks

started time in 2 months

issue commenttensorflow/tensorflow

[Feature Request] Dynamic RPC Address Resolution

@saeta I think this topic might be worth revisiting with the new eager runtime. In a remote eager setup, e.g. a VM connected to remote GPUs or a Cloud TPU, one calls connect_to_cluster and places the default context on the remote worker. If the eager service crashes, it is not easy for the user to recover by re-running their computations, which could be notebook cells evaluated without explicit checkpoints. Any chance we could support eager context switching or checkpointing on the service side that is transparent to the client?
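
For concreteness, a minimal sketch of the setup I am describing; the worker address and device string below are placeholders, not taken from any real deployment.

```python
import tensorflow as tf

# Hypothetical remote worker endpoint; substitute your own cluster address.
tf.config.experimental_connect_to_host("10.0.0.2:8470", job_name="worker")

with tf.device("/job:worker/replica:0/task:0/device:CPU:0"):
    x = tf.constant([[1.0, 2.0], [3.0, 4.0]])
    y = tf.matmul(x, x)  # executes on the remote worker

# If the remote eager service crashes at this point, `y` and all other
# remote state are gone, and the notebook cells above must be re-run.
```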

make1980

comment created time in 2 months

startedmicrosoft/onnxruntime

started time in 2 months

startedherumi/xbyak

started time in 2 months

startedkeras-team/keras-cv

started time in 2 months

delete branch byronyi/tensorflow

delete branch : r2.1

delete time in 2 months

push eventbyronyi/tensorflow

A. Unique TensorFlower

commit sha 1d84d3ba47a626de6f0a231bf3b565b9ea5354c7

Remove direct dependency on the static libcudart; it is now linked dynamically via the stub everywhere. PiperOrigin-RevId: 301974345 Change-Id: I041786954d8aaa22bf76fdeeab48b08fbe7c2ec0

view details

Bairen Yi

commit sha 0697dd71e75c604d47509a7a6c307ce7112cd4ea

build: Remove cudart static library as it is needed nowhere Signed-off-by: Bairen Yi <yibairen.byron@bytedance.com>

view details

Bairen Yi

commit sha 7bff6feee43cd5afc0e563c854088060147dd43e

Update build.sh Signed-off-by: Bairen Yi <yibairen.byron@bytedance.com>

view details

push time in 2 months

create branch byronyi/tensorflow

branch : r2.1

created branch time in 2 months

issue commenttensorflow/tensorflow

Release GPU Memory(VRAM) after tf.keras.backend.clear_session()

It is currently not possible without exiting the Python process, because many TF internal objects (e.g. the GPU memory pool and device streams) do not support a clean shutdown.
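
A workaround sketch, not an official API: isolate each TF job in a child process so the driver reclaims the VRAM when that process exits.

```python
import multiprocessing as mp

def run_job():
    import tensorflow as tf  # initialize TF/CUDA only inside the child
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
    model.compile(optimizer="sgd", loss="mse")
    # ... train / evaluate here ...

if __name__ == "__main__":
    ctx = mp.get_context("spawn")  # avoid forking an initialized CUDA context
    p = ctx.Process(target=run_job)
    p.start()
    p.join()  # GPU memory is released when the child process exits
```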

Arktius

comment created time in 2 months

issue commenttensorflow/tensorflow

Efficient allreduce is not supported Issue

IndexedSlices is a sparse data structure, so TF uses allgather for its AllReduce. It probably doesn't matter if you are only using 2 GPUs, but you could consider converting the sparse gradients to dense tensors for better AllReduce support.
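
A rough sketch of the densify workaround; the helper below is made up for illustration and simply converts IndexedSlices gradients to dense tensors before the allreduce.

```python
import tensorflow as tf

def densify_gradients(grads_and_vars):
    """Convert IndexedSlices gradients into dense tensors."""
    dense = []
    for grad, var in grads_and_vars:
        if isinstance(grad, tf.IndexedSlices):
            grad = tf.convert_to_tensor(grad)  # gather slices into a dense tensor
        dense.append((grad, var))
    return dense
```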

cc @dubey

Nixon59-lab

comment created time in 2 months

push eventtensorflow/networking

Bairen Yi

commit sha 4333317880e6edbe23fc5289b48d629ad04d4ec0

grpc+seastar: Add grpc+seastar protocol for WorkerService Imported from https://github.com/tensorflow/tensorflow/pull/27454 Authored-by: LIU Tongxuan <tongxuan.ltx@alibaba-inc.com> Signed-off-by: Bairen Yi <yibairen.byron@bytedance.com>

view details

tongxuan.ltx

commit sha 76b77cd6c761f2f414010ba224dc91005cc021c2

Remove LGPL header only thirdparty: xfs, gnutls, systemtap-sdt, uuid, sctp

view details

Bairen Yi

commit sha a2cfbcc8712d88f33efa9b07d18ca3dd29f3b202

Revert "Remove LGPL header only thirdparty: xfs, gnutls, systemtap-sdt, uuid, sctp" This reverts commit 96952a2842901202fec6382e4e0db0dba3fec712. Signed-off-by: Bairen Yi <yibairen.byron@bytedance.com>

view details

Bairen Yi

commit sha e1c7ae09923293f9cce5eba0a159e75993b793be

Merge pull request #21 from byronyi/grpc+seastar grpc+seastar: add grpc+seastar protocol

view details

push time in 2 months

PR merged tensorflow/networking

grpc+seastar: add grpc+seastar protocol

Imported from https://github.com/tensorflow/tensorflow/pull/30168, re-submitted as a networking plugin.

Courtesy @liutongxuan @shanshanpt @YongCHN @lilbedwin from Alibaba PAI.

Changes since last submission:

  1. Rebased seastar to upstream release 19.06, instead of a bundled fork from AlibabaPAI/seastar.
  2. Un-bundled boost, c-ares, fmtlib, ragel, SCTP, GnuTLS, XFS, UUID, sdt, cryptopp, lz4, yaml-cpp, and readerwriterqueue. These third-party libraries are now explicit (transitive) dependencies of grpc+seastar.

Design doc RFC: https://docs.google.com/document/d/1f1m-98rbH33WE0qNb3tP0yt9Jjbb-rprvweLobRbTCA

+6291 -25

27 comments

56 changed files

byronyi

pr closed time in 2 months

startedjart/gosip

started time in 2 months

pull request commenttensorflow/networking

grpc+seastar: add grpc+seastar protocol

@liutongxuan @gunan I've squashed the commits and reverted the patch that disables the LGPL-2.1 deps, so we now have a patch that only makes the necessary changes to Seastar.

Let me know if you have found any issues. Thanks!

byronyi

comment created time in 2 months

push eventbyronyi/networking

Bairen Yi

commit sha 4333317880e6edbe23fc5289b48d629ad04d4ec0

grpc+seastar: Add grpc+seastar protocol for WorkerService Imported from https://github.com/tensorflow/tensorflow/pull/27454 Authored-by: LIU Tongxuan <tongxuan.ltx@alibaba-inc.com> Signed-off-by: Bairen Yi <yibairen.byron@bytedance.com>

view details

tongxuan.ltx

commit sha 76b77cd6c761f2f414010ba224dc91005cc021c2

Remove LGPL header only thirdparty: xfs, gnutls, systemtap-sdt, uuid, sctp

view details

Bairen Yi

commit sha a2cfbcc8712d88f33efa9b07d18ca3dd29f3b202

Revert "Remove LGPL header only thirdparty: xfs, gnutls, systemtap-sdt, uuid, sctp" This reverts commit 96952a2842901202fec6382e4e0db0dba3fec712. Signed-off-by: Bairen Yi <yibairen.byron@bytedance.com>

view details

push time in 2 months

issue commentbazelbuild/bazel

Allow selecting different python paths for different platforms

Sure, I will open an issue on the TF side.

r4nt

comment created time in 2 months

issue commentfacebookresearch/detectron2

AssertionError: cuda is not available after installation

Fixed internally. Kinda embarrassing...

tengerye

comment created time in 2 months

startedmortennobel/cpp-cheatsheet

started time in 2 months

pull request commenttensorflow/networking

grpc+seastar: add grpc+seastar protocol

@gunan Seems we are stuck on the CLA. Would you mind helping to verify the CLA status manually? All the authors have commented above.

byronyi

comment created time in 2 months

issue commentbazelbuild/bazel

Allow selecting different python paths for different platforms

ubuntu16.04-gcc7_manylinux2014-cuda10.1-cudnn7-tensorrt6.0 is a port of the official manylinux2010 image; it installs the Python versions under the /usr prefix due to my environment constraints.

/usr/bin/python3 points to /usr/bin/python3.5.

    tensorflow_rbe_config(
        name = "ubuntu16.04-gcc7_manylinux2014-cuda10.1-cudnn7-tensorrt6.0",
        compiler = "/dt7/usr/bin/gcc",
        compiler_prefix = "/usr/bin",
        cuda_version = "10.1",
        cudnn_version = "7",
        os = "ubuntu16.04-manylinux2014-multipython",
        python_versions = ["2.7", "3.5", "3.6", "3.7", "3.8"],
        tensorrt_install_path = "/usr",
        tensorrt_version = "6.0",
    )

Bazel command line:

bazel build \
    --action_env=BAZEL_DO_NOT_DETECT_CPP_TOOLCHAIN=1 \
    --config=avx2_linux \
    --config=short_logs \
    --config=xla \
    --crosstool_top="@ubuntu16.04-gcc7_manylinux2014-cuda10.1-cudnn7-tensorrt6.0//crosstool:toolchain" \
    --define=EXECUTOR=remote \
    --define=using_cuda_nvcc=true \
    --extra_execution_platforms="@ubuntu16.04-gcc7_manylinux2014-cuda10.1-cudnn7-tensorrt6.0_config_platform//:platform" \
    --extra_toolchains="@ubuntu16.04-gcc7_manylinux2014-cuda10.1-cudnn7-tensorrt6.0_config_cuda//crosstool:toolchain-linux-x86_64" \
    --jobs=192 \
    --platforms="@ubuntu16.04-gcc7_manylinux2014-cuda10.1-cudnn7-tensorrt6.0_config_platform//:platform" \
    --python_path="/usr/bin/python3.6" \
    --remote_cache=$CI_CLUSTER \
    --remote_download_toplevel \
    --remote_executor=$CI_CLUSTER \
    --remote_timeout=3600 \
    --repo_env=PYTHON_BIN_PATH="/usr/bin/python3.6" \
    --repo_env=TF_NEED_CUDA=1 \
    --repo_env=TF_NEED_TENSORRT=1 \
    --repo_env=TF_CUDA_VERSION=10.1 \
    --repo_env=TF_CUDNN_VERSION=7 \
    --repo_env=TF_CUDA_CONFIG_REPO="@ubuntu16.04-gcc7_manylinux2014-cuda10.1-cudnn7-tensorrt6.0_config_cuda" \
    --repo_env=TF_NCCL_CONFIG_REPO="@ubuntu16.04-gcc7_manylinux2014-cuda10.1-cudnn7-tensorrt6.0_config_nccl" \
    --repo_env=TF_PYTHON_CONFIG_REPO="@ubuntu16.04-gcc7_manylinux2014-cuda10.1-cudnn7-tensorrt6.0_config_python3.6" \
    --repo_env=TF_TENSORRT_CONFIG_REPO="@ubuntu16.04-gcc7_manylinux2014-cuda10.1-cudnn7-tensorrt6.0_config_tensorrt" \
    --spawn_strategy=remote,worker,sandboxed,local \
  //tensorflow/tools/pip_package:build_pip_package
r4nt

comment created time in 2 months

issue commentbazelbuild/bazel

Allow selecting different python paths for different platforms

@byronyi - this shouldn't be a blocker for TF multi-python, as we're working around that by tagging the rules.

Not with https://github.com/tensorflow/tensorflow/commit/9ca9a4f8e9b4d60408a092db4320805711367042 added. The build works with the first Python version, but when I build with the next Python version, it incorrectly hits the cache and pybind11 complains about a Python version mismatch.

I can paste a detailed log of this error if that would help.

r4nt

comment created time in 2 months

pull request commenttensorflow/tensorflow

Support Keras grouped convolutions

This seems to have been reverted in dd2ea875d92eeb83e81b1cb92e29e61d488e98b2.

lgeiger

comment created time in 2 months

pull request commenttensorflow/networking

grpc+seastar: add grpc+seastar protocol

@googlebot I fixed it.

byronyi

comment created time in 2 months

issue commentgrpc/grpc

ServerBuilder::RegisterAsyncService() does not exist

Same question here.

qlyoung

comment created time in 2 months

pull request commenttensorflow/networking

grpc+seastar: add grpc+seastar protocol

@lilbedwin @shanshanpt @YongCHN would you mind confirming the CLA status manually?

byronyi

comment created time in 2 months

Pull request review commenttensorflow/networking

grpc+seastar: add grpc+seastar protocol

[quoted WORKSPACE diff, trimmed: third-party http_archive stanzas; this thread is anchored on the `ragel` entry pointing at http://www.colm.net/files/ragel/ragel-6.10.tar.gz]

Fixed. @gunan PTAL

byronyi

comment created time in 2 months

push eventbyronyi/networking

Bairen Yi

commit sha ee3d79784b65f3d08e078261dfe6dc6a3ff1afac

ragel: Bump Ragel version to 7.0+ which is licensed under MIT Signed-off-by: Bairen Yi <yibairen.byron@bytedance.com>

view details

Bairen Yi

commit sha 661767e73251f283348552e43a016801934006d4

seastar: Make license of seastar explicit Signed-off-by: Bairen Yi <yibairen.byron@bytedance.com>

view details

push time in 2 months


PR opened tensorflow/serving

Fix typo
+1 -1

0 comments

1 changed file

pr created time in 2 months

push eventbyronyi/serving

Bairen Yi

commit sha dbcd54fc323a87b459de36a6f4ab9d03cc4679d2

Fix typo

view details

push time in 2 months

fork byronyi/serving

A flexible, high-performance serving system for machine learning models

https://www.tensorflow.org/serving

fork in 2 months

issue commenttensorflow/runtime

All registered kernel functions are c++ function now, how to register MLIR compiled function?

TFRT will support different flavors of kernels: codegen'ed (via xla, mlir, or other technology), hand-written (e.g. via Eigen), and library-based (e.g. by calling into cuDNN).

codegen'ed kernels can be AOT (ahead of time) generated or jitted. Kernel codegen is being worked on, but we don't yet have an open sourcing timeline for you.

TFRT can call into (most of the) existing TF ops and kernels. We will share some pointers when the work is ready.

My hunch is that it looks like a linker for binaries, whether AOT- or JIT-compiled. Assuming we do not need link-time optimization, a linker boils down to the calling convention.

Being interoperable is great for downstream library authors.

shanshanpt

comment created time in 2 months

Pull request review commenttensorflow/networking

grpc+seastar: add grpc+seastar protocol

[quoted WORKSPACE diff, trimmed: third-party http_archive stanzas; this thread is anchored on the `lz4` entry]

Fixed.

byronyi

comment created time in 2 months

push eventbyronyi/networking

Bairen Yi

commit sha 672ef6d5daeeedeee4adfdb4d1440e39fbd4c209

lz4: Strip lib/ prefix for BSD 2-clause licensed files only Signed-off-by: Bairen Yi <yibairen.byron@bytedance.com>

view details

push time in 2 months

Pull request review commenttensorflow/networking

grpc+seastar: add grpc+seastar protocol

[quoted WORKSPACE diff, trimmed: third-party http_archive stanzas; this thread is anchored on the `lz4` entry]

Sorry, I didn't look carefully. Will fix.

byronyi

comment created time in 2 months

Pull request review commenttensorflow/networking

grpc+seastar: add grpc+seastar protocol

[quoted WORKSPACE diff, trimmed: third-party http_archive stanzas; this thread is anchored on the `ragel` entry pointing at http://www.colm.net/files/ragel/ragel-6.10.tar.gz]

Or we can change it to 7.0.0+dev, which is licensed under MIT. Let me know if that's preferred.

byronyi

comment created time in 2 months

more