
tensorflow/estimator (172): TensorFlow Estimator

rchao/community (0): Stores documents used by the TensorFlow developer community

rchao/data_challenge (0): Data Challenge

rchao/keras (0): Deep Learning for humans

rchao/tracing-framework (0): Web Tracing Framework libraries and extensions.

Pull request comment on tensorflow/tensorflow

Log learning rate to TensorBoard when using schedules

@lgeiger sorry for the delay - I'd like to defer to @omalleyt12 given his familiarity with the TensorBoard callback.
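For context, a minimal sketch of the kind of logging this PR is about, assuming the optimizer was given a `LearningRateSchedule`; this is illustrative, not the implementation under review:

```python
import tensorflow as tf

class LearningRateLogger(tf.keras.callbacks.Callback):
  """Hypothetical callback that writes the current learning rate to TensorBoard."""

  def __init__(self, log_dir):
    super().__init__()
    self._writer = tf.summary.create_file_writer(log_dir)

  def on_epoch_end(self, epoch, logs=None):
    lr = self.model.optimizer.learning_rate
    if isinstance(lr, tf.keras.optimizers.schedules.LearningRateSchedule):
      # Evaluate the schedule at the optimizer's current step.
      lr = lr(self.model.optimizer.iterations)
    with self._writer.as_default():
      tf.summary.scalar("learning_rate", tf.convert_to_tensor(lr), step=epoch)
```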

lgeiger

comment created time in 12 days

Pull request review comment on tensorflow/tensorflow

Fixed _save_model not working for batches in ModelCheckpoint Callback

  def _should_save_on_batch(self, batch):
      return True
    return False

-  def _save_model(self, epoch, logs):
+  def _save_model(self, epoch, batch, logs):
     """Saves the model.

     Arguments:
         epoch: the epoch this iteration is in.
+        batch: the batch this iteration is in. `None` if the `save_freq`
+          is set to `epoch`.

Can this fit in the previous line?
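For context, a hedged sketch of the user-facing pattern this change enables; the exact placeholder names and values are assumptions based on the diff above, not the merged API:

```python
import tensorflow as tf

# Hypothetical usage once `_save_model` takes a `batch` argument: checkpoint
# every 100 batches and embed the batch index in the file name via a `{batch}`
# placeholder, alongside the existing `{epoch}` placeholder.
checkpoint_cb = tf.keras.callbacks.ModelCheckpoint(
    filepath="/tmp/ckpt_epoch{epoch:02d}_batch{batch:04d}.h5",
    save_freq=100,  # save every 100 batches instead of every epoch
    save_weights_only=True)
```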

ashutosh1919

comment created time in a month

Pull request review comment on tensorflow/tensorflow

Fixed _save_model not working for batches in ModelCheckpoint Callback

  def _save_model(self, epoch, logs):
                        'ModelCheckpoint. Filepath used is an existing '
                        'directory: {}'.format(filepath))

-  def _get_file_path(self, epoch, logs):
+  def _get_file_path(self, epoch, batch, logs):
     """Returns the file path for checkpoint."""
     # pylint: disable=protected-access
     try:
       # `filepath` may contain placeholders such as `{epoch:02d}` and
-      # `{mape:.2f}`. A mismatch between logged metrics and the path's
+      # `{batch:02d}`. A mismatch between logged metrics and the path's
       # placeholders can cause formatting to fail.
-      file_path = self.filepath.format(epoch=epoch + 1, **logs)
+      if not batch:
+        file_path = self.filepath.format(epoch=epoch + 1, **logs)
+      else:
+        file_path = self.filepath.format(
+          epoch=epoch + 1,

Can this fit in one line?
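A minimal sketch of a shorter form, assuming `batch` is `None` when `save_freq` is `'epoch'` (per the docstring above); whether it fits within the line-length limit depends on the surrounding indentation:

```python
# Hypothetical condensed version of the branch shown in the diff above.
if batch is None:
  file_path = self.filepath.format(epoch=epoch + 1, **logs)
else:
  file_path = self.filepath.format(epoch=epoch + 1, batch=batch + 1, **logs)
```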

ashutosh1919

comment created time in a month

Pull request review comment on tensorflow/tensorflow

Fixed _save_model not working for batches in ModelCheckpoint Callback

  def _save_model(self, epoch, logs):
                        'ModelCheckpoint. Filepath used is an existing '
                        'directory: {}'.format(filepath))

-  def _get_file_path(self, epoch, logs):
+  def _get_file_path(self, epoch, batch, logs):
     """Returns the file path for checkpoint."""
     # pylint: disable=protected-access
-    try:
-      # `filepath` may contain placeholders such as `{epoch:02d}` and
-      # `{mape:.2f}`. A mismatch between logged metrics and the path's
-      # placeholders can cause formatting to fail.
-      file_path = self.filepath.format(epoch=epoch + 1, **logs)
-    except KeyError as e:
-      raise KeyError('Failed to format this callback filepath: "{}". '
-                     'Reason: {}'.format(self.filepath, e))
-    self._write_filepath = distributed_file_utils.write_filepath(
+    if not self.model._in_multi_worker_mode(

We shouldn't need this if-else anymore, since `distributed_file_utils.write_filepath` would take care of the writing path. Can you fix this?
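As a sketch of what the simplified version might look like, assuming `distributed_file_utils.write_filepath` redirects non-chief workers to a temporary directory (as the comment implies); this mirrors the structure in the diff rather than the final merged patch:

```python
def _get_file_path(self, epoch, batch, logs):
  """Returns the file path for checkpoint."""
  # pylint: disable=protected-access
  try:
    # `filepath` may contain placeholders such as `{epoch:02d}` and
    # `{batch:02d}`. A mismatch between logged metrics and the path's
    # placeholders can cause formatting to fail.
    if batch is None:
      file_path = self.filepath.format(epoch=epoch + 1, **logs)
    else:
      file_path = self.filepath.format(epoch=epoch + 1, batch=batch + 1, **logs)
  except KeyError as e:
    raise KeyError('Failed to format this callback filepath: "{}". '
                   'Reason: {}'.format(self.filepath, e))
  # No explicit multi-worker branch: write_filepath already redirects
  # non-chief workers to a temporary directory.
  self._write_filepath = distributed_file_utils.write_filepath(
      file_path, self.model.distribute_strategy)
  return self._write_filepath
```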

ashutosh1919

comment created time in a month

Pull request review comment on tensorflow/community

Single-Client Parameter Server Training

Reviewed excerpt (from the proposed `schedule` API in the RFC):

```python
  def schedule(self, replica_fn, args=(), kwargs=()):
    """Schedule the `replica_fn` on all replicas in a sync group (a worker).

    Schedule the `replica_fn` on all replicas in a sync group that is available,
    returns a future of PerReplica immediately if `function` has return values.

    It implements at-least-once semantics for function execution. If a worker
    fails, it will try to reschedule the function on another replica group or
    throw an exception to users. So this method assumes that function execution
    can be out of order and function inputs are shared between sync groups.

    We don't support the cases where `args` or `kwargs` are bound to a specific
    sync group. We will consider supporting them in the future.
```

Sg, I'll do that

yuefengz

comment created time in 2 months

Pull request review comment on tensorflow/community

Single-Client Parameter Server Training

Reviewed excerpt (the paragraph under discussion, from the proposed `schedule` API docstring in the RFC):

```python
  def schedule(self, replica_fn, args=(), kwargs=()):
    """Schedule the `replica_fn` on all replicas in a sync group (a worker).

    ...

    We don't support the cases where `args` or `kwargs` are bound to a specific
    sync group. We will consider supporting them in the future.
```

@yuefengz How about we remove this paragraph to avoid confusion, since we don't support it anyway? We can always add it back when it's supported.

yuefengz

comment created time in 2 months

Pull request review comment on tensorflow/community

Single-Client Parameter Server Training

Reviewed excerpt (from the RFC's "Evaluation" section):

Historically, `tf.estimator.Estimator` uses a dedicated evaluator that periodically loads from a checkpoint and performs evaluation with evaluation data. However, `tf.keras` typically evaluates in an alternating manner after every epoch of training, and this is also the case with `tf.keras` + `MultiWorkerMirroredStrategy`.

With `ParameterServerStrategyV2`, we will start with a dedicated **evaluator that runs alongside the training cluster**, aka **“sidecar evaluation”**; in this scheme, the training client is required to generate checkpoints periodically, and the evaluator reads the latest checkpoint as it becomes available. The evaluation is asynchronous to the training progress. With this we provide, with the Keras API, the functionality Estimator has been able to provide, which is important for attracting Estimator users to TF 2.0.

For custom training loop, yes, it is their responsibility (and we should update the example accordingly). One thing we're working on is an EvaluationLoop API with which they gain automatic sidecar eval. If they use this, they are forced to provide a checkpoint_dir to __init__, matching what they use in the client when saving.

In the future, once we integrate with Keras model.fit(), users should use a callback as they would without distributed training.
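To make the sidecar idea concrete, here is a rough sketch of an evaluation loop tied to a checkpoint directory, using the existing `tf.train.checkpoints_iterator` utility; the `EvaluationLoop` API mentioned above is not finalized, so this is illustrative only and the function and argument names are assumptions:

```python
import tensorflow as tf

def run_sidecar_evaluation(checkpoint_dir, model, eval_dataset, eval_accuracy):
  """Illustrative sidecar evaluation driven by new checkpoints in checkpoint_dir."""
  checkpoint = tf.train.Checkpoint(model=model)
  # Blocks until a checkpoint newer than the previous one appears; checkpoint_dir
  # must match the directory the training client saves into.
  for checkpoint_path in tf.train.checkpoints_iterator(checkpoint_dir):
    checkpoint.restore(checkpoint_path).expect_partial()
    eval_accuracy.reset_states()
    for x, y in eval_dataset:
      eval_accuracy.update_state(y, model(x, training=False))
    # Print, export to a file, or early-stop based on the result.
    print("Evaluated %s: accuracy = %f" % (checkpoint_path, eval_accuracy.result()))
```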

yuefengz

comment created time in 2 months

Pull request review comment on tensorflow/community

Single-Client Parameter Server Training

Reviewed excerpt (from the RFC's "Evaluation" section, describing the separate evaluation client):

With our recommendation, users should create a separate evaluation client that runs the same python binary as the training client. This python binary will contain the if-else clause as it bifurcates into two paths:

```python
if cluster_resolver.task_type == "chief":
  run_training_loop()
elif cluster_resolver.task_type == "evaluator":
  run_evaluation_loop()
```

Evaluation code:

```python
def run_evaluation_loop(...):
  """Run the example custom evaluation loop."""

  eval_dataset, model, eval_accuracy = ...
  checkpoint = tf.train.Checkpoint(model=model)

  @tf.function
  def eval_fn(eval_dataset):
    for _ in range(eval_steps):
      # evaluation pass
    return eval_accuracy.result()

  while True:
    latest_checkpoint = get_new_checkpoint()
    checkpoint.restore(latest_checkpoint)
    eval_result = eval_fn(iterator)  # Users can print, early stop, mark ckpt..
```

In the evaluation client, the user loads the checkpoints that were periodically saved into the model (by the training client), does evaluation over a full pass of the eval dataset, and does whatever they want to do with the eval results. Examples include exporting them to files which can be read by the training client for actions (such as reducing learning rate, early stopping, etc.).

One option yuefengz@ and I discussed is inline-evaluation-by-single-evaluator, where one schedules an evaluation function to the dedicated evaluator job. That way, a variable (on ps) can be used for the evaluator to update, and for the client to read and act accordingly.
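A hedged sketch of that inline-evaluation idea; targeting the dedicated evaluator job from `schedule` and the exact metric placement are assumptions drawn from this discussion, not a committed API:

```python
# Metric created under strategy.scope() so its variables live on a parameter
# server, readable by both the evaluator and the client.
with strategy.scope():
  eval_accuracy = tf.keras.metrics.CategoricalAccuracy(name="eval_accuracy")

@tf.function
def eval_fn(eval_iterator):
  x, y = next(eval_iterator)
  eval_accuracy.update_state(y, model(x, training=False))

# Hypothetically dispatched to the dedicated "evaluator" task instead of a worker.
strategy.schedule(eval_fn, args=(eval_iterator,))
strategy.join()
# The client reads the PS-backed metric and acts on it (early stopping, etc.).
print("Eval accuracy: %f" % eval_accuracy.result())
```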

yuefengz

comment created time in 2 months

Pull request review comment on tensorflow/community

Single-Client Parameter Server Training

Reviewed excerpt (from the RFC's "Parameter server failure" section):

When a parameter server fails, the error will be propagated to the client via workers. Since the latest values of variables on the failed parameter servers are gone, there is no way for the client to recover them. Therefore the training will pause until the failed parameter server is back. The client then needs to clean up other variables on other parameter servers, rebuild all the variables and load variable values from a checkpoint.
To trigger this process, the simplest method is to restart the client as well.+++##### Client failure++When a client fails, some scheduled functions will continue to run on workers. No new functions will be scheduled. When the client comes back, it will create variables, load from a checkpoint, schedule functions with a new context id. All the old variables will be garbage-collected when we reset their eager contexts.++#### Resource Management for Workers++When a worker has recovered from failure, we will need to rebuild iterators, worker-local variables, lookup tables and other resources on that worker that don’t need to be read from a checkpoint. This means that the client will have to keep track of these iterators, worker-local variables and other resources.++Keeping track of resources and rebuilding them will be achieved depending how users create their resources:+++*   we will record iterators created via `tf.distribute`’s API; The state of a rebuilt iterator will be lost. We can recover their states as future work.+*   we will capture the creation of worker-local variables via variable creator scopes.+*   in the future we will provide users an API to create worker-local resources. We will capture these resources in the API.++If users create iterators or other resources inside a function but don’t expose them as outputs, we will not rebuild them.+++#### The Unknown of Scheduled Functions++For functions that have been scheduled, it is difficult for the client to know whether they have actually been executed or not when the client detects their corresponding worker failure. Therefore, in addition to inform users of this uncertainty in the case of worker failure, we should do the following to reduce this uncertainty:+++*   keep the number of scheduled but not executed functions small. This may be difficult to achieve since there is not an easy way for the client to know whether a function is executed or not. The only way is to synchronize the executor. Therefore, as a workaround we will have to periodically synchronize the executor to make sure functions are actually executed, before the client schedules more functions. In the long run, we should get acknowledgement from runtime about how many functions have been executed.+*   eagerly fetch the outputs of remote functions once the outputs are passed to `strategy.local_result`. In this way, we can know the status of function execution earlier.+*   recommend users schedule only small functions. Large functions are more expensive to retry.+++### Evaluation++Historically, `tf.estimator.Estimator` uses a dedicated evaluator that periodically loads from a checkpoint, and performs evaluation with evaluation data. However `tf.keras` typically evaluates in an alternating manner after every epoch of training, and this is also the case with `tf.keras` + `MultiWorkerMirroredStrategy`. ++With `ParameterServerStrategyV2`, we will start with a dedicated** evaluator that runs alongside the training cluster**, **aka “sidecar evaluation”**; in this scheme, training client is required to generate checkpoints periodically, and the evaluator reads the latest checkpoint as it becomes available. The evaluation is asynchronous to the training progress. With this we provide the functionality Estimator has been able to with Keras API, which is important to attract updates from Estimator users to TF 2.0.++With our recommendation, users should create a separate evaluation client that runs the same python binary as the training client. 
This python binary will contain the if-else clause as it bifurcates into two paths:+++```python+if cluster_resolver.task_type == "chief":+  run_training_loop()+elif cluster_resolver.task_type == "evaluator":+  run_evaluation_loop()+```+++Evaluation code:+++```python+def run_evaluation_loop(...):+  """Run the example custom evaluation loop."""+  +  eval_dataset, model, eval_accuracy = ...+  checkpoint = tf.train.Checkpoint(model=model)++  @tf.function+  def eval_fn(eval_dataset):+    for _ in range(eval_steps):+      # evaluation pass+    return eval_accuracy.result()++  while True:
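The hunk stops exactly where the checkpoint-polling logic would go, and the RFC does not define `get_new_checkpoint()`. Purely as a sketch, assuming the standard `tf.train.checkpoints_iterator` utility and a hypothetical `build_model_and_metric` helper, the evaluator's loop could look like this:

```python
import tensorflow as tf

def run_evaluation_loop(checkpoint_dir, eval_dataset, eval_steps):
  """Sketch of a sidecar evaluator that waits for new checkpoints."""
  model, eval_accuracy = build_model_and_metric()  # hypothetical helper
  checkpoint = tf.train.Checkpoint(model=model)

  @tf.function
  def eval_fn(iterator):
    for _ in range(eval_steps):
      x, y = next(iterator)
      eval_accuracy.update_state(y, model(x, training=False))
    return eval_accuracy.result()

  # checkpoints_iterator blocks until a checkpoint newer than the last one
  # it yielded appears under `checkpoint_dir`, then yields its path.
  for ckpt_path in tf.train.checkpoints_iterator(
      checkpoint_dir, min_interval_secs=30):
    checkpoint.restore(ckpt_path).expect_partial()
    eval_accuracy.reset_states()
    result = eval_fn(iter(eval_dataset))
    print("Evaluated %s: accuracy = %f" % (ckpt_path, float(result)))
```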

Exactly and agreed. We're working on this and will update the design doc accordingly. Thanks for pointing this out.

yuefengz

comment created time in 2 months

Pull request review comment tensorflow/community

Single-Client Parameter Server Training

[Quoted diff hunk: the same RFC text again, from the title through the "Evaluation" section, ending at the sentence the comment below responds to:]

With our recommendation, users should create a separate evaluation client that runs the same python binary as the training client. This python binary will contain the if-else clause as it bifurcates into two paths:

There is flexibility to do it either way. We were assuming the convenience of letting every job run the same binary, regardless of whether it is the chief, a worker, a ps, or the evaluator. But yes, users are free to have two or even more binaries for different jobs.
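To make the single-binary setup concrete, here is a minimal sketch of an entrypoint that covers all four job types. It assumes a `TFConfigClusterResolver`, reuses `run_training_loop` / `run_evaluation_loop` from the RFC, and the worker/ps branches (starting a plain `tf.distribute.Server`) are an illustrative assumption rather than part of the proposal:

```python
import tensorflow as tf

cluster_resolver = tf.distribute.cluster_resolver.TFConfigClusterResolver()

if cluster_resolver.task_type in ("worker", "ps"):
  # Workers and parameter servers run a plain TensorFlow server and wait
  # for the single client (the chief) to connect and schedule work.
  server = tf.distribute.Server(
      cluster_resolver.cluster_spec(),
      job_name=cluster_resolver.task_type,
      task_index=cluster_resolver.task_id,
      protocol="grpc")
  server.join()
elif cluster_resolver.task_type == "chief":
  run_training_loop()
elif cluster_resolver.task_type == "evaluator":
  run_evaluation_loop()
```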

yuefengz

comment created time in 2 months

Pull request review comment tensorflow/community

Single-Client Parameter Server Training

[Quoted diff hunk: the same RFC text again, from the title through the "Fault Tolerance" and "Resource Management for Workers" sections, ending at the opening paragraph of "The Unknown of Scheduled Functions":]

#### The Unknown of Scheduled Functions

For functions that have been scheduled, it is difficult for the client to know whether they have actually been executed when it detects the failure of their corresponding worker. Therefore, in addition to informing users of this uncertainty in the case of worker failure, we should do the following to reduce it:

Our current design lets users receive an object from strategy.schedule(), and calling local_results(object) on it returns the result. If the function was not executed, or only partially executed, because of worker preemption, the strategy will find another worker to take over. For other kinds of errors, it is possible to design it so that such errors are surfaced when strategy.local_results() or strategy.join() is called. If a parameter server fails, the whole cluster would be restarted and a previously saved checkpoint loaded.
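A minimal sketch of what that contract could look like on the client side, written against the `schedule`/`join`/`local_results` primitives proposed in this RFC and reusing the names from its custom-training-loop example; the exception type and the restart-and-restore behavior are illustrative assumptions:

```python
import tensorflow as tf

# `model`, `optimizer`, `strategy`, `replica_fn`, `distributed_dataset`,
# `ckpt_dir`, `num_epochs` and `steps_per_epoch` are taken from the RFC's
# custom-training-loop example.
checkpoint = tf.train.Checkpoint(model=model, optimizer=optimizer)
manager = tf.train.CheckpointManager(checkpoint, directory=ckpt_dir,
                                     max_to_keep=3)
if manager.latest_checkpoint:
  checkpoint.restore(manager.latest_checkpoint)

for epoch in range(num_epochs):
  distributed_iter = iter(distributed_dataset)
  last_loss = None
  for _ in range(steps_per_epoch):
    # Returns a future-like object immediately; worker preemptions are
    # absorbed by the strategy rescheduling the function elsewhere.
    last_loss = strategy.schedule(replica_fn, args=(distributed_iter,))
  try:
    # Errors the strategy cannot absorb (e.g. a parameter server failure)
    # are expected to surface here or from local_results().
    strategy.join()
  except tf.errors.UnavailableError:
    # Proposed recovery path: restart the cluster and resume from the
    # checkpoint saved by the manager below.
    raise
  manager.save()
  print("Epoch %d, last scheduled loss: %s"
        % (epoch, strategy.local_results(last_loss)))
```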

yuefengz

comment created time in 2 months

Pull request review comment tensorflow/community

Single-Client Parameter Server Training

[Quoted diff hunk: the same RFC text once more, this time running past the earlier hunks into the "Implementation" section. The portions not shown above, restored to their original formatting:]

The mitigation list under "The Unknown of Scheduled Functions":

*   keep the number of scheduled but not executed functions small. This may be difficult to achieve since there is no easy way for the client to know whether a function has been executed. The only way is to synchronize the executor. Therefore, as a workaround, we will have to periodically synchronize the executor to make sure functions are actually executed before the client schedules more functions. In the long run, we should get acknowledgement from the runtime about how many functions have been executed.
*   eagerly fetch the outputs of remote functions once the outputs are passed to `strategy.local_result`. In this way, we can know the status of function execution earlier.
*   recommend users schedule only small functions. Large functions are more expensive to retry.

The remainder of the sidecar evaluation loop:

```python
  while True:
    latest_checkpoint = get_new_checkpoint()
    checkpoint.restore(latest_checkpoint)
    eval_result = eval_fn(iterator)  # Users can print, early stop, mark ckpt..
```

In the evaluation client, the user loads the checkpoints that were periodically saved by the training client into the model, runs evaluation over a full pass of the eval dataset, and does whatever they want with the eval results, for example exporting them to files which the training client can read and act on (such as reducing the learning rate, early stopping, etc.).

## Implementation

### Low-level Primitives

We can potentially expose these primitives in the future when they are more stable and when we want to allow more advanced use cases.

We will have `Cluster` and `Worker` classes to encapsulate the logic related to remote function scheduling.

```python
class Cluster(object):

  def __init__(self, cluster_resolver, failure_handler=None):
    """Create the cluster instance and connect to the remote cluster."""
    pass

  @property
  def workers(self):
    """Return all available workers."""
    return self._workers

  def schedule(self, function, args=None, kwargs=None):
    """Schedule the function on one worker.

    It adds the function to the global scheduling queue and returns future
    objects immediately.
    """
    pass

  def join(self):
    """Block until all scheduled functions are complete."""
    pass
```

We will probably merge this `Worker` class with executors.

```python
class Worker(object):

  def __init__(self,
               worker_job_name,
               cluster,
               max_scheduled_functions=100):
    """Create a scheduling queue and a thread that processes the queue."""
    pass

  def schedule(self, function, args=None, kwargs=None):
    """Schedule the function on the worker.

    It adds the function to the scheduling queue and returns a `Future`
    object immediately, unless the scheduling queue is full.
    """
    pass

  def healthy(self):
    """Return a boolean indicating whether the worker is healthy."""
    pass

  def _set_dead(self):
    """Declare the worker dead and poison all its future objects."""
    pass

  def _rebuild_resources(self):
    """Rebuild worker-local resources when the worker recovers from failure."""
    pass
```

As mentioned, the return value of `schedule` will be `Future` objects if `function` has return values. The `Future` works as a container and will later be bound to either a success or a failure state. Overall, this `Future` class has the following benefits:

*   It allows the `schedule` method to return immediately after pushing the function onto its scheduling queue, without waiting for acknowledgement from workers.
*   It serves as the container for values or errors, to be bound to a value or an error later.
When it is rebuilt, we can replace its underlying value silently.+*   When being passed to `local_result`, we flag it to indicate that this value needs to be fetched eagerly.+*   (Future work) It captures the lineage between functions and return values so that we can rebuild any poisoned objects. ++```python+class Future(object):++  def __init__(self, closure):+    pass++  def _set_value(self, value):+    pass++  def _set_error(self, error):+    pass++  def _set_eagerly_fetch(self):+    pass+  +  def _fetch(self):+    pass+```++++We can potentially merge this `Future` class with our `Tensor` class.+

With this, it sounds to me the contract is "one would need to use local_results to deal with whatever they get from run() or schedule()", without specifying what type that return value is.

Better yet, if possible, we encourage users not to rely on the return value of schedule() at all. Many quantities involved in training, such as losses or metrics, can simply be variables that are updated within the remote function call; the user then reads those values from the client.
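To make that concrete, here is a minimal sketch of the pattern, assuming the `ParameterServerStrategyV2`, `schedule`, and `join` interfaces proposed in this RFC (not a released TensorFlow API); `create_model`, `compute_loss`, `cluster_resolver`, `distributed_dataset`, and `steps_per_epoch` are placeholders taken from the RFC's own example:

```python
import tensorflow as tf

# Sketch only: ParameterServerStrategyV2 / schedule / join are the interfaces
# proposed in this RFC, not an existing TensorFlow API.
strategy = ParameterServerStrategyV2(cluster_resolver)

with strategy.scope():
  model = create_model()
  optimizer = tf.keras.optimizers.Adam()
  # Loss and accuracy live in variables on the parameter servers and are
  # updated inside the scheduled function, so the client never needs the
  # return value of schedule().
  train_loss = tf.keras.metrics.Mean(name="train_loss")
  accuracy = tf.keras.metrics.CategoricalAccuracy(name="train_accuracy")

  @tf.function
  def replica_fn(iterator):
    x, y = next(iterator)
    with tf.GradientTape() as tape:
      predictions = model(x, training=True)
      loss = compute_loss(y, predictions)
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    train_loss.update_state(loss)
    accuracy.update_state(y, predictions)
    # No return value: everything the client needs is in metric variables.

  distributed_iter = iter(distributed_dataset)
  for _ in range(steps_per_epoch):
    strategy.schedule(replica_fn, args=(distributed_iter,))
  strategy.join()
  print("Loss = %f, accuracy = %f" % (train_loss.result(), accuracy.result()))
```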

yuefengz

comment created time in 2 months

Pull request review comment tensorflow/community

Single-Client Parameter Server Training

+In the evaluation client, the user loads the checkpoints that were periodically saved into the model (by the training client), does evaluation over a full pass of eval dataset, and does whatever they want to do with eval results. Examples include exporting them to files which can be read by the training client for actions (such as reducing learning rate, early stopping, etc.)

Yes - we have some code samples and will update the doc to include them.
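For illustration only, a hedged sketch of the training-client side of sidecar evaluation, using the standard `tf.train.Checkpoint` APIs; `model_dir` is a placeholder, and `model`, `optimizer`, `replica_fn`, `distributed_dataset`, `num_epoches`, and `steps_per_epoch` are assumed to be defined as in the RFC's training-loop example:

```python
import tensorflow as tf

# Training-client side: periodically save checkpoints for the sidecar
# evaluator to restore.
checkpoint = tf.train.Checkpoint(model=model, optimizer=optimizer)
manager = tf.train.CheckpointManager(checkpoint, directory=model_dir,
                                     max_to_keep=5)

for _ in range(num_epoches):
  distributed_iter = iter(distributed_dataset)
  for _ in range(steps_per_epoch):
    strategy.schedule(replica_fn, args=(distributed_iter,))
  strategy.join()
  # The evaluator polls tf.train.latest_checkpoint(model_dir) and restores
  # whatever checkpoint is newest at that time.
  manager.save()
```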

yuefengz

comment created time in 2 months

Pull request review comment tensorflow/community

Single-Client Parameter Server Training

+With our recommendation, users should create a separate evaluation client that runs the same python binary as the training client. This python binary will contain the if-else clause as it bifurcates into two paths:
+
+```python
+if cluster_resolver.task_type == "chief":
+  run_training_loop()
+elif cluster_resolver.task_type == "evaluator":
+  run_evaluation_loop()
+```

Yes, if we use an async executor for the evaluator's function scheduling (with a checkpoint file), a true single client (aka inline evaluation) should be feasible. One implication is that the evaluator would need to be considered part of the same cluster as the chief and workers. The current plan is to develop an API for sidecar evaluation and to provide an example of how inline evaluation can be done (possibly with some util if useful); which one to use is at the user's discretion.
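For reference, one possible shape of the inline-evaluation variant, sketched under the assumption that the proposed `schedule`/`join` API is used and that `replica_fn`, `eval_fn`, `eval_dataset`, `eval_accuracy`, `distributed_dataset`, `num_epoches`, and `steps_per_epoch` are defined as in the RFC's examples:

```python
# Inline evaluation sketch: the single client alternates between scheduling
# training steps and running a full evaluation pass, with no separate
# evaluator task.
for epoch in range(num_epoches):
  distributed_iter = iter(distributed_dataset)
  for _ in range(steps_per_epoch):
    strategy.schedule(replica_fn, args=(distributed_iter,))
  strategy.join()  # Wait for all training steps of this epoch to finish.

  eval_accuracy.reset_states()
  eval_result = eval_fn(eval_dataset)  # Full pass over the evaluation data.
  print("Epoch %d eval accuracy: %f" % (epoch, eval_result))
```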

yuefengz

comment created time in 2 months

Pull request review comment tensorflow/community

Single-Client Parameter Server Training

+##### Parameter server failure
+
+When a parameter server fails, the error will be propagated to the client via workers. Since the latest values of variables on the failed parameter servers are gone, there is no way for the client to recover them. Therefore the training will pause until the failed parameter server is back. The client then needs to clean up other variables on other parameter servers, rebuild all the variables and load variable values from a checkpoint. To trigger this process, the simplest method is to restart the client as well.

I believe that currently all parameter servers and workers (and the client) are considered part of the same cluster, so if any of the parameter servers fails, the whole cluster restarts (yuefengz@ to correct me if I'm wrong).
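
For reference, a minimal sketch of the client-side recovery path the text above describes (restart the client, rebuild the variables under `strategy.scope`, reload from a checkpoint). `ParameterServerStrategyV2`, `cluster_resolver`, `create_model`, and the checkpoint directory are placeholders from the RFC example, not a defined API:

```python
import tensorflow as tf

# Illustrative sketch only: after a parameter server failure, restart the
# client, rebuild the variables on the parameter servers, and reload their
# values from the latest checkpoint.
strategy = ParameterServerStrategyV2(cluster_resolver)  # placeholders from the RFC

with strategy.scope():
  model = create_model()  # variables re-created on the parameter servers
  optimizer = tf.keras.optimizers.Adam()

checkpoint = tf.train.Checkpoint(model=model, optimizer=optimizer)
manager = tf.train.CheckpointManager(checkpoint, directory="/tmp/ckpt",
                                     max_to_keep=3)
checkpoint.restore(manager.latest_checkpoint)  # no-op if nothing was saved yet
```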

yuefengz

comment created time in 2 months

Pull request review commenttensorflow/community

Single-Client Parameter Server Training

+# Single-client Parameter Server Training++| Status        | Proposed                                             |+:-------------- |:---------------------------------------------------- |+| **Author(s)** | Yuefeng Zhou (yuefengz@google.com), Rick Chao (rchao@google.com) |+| **Sponsor**   | Priya Gupta (priyag@google.com)                      |+| **Updated**   | 2018-03-06                                           |++## Background++Parameter server training is a very commonly used distributed training architecture. It is especially relevant to models with large embeddings, for training on large clusters of machines with CPUs, or when scalability is preferred over determinism. Its high-level idea is to create variables on parameter servers and in each step let workers take different training input, pull variable values from parameter servers, compute gradients and send them to parameter servers.+++### Distribution Strategy++Distribution Strategy (`tf.distribute.Strategy`) is a library that aims to allow users to write simple model code and scale up their models automatically with decent out-of-box performance. We will design parameter server training under the umbrella of `tf.distribute.Strategy` in TensorFlow 2 with a single-client architecture, in contrast to the multi-client approach which is traditionally done in TensorFlow distributed training such as `tf.estimator.Estimator` or `tf.distribute.experimental.MultiWorkerMirroredStrategy`.++Distribution Strategy’s [custom training loop](https://www.tensorflow.org/tutorials/distribute/custom_training) (CTL) API has been popular among users who want  more control in writing their training loops. User community of this API is large. We would like to focus on supporting CTL API first and later abstract out a commonly used pattern for Keras `compile`/`fit` API.+++### Single-Client Distributed Training++We believe that a single-client architecture can provide a simpler programming model than multi-client setup. A single source of truth can avoid bugs due to inconsistencies in multi-client setup. Furthermore, a single source of control can enable more determinism. In extreme cases, it can launch long-running tasks and turn into multi-client effectively.+++## Goal++The goal of this project is to support multi-worker asynchronous training with `ParameterServerStrategy` and CTL API, and in the long term also Keras `model.fit()`. In the first stage of this project, we focus more on design ideas rather than the APIs.++The goal of this document is to discuss the high-level design and challenges of various pieces needed to support single-client parameter server training. Detailed designs for some pieces may be directed to other documents.+++## Overview+++### Programming Model++With a single-client architecture, the programming model will be different than the multi-client architecture. All workers and parameter servers are standard TensorFlow servers, and the user-defined program will run on the client only. Generally, no matter what high-level APIs users will use, the workflow for running a step function distributedly with single-client approach includes the following steps:+++1. Connect to all remote workers and parameter servers. +2. Create variables on parameter servers and hold references to them.+3. Create datasets and iterators on workers.+4. Create the replica function that takes an iterator as input, trace it and register it on all workers.  Note: a function may create variables as well. 
If not specified, they will be created on parameter servers as well.+5. Dispatch the step function on one available worker.+6. Repeat 5 until the end of epoch.+7. Repeat 5 - 6 until the stop criteria is reached.+++### Interfaces++One of our goals is to make `ParameterServerStrategy`’s API consistent with other strategies so that users can easily switch between them. This may be challenging due to the fundamental difference between synchronous and asynchronous training. Therefore, we try to use most of the Distribution Strategy’ APIs but occasionally we will create APIs that still make sense in other strategies as well.++**Note: all proposed APIs in this document are tentative and only valid for our first version. We will revisit them once we get enough feedback from users and the community.**+++#### Constraints++Function is first-class citizen. Users should only schedule functions instead of running individual ops, in addition to creating variables.++Users can occasionally run individual ops on the client, only for reporting purposes such as printing a metric’s value.+++#### Schedule/Join Primitives++The `strategy.run` API was initially developed for synchronous training. We propose a new pair of primitives to+++*   hide the details of load-balancing, fault tolerance and dynamic scheduling+*   expose the non-blocking semantics to users.++To enable scheduling a function on any worker, we recommend users create the same dataset, but may shuffle differently, on all workers via `strategy.experimental_distribute_datasets_from_function` API.+++```python+class ParameterServerStrategyV2:+  +  def schedule(self, replica_fn, args=(), kwargs=()):+    """Schedule the `replica_fn` on all replicas in a sync group (a worker).+   +    Schedule the `replica_fn` on all replicas in a sync group that is available,    +    returns a future of PerReplica immediately if `function` has return values.++    It implements at-least-once semantics for function execution. If a worker+    fails, it will try to reschedule the function on another replica group or throw     +    an exception to users. So this method assumes that function execution can be +    out of order and function inputs are shared between sync groups. ++    We don't support the cases where `args` or `kwargs` are bound to a specific+    sync group. We will consider supporting them in the future.++    If there are barriers in `replica_fn`, it is users' responsibility to make+    sure they won't cause deadlock.+    """+    pass+  +  def join(self):+    """Wait until all scheduled functions are finished.++    Raises an error if any of the functions fails to execute. In this case,+    there is no guarantee that non-failing functions will complete.++    When join() is being called, it is not allowed to call `schedule`.+    """+    pass++  def done(self):+    """Returns True if there is no pending functions to be executed."""+    pass++  def local_results(self, future_list):+    """Get concrete values of the future list.+    +    Poisoned future objects will give `None`.+    """+    pass+```+++#### Custom Training Loop++To construct a custom training loop, users need to+++*   use `strategy.experimental_distribute_datasets_from_function` to create one dataset per worker. At this point, we recommend against using `strategy.experimental_distribute_dataset`.+*   create models under `strategy.scope` so variables will be assigned to parameter servers.+*   likewise, create a Keras metric object under `strategy.scope`. 
We expect the metric variables to be stored on parameter servers. Each worker, within their `replica_fn`, updates the metric states.+*   use `strategy.schedule` to schedule the `replica_fn` on one remote worker. This `replica_fn` should take an iterator and perform forward and backward computation. This `strategy.schedule` will only schedule this `replica_fn` and returns one or several `Future` objects immediately.  +*   use `strategy.local_results` to get concrete values of results returned by `strategy.schedule`. This may be a blocking call if the result is not yet ready. With any failure that cannot be handled will be ignored and as a result some of the results may be `None`.+*   call `strategy.join` to wait until all scheduled functions are executed.++```python+# Connect to remote servers with a user-provided `ClusterResolver` object.+strategy = ParameterServerStrategyV2(cluster_resolver)++dataset_fn = # a function that returns a dataset++# Clone the dataset on all workers, shuffled with different seeds.+distributed_dataset = strategy.experimental_distribute_datasets_from_function(+      dataset_fn)++with strategy.scope():+  # Create variables on parameter servers in a round-robin fashion.+  model = create_model()+  optimizer = tf.keras.optimizers.Adam()+  accuracy = tf.keras.metrics.CategoricalAccuracy(name="train_accuracy")++  @tf.function+  def replica_fn(iterator):+    x, y = next(iterator)+    with tf.GradientTape() as tape:+      predictions = model(x, table, training=True)+      loss = compute_loss(y, predictions)+    gradients = tape.gradient(loss, model.trainable_variables)+    optimizer.apply_gradients(zip(gradients, model.trainable_variables))+    accuracy.update_state(y, predictions)+    return loss++  for _ in range(num_epoches):+    distributed_iter = iter(distributed_dataset)+    for i in range(steps_per_epoch):+      # strategy.schedule pushes a closure in the scheduling queue and+      # returns a list of future objects immediately.+      loss = strategy.schedule(replica_fn,+                               args=(distributed_iter,))+    strategy.join()+    model.save()  # save checkpoint/summary...+    print ("Loss = %f, accuracy = %f" % (+        strategy.local_results(loss), accuracy.result()))+```+++##### Alternative training loop: fully async++Another option from calling `join` every epoch, users can choose to schedule all steps and then asynchronously print metric values. This option doesn’t require any synchronization in epoch boundaries.+++```python+# … omitted+with strategy.scope():+  # … omitted+  for _ in range(total_steps)):+    strategy.schedule(step_fn, args=(iterators,))++  # Print accuracy value every one minute.+  while not strategy.done():+    print("Current accuracy: %f" % accuracy.result())+    time.sleep(60)+# … omitted+```++#### Error Reporting From `replica_fn`++Because of the non-blocking `schedule`, any exception raised in `replica_fn` wouldn’t be returned to users immediately. Actually an exception may pollute arbitrary number of functions in flight following the culprit function. We will set the error in returned `Future` objects for the culprit function and these polluted functions and we will raise exceptions when `join` is called.++Therefore the best practice for users is to avoid writing any code that may raise in `replica_fn`:++*   use repeated dataset so `OutOfRangeError` will be avoided;+*   avoid using assertion ops or some debugging ops like `tf.debugging.check_numerics`.

I think the best solution is to be able to transport the error message/stack trace from the worker threads to the main thread (or an error-handling thread), report it to the user, and stop the program.
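
Something along these lines (just a sketch of the kind of mechanism I mean, not anything the RFC defines): each worker thread pushes any exception it catches onto a shared queue, and the main thread drains that queue and re-raises.

```python
import queue
import traceback

error_queue = queue.Queue()  # shared between the worker threads and the main thread

def worker_loop(closure_queue):
  """Runs scheduled closures; forwards any exception to the main thread."""
  while True:
    closure = closure_queue.get()
    if closure is None:  # sentinel value: shut the worker thread down
      return
    try:
      closure()
    except Exception as e:  # pylint: disable=broad-except
      error_queue.put((e, traceback.format_exc()))

def maybe_raise_worker_error():
  """Called from the main thread (e.g. inside join()) to surface worker errors."""
  try:
    error, tb = error_queue.get_nowait()
  except queue.Empty:
    return
  print(tb)    # report the worker-side stack trace to the user
  raise error  # then stop the program by re-raising in the main thread
```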

yuefengz

comment created time in 2 months

Pull request review commenttensorflow/community

Single-Client Parameter Server Training

+# Single-client Parameter Server Training++| Status        | Proposed                                             |+:-------------- |:---------------------------------------------------- |+| **Author(s)** | Yuefeng Zhou (yuefengz@google.com), Rick Chao (rchao@google.com) |+| **Sponsor**   | Priya Gupta (priyag@google.com)                      |+| **Updated**   | 2018-03-06                                           |++## Background++Parameter server training is a very commonly used distributed training architecture. It is especially relevant to models with large embeddings, for training on large clusters of machines with CPUs, or when scalability is preferred over determinism. Its high-level idea is to create variables on parameter servers and in each step let workers take different training input, pull variable values from parameter servers, compute gradients and send them to parameter servers.+++### Distribution Strategy++Distribution Strategy (`tf.distribute.Strategy`) is a library that aims to allow users to write simple model code and scale up their models automatically with decent out-of-box performance. We will design parameter server training under the umbrella of `tf.distribute.Strategy` in TensorFlow 2 with a single-client architecture, in contrast to the multi-client approach which is traditionally done in TensorFlow distributed training such as `tf.estimator.Estimator` or `tf.distribute.experimental.MultiWorkerMirroredStrategy`.++Distribution Strategy’s [custom training loop](https://www.tensorflow.org/tutorials/distribute/custom_training) (CTL) API has been popular among users who want  more control in writing their training loops. User community of this API is large. We would like to focus on supporting CTL API first and later abstract out a commonly used pattern for Keras `compile`/`fit` API.+++### Single-Client Distributed Training++We believe that a single-client architecture can provide a simpler programming model than multi-client setup. A single source of truth can avoid bugs due to inconsistencies in multi-client setup. Furthermore, a single source of control can enable more determinism. In extreme cases, it can launch long-running tasks and turn into multi-client effectively.+++## Goal++The goal of this project is to support multi-worker asynchronous training with `ParameterServerStrategy` and CTL API, and in the long term also Keras `model.fit()`. In the first stage of this project, we focus more on design ideas rather than the APIs.++The goal of this document is to discuss the high-level design and challenges of various pieces needed to support single-client parameter server training. Detailed designs for some pieces may be directed to other documents.+++## Overview+++### Programming Model++With a single-client architecture, the programming model will be different than the multi-client architecture. All workers and parameter servers are standard TensorFlow servers, and the user-defined program will run on the client only. Generally, no matter what high-level APIs users will use, the workflow for running a step function distributedly with single-client approach includes the following steps:+++1. Connect to all remote workers and parameter servers. +2. Create variables on parameter servers and hold references to them.+3. Create datasets and iterators on workers.+4. Create the replica function that takes an iterator as input, trace it and register it on all workers.  Note: a function may create variables as well. 
If not specified, they will be created on parameter servers as well.+5. Dispatch the step function on one available worker.+6. Repeat 5 until the end of epoch.+7. Repeat 5 - 6 until the stop criteria is reached.+++### Interfaces++One of our goals is to make `ParameterServerStrategy`’s API consistent with other strategies so that users can easily switch between them. This may be challenging due to the fundamental difference between synchronous and asynchronous training. Therefore, we try to use most of the Distribution Strategy’ APIs but occasionally we will create APIs that still make sense in other strategies as well.++**Note: all proposed APIs in this document are tentative and only valid for our first version. We will revisit them once we get enough feedback from users and the community.**+++#### Constraints++Function is first-class citizen. Users should only schedule functions instead of running individual ops, in addition to creating variables.++Users can occasionally run individual ops on the client, only for reporting purposes such as printing a metric’s value.+++#### Schedule/Join Primitives++The `strategy.run` API was initially developed for synchronous training. We propose a new pair of primitives to+++*   hide the details of load-balancing, fault tolerance and dynamic scheduling+*   expose the non-blocking semantics to users.++To enable scheduling a function on any worker, we recommend users create the same dataset, but may shuffle differently, on all workers via `strategy.experimental_distribute_datasets_from_function` API.+++```python+class ParameterServerStrategyV2:+  +  def schedule(self, replica_fn, args=(), kwargs=()):+    """Schedule the `replica_fn` on all replicas in a sync group (a worker).+   +    Schedule the `replica_fn` on all replicas in a sync group that is available,    +    returns a future of PerReplica immediately if `function` has return values.++    It implements at-least-once semantics for function execution. If a worker+    fails, it will try to reschedule the function on another replica group or throw     +    an exception to users. So this method assumes that function execution can be +    out of order and function inputs are shared between sync groups. ++    We don't support the cases where `args` or `kwargs` are bound to a specific+    sync group. We will consider supporting them in the future.++    If there are barriers in `replica_fn`, it is users' responsibility to make+    sure they won't cause deadlock.+    """+    pass+  +  def join(self):+    """Wait until all scheduled functions are finished.++    Raises an error if any of the functions fails to execute. In this case,+    there is no guarantee that non-failing functions will complete.++    When join() is being called, it is not allowed to call `schedule`.+    """+    pass++  def done(self):+    """Returns True if there is no pending functions to be executed."""+    pass++  def local_results(self, future_list):+    """Get concrete values of the future list.+    +    Poisoned future objects will give `None`.+    """+    pass+```+++#### Custom Training Loop++To construct a custom training loop, users need to+++*   use `strategy.experimental_distribute_datasets_from_function` to create one dataset per worker. At this point, we recommend against using `strategy.experimental_distribute_dataset`.+*   create models under `strategy.scope` so variables will be assigned to parameter servers.+*   likewise, create a Keras metric object under `strategy.scope`. 
We expect the metric variables to be stored on parameter servers. Each worker, within their `replica_fn`, updates the metric states.+*   use `strategy.schedule` to schedule the `replica_fn` on one remote worker. This `replica_fn` should take an iterator and perform forward and backward computation. This `strategy.schedule` will only schedule this `replica_fn` and returns one or several `Future` objects immediately.  +*   use `strategy.local_results` to get concrete values of results returned by `strategy.schedule`. This may be a blocking call if the result is not yet ready. With any failure that cannot be handled will be ignored and as a result some of the results may be `None`.+*   call `strategy.join` to wait until all scheduled functions are executed.++```python+# Connect to remote servers with a user-provided `ClusterResolver` object.+strategy = ParameterServerStrategyV2(cluster_resolver)++dataset_fn = # a function that returns a dataset++# Clone the dataset on all workers, shuffled with different seeds.+distributed_dataset = strategy.experimental_distribute_datasets_from_function(+      dataset_fn)++with strategy.scope():+  # Create variables on parameter servers in a round-robin fashion.+  model = create_model()+  optimizer = tf.keras.optimizers.Adam()+  accuracy = tf.keras.metrics.CategoricalAccuracy(name="train_accuracy")++  @tf.function+  def replica_fn(iterator):+    x, y = next(iterator)+    with tf.GradientTape() as tape:+      predictions = model(x, table, training=True)+      loss = compute_loss(y, predictions)+    gradients = tape.gradient(loss, model.trainable_variables)+    optimizer.apply_gradients(zip(gradients, model.trainable_variables))+    accuracy.update_state(y, predictions)+    return loss++  for _ in range(num_epoches):+    distributed_iter = iter(distributed_dataset)+    for i in range(steps_per_epoch):+      # strategy.schedule pushes a closure in the scheduling queue and+      # returns a list of future objects immediately.+      loss = strategy.schedule(replica_fn,+                               args=(distributed_iter,))+    strategy.join()+    model.save()  # save checkpoint/summary...+    print ("Loss = %f, accuracy = %f" % (+        strategy.local_results(loss), accuracy.result()))+```+++##### Alternative training loop: fully async++Another option from calling `join` every epoch, users can choose to schedule all steps and then asynchronously print metric values. This option doesn’t require any synchronization in epoch boundaries.+++```python+# … omitted+with strategy.scope():+  # … omitted+  for _ in range(total_steps)):+    strategy.schedule(step_fn, args=(iterators,))++  # Print accuracy value every one minute.+  while not strategy.done():+    print("Current accuracy: %f" % accuracy.result())+    time.sleep(60)+# … omitted+```++#### Error Reporting From `replica_fn`++Because of the non-blocking `schedule`, any exception raised in `replica_fn` wouldn’t be returned to users immediately. Actually an exception may pollute arbitrary number of functions in flight following the culprit function. We will set the error in returned `Future` objects for the culprit function and these polluted functions and we will raise exceptions when `join` is called.++Therefore the best practice for users is to avoid writing any code that may raise in `replica_fn`:

I think what this means is more like: "If you raise an error within `replica_fn`, you won't be able to handle it by wrapping `schedule()` in try-except, so the best thing to do is not to raise an error at all." I understand this is somewhat limiting, and I think it can be solved by having a mechanism for transporting error messages from the worker threads to the client's main thread.
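
To make that concrete, here is roughly how I'd expect this to behave for the user under the proposed semantics (hypothetical; `strategy`, `replica_fn`, and `distributed_iter` are from the RFC example, and the RFC doesn't pin down the exception types):

```python
import tensorflow as tf

try:
  # schedule() is non-blocking and returns before replica_fn has run, so an
  # exception raised inside replica_fn cannot surface here.
  loss = strategy.schedule(replica_fn, args=(distributed_iter,))
except Exception:  # pylint: disable=broad-except
  pass  # effectively unreachable for errors raised inside replica_fn

try:
  # The error (e.g. OutOfRangeError from an exhausted, non-repeated dataset)
  # only surfaces once the client synchronizes with the workers.
  strategy.join()
except tf.errors.OutOfRangeError:
  pass  # handle it here, e.g. re-create the iterator or stop training
```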

yuefengz

comment created time in 2 months

Pull request review commenttensorflow/community

Single-Client Parameter Server Training

+# Single-client Parameter Server Training++| Status        | Proposed                                             |+:-------------- |:---------------------------------------------------- |+| **Author(s)** | Yuefeng Zhou (yuefengz@google.com), Rick Chao (rchao@google.com) |+| **Sponsor**   | Priya Gupta (priyag@google.com)                      |+| **Updated**   | 2018-03-06                                           |++## Background++Parameter server training is a very commonly used distributed training architecture. It is especially relevant to models with large embeddings, for training on large clusters of machines with CPUs, or when scalability is preferred over determinism. Its high-level idea is to create variables on parameter servers and in each step let workers take different training input, pull variable values from parameter servers, compute gradients and send them to parameter servers.+++### Distribution Strategy++Distribution Strategy (`tf.distribute.Strategy`) is a library that aims to allow users to write simple model code and scale up their models automatically with decent out-of-box performance. We will design parameter server training under the umbrella of `tf.distribute.Strategy` in TensorFlow 2 with a single-client architecture, in contrast to the multi-client approach which is traditionally done in TensorFlow distributed training such as `tf.estimator.Estimator` or `tf.distribute.experimental.MultiWorkerMirroredStrategy`.++Distribution Strategy’s [custom training loop](https://www.tensorflow.org/tutorials/distribute/custom_training) (CTL) API has been popular among users who want  more control in writing their training loops. User community of this API is large. We would like to focus on supporting CTL API first and later abstract out a commonly used pattern for Keras `compile`/`fit` API.+++### Single-Client Distributed Training++We believe that a single-client architecture can provide a simpler programming model than multi-client setup. A single source of truth can avoid bugs due to inconsistencies in multi-client setup. Furthermore, a single source of control can enable more determinism. In extreme cases, it can launch long-running tasks and turn into multi-client effectively.+++## Goal++The goal of this project is to support multi-worker asynchronous training with `ParameterServerStrategy` and CTL API, and in the long term also Keras `model.fit()`. In the first stage of this project, we focus more on design ideas rather than the APIs.++The goal of this document is to discuss the high-level design and challenges of various pieces needed to support single-client parameter server training. Detailed designs for some pieces may be directed to other documents.+++## Overview+++### Programming Model++With a single-client architecture, the programming model will be different than the multi-client architecture. All workers and parameter servers are standard TensorFlow servers, and the user-defined program will run on the client only. Generally, no matter what high-level APIs users will use, the workflow for running a step function distributedly with single-client approach includes the following steps:+++1. Connect to all remote workers and parameter servers. +2. Create variables on parameter servers and hold references to them.+3. Create datasets and iterators on workers.+4. Create the replica function that takes an iterator as input, trace it and register it on all workers.  Note: a function may create variables as well. 
If not specified, they will be created on parameter servers as well.+5. Dispatch the step function on one available worker.+6. Repeat 5 until the end of epoch.+7. Repeat 5 - 6 until the stop criteria is reached.+++### Interfaces++One of our goals is to make `ParameterServerStrategy`’s API consistent with other strategies so that users can easily switch between them. This may be challenging due to the fundamental difference between synchronous and asynchronous training. Therefore, we try to use most of the Distribution Strategy’ APIs but occasionally we will create APIs that still make sense in other strategies as well.++**Note: all proposed APIs in this document are tentative and only valid for our first version. We will revisit them once we get enough feedback from users and the community.**+++#### Constraints++Function is first-class citizen. Users should only schedule functions instead of running individual ops, in addition to creating variables.++Users can occasionally run individual ops on the client, only for reporting purposes such as printing a metric’s value.+++#### Schedule/Join Primitives++The `strategy.run` API was initially developed for synchronous training. We propose a new pair of primitives to+++*   hide the details of load-balancing, fault tolerance and dynamic scheduling+*   expose the non-blocking semantics to users.++To enable scheduling a function on any worker, we recommend users create the same dataset, but may shuffle differently, on all workers via `strategy.experimental_distribute_datasets_from_function` API.+++```python+class ParameterServerStrategyV2:+  +  def schedule(self, replica_fn, args=(), kwargs=()):+    """Schedule the `replica_fn` on all replicas in a sync group (a worker).+   +    Schedule the `replica_fn` on all replicas in a sync group that is available,    +    returns a future of PerReplica immediately if `function` has return values.++    It implements at-least-once semantics for function execution. If a worker+    fails, it will try to reschedule the function on another replica group or throw     +    an exception to users. So this method assumes that function execution can be +    out of order and function inputs are shared between sync groups. ++    We don't support the cases where `args` or `kwargs` are bound to a specific+    sync group. We will consider supporting them in the future.++    If there are barriers in `replica_fn`, it is users' responsibility to make+    sure they won't cause deadlock.+    """+    pass+  +  def join(self):+    """Wait until all scheduled functions are finished.++    Raises an error if any of the functions fails to execute. In this case,+    there is no guarantee that non-failing functions will complete.++    When join() is being called, it is not allowed to call `schedule`.+    """+    pass++  def done(self):+    """Returns True if there is no pending functions to be executed."""+    pass++  def local_results(self, future_list):+    """Get concrete values of the future list.+    +    Poisoned future objects will give `None`.+    """+    pass+```+++#### Custom Training Loop++To construct a custom training loop, users need to+++*   use `strategy.experimental_distribute_datasets_from_function` to create one dataset per worker. At this point, we recommend against using `strategy.experimental_distribute_dataset`.+*   create models under `strategy.scope` so variables will be assigned to parameter servers.+*   likewise, create a Keras metric object under `strategy.scope`. 
We expect the metric variables to be stored on parameter servers. Each worker, within their `replica_fn`, updates the metric states.+*   use `strategy.schedule` to schedule the `replica_fn` on one remote worker. This `replica_fn` should take an iterator and perform forward and backward computation. This `strategy.schedule` will only schedule this `replica_fn` and returns one or several `Future` objects immediately.  +*   use `strategy.local_results` to get concrete values of results returned by `strategy.schedule`. This may be a blocking call if the result is not yet ready. With any failure that cannot be handled will be ignored and as a result some of the results may be `None`.+*   call `strategy.join` to wait until all scheduled functions are executed.++```python+# Connect to remote servers with a user-provided `ClusterResolver` object.+strategy = ParameterServerStrategyV2(cluster_resolver)++dataset_fn = # a function that returns a dataset++# Clone the dataset on all workers, shuffled with different seeds.+distributed_dataset = strategy.experimental_distribute_datasets_from_function(+      dataset_fn)++with strategy.scope():+  # Create variables on parameter servers in a round-robin fashion.+  model = create_model()+  optimizer = tf.keras.optimizers.Adam()+  accuracy = tf.keras.metrics.CategoricalAccuracy(name="train_accuracy")++  @tf.function+  def replica_fn(iterator):+    x, y = next(iterator)+    with tf.GradientTape() as tape:+      predictions = model(x, table, training=True)+      loss = compute_loss(y, predictions)+    gradients = tape.gradient(loss, model.trainable_variables)+    optimizer.apply_gradients(zip(gradients, model.trainable_variables))+    accuracy.update_state(y, predictions)+    return loss++  for _ in range(num_epoches):+    distributed_iter = iter(distributed_dataset)+    for i in range(steps_per_epoch):+      # strategy.schedule pushes a closure in the scheduling queue and+      # returns a list of future objects immediately.+      loss = strategy.schedule(replica_fn,+                               args=(distributed_iter,))+    strategy.join()+    model.save()  # save checkpoint/summary...+    print ("Loss = %f, accuracy = %f" % (+        strategy.local_results(loss), accuracy.result()))+```+++##### Alternative training loop: fully async++Another option from calling `join` every epoch, users can choose to schedule all steps and then asynchronously print metric values. This option doesn’t require any synchronization in epoch boundaries.+++```python+# … omitted+with strategy.scope():+  # … omitted+  for _ in range(total_steps)):+    strategy.schedule(step_fn, args=(iterators,))++  # Print accuracy value every one minute.+  while not strategy.done():+    print("Current accuracy: %f" % accuracy.result())

My understanding is that, by design, parameter server training guarantees no consistency: values can be read from the parameter servers at any moment. Can you give an example of how there can be an invalid state? My view is that one party might be reading while others are writing, and there is no guarantee whether the value read reflects the state before or after the write.

yuefengz

comment created time in 2 months

Pull request review commenttensorflow/community

Single-Client Parameter Server Training

+# Single-client Parameter Server Training++| Status        | Proposed                                             |+:-------------- |:---------------------------------------------------- |+| **Author(s)** | Yuefeng Zhou (yuefengz@google.com), Rick Chao (rchao@google.com) |+| **Sponsor**   | Priya Gupta (priyag@google.com)                      |+| **Updated**   | 2018-03-06                                           |++## Background++Parameter server training is a very commonly used distributed training architecture. It is especially relevant to models with large embeddings, for training on large clusters of machines with CPUs, or when scalability is preferred over determinism. Its high-level idea is to create variables on parameter servers and in each step let workers take different training input, pull variable values from parameter servers, compute gradients and send them to parameter servers.+++### Distribution Strategy++Distribution Strategy (`tf.distribute.Strategy`) is a library that aims to allow users to write simple model code and scale up their models automatically with decent out-of-box performance. We will design parameter server training under the umbrella of `tf.distribute.Strategy` in TensorFlow 2 with a single-client architecture, in contrast to the multi-client approach which is traditionally done in TensorFlow distributed training such as `tf.estimator.Estimator` or `tf.distribute.experimental.MultiWorkerMirroredStrategy`.++Distribution Strategy’s [custom training loop](https://www.tensorflow.org/tutorials/distribute/custom_training) (CTL) API has been popular among users who want  more control in writing their training loops. User community of this API is large. We would like to focus on supporting CTL API first and later abstract out a commonly used pattern for Keras `compile`/`fit` API.+++### Single-Client Distributed Training++We believe that a single-client architecture can provide a simpler programming model than multi-client setup. A single source of truth can avoid bugs due to inconsistencies in multi-client setup. Furthermore, a single source of control can enable more determinism. In extreme cases, it can launch long-running tasks and turn into multi-client effectively.+++## Goal++The goal of this project is to support multi-worker asynchronous training with `ParameterServerStrategy` and CTL API, and in the long term also Keras `model.fit()`. In the first stage of this project, we focus more on design ideas rather than the APIs.++The goal of this document is to discuss the high-level design and challenges of various pieces needed to support single-client parameter server training. Detailed designs for some pieces may be directed to other documents.+++## Overview+++### Programming Model++With a single-client architecture, the programming model will be different than the multi-client architecture. All workers and parameter servers are standard TensorFlow servers, and the user-defined program will run on the client only. Generally, no matter what high-level APIs users will use, the workflow for running a step function distributedly with single-client approach includes the following steps:+++1. Connect to all remote workers and parameter servers. +2. Create variables on parameter servers and hold references to them.+3. Create datasets and iterators on workers.+4. Create the replica function that takes an iterator as input, trace it and register it on all workers.  Note: a function may create variables as well. 
If not specified, they will be created on parameter servers as well.+5. Dispatch the step function on one available worker.+6. Repeat 5 until the end of epoch.+7. Repeat 5 - 6 until the stop criteria is reached.+++### Interfaces++One of our goals is to make `ParameterServerStrategy`’s API consistent with other strategies so that users can easily switch between them. This may be challenging due to the fundamental difference between synchronous and asynchronous training. Therefore, we try to use most of the Distribution Strategy’ APIs but occasionally we will create APIs that still make sense in other strategies as well.++**Note: all proposed APIs in this document are tentative and only valid for our first version. We will revisit them once we get enough feedback from users and the community.**+++#### Constraints++Function is first-class citizen. Users should only schedule functions instead of running individual ops, in addition to creating variables.++Users can occasionally run individual ops on the client, only for reporting purposes such as printing a metric’s value.+++#### Schedule/Join Primitives++The `strategy.run` API was initially developed for synchronous training. We propose a new pair of primitives to+++*   hide the details of load-balancing, fault tolerance and dynamic scheduling+*   expose the non-blocking semantics to users.++To enable scheduling a function on any worker, we recommend users create the same dataset, but may shuffle differently, on all workers via `strategy.experimental_distribute_datasets_from_function` API.+++```python+class ParameterServerStrategyV2:+  +  def schedule(self, replica_fn, args=(), kwargs=()):+    """Schedule the `replica_fn` on all replicas in a sync group (a worker).+   +    Schedule the `replica_fn` on all replicas in a sync group that is available,    +    returns a future of PerReplica immediately if `function` has return values.++    It implements at-least-once semantics for function execution. If a worker+    fails, it will try to reschedule the function on another replica group or throw     +    an exception to users. So this method assumes that function execution can be +    out of order and function inputs are shared between sync groups. ++    We don't support the cases where `args` or `kwargs` are bound to a specific+    sync group. We will consider supporting them in the future.++    If there are barriers in `replica_fn`, it is users' responsibility to make+    sure they won't cause deadlock.+    """+    pass+  +  def join(self):+    """Wait until all scheduled functions are finished.++    Raises an error if any of the functions fails to execute. In this case,+    there is no guarantee that non-failing functions will complete.++    When join() is being called, it is not allowed to call `schedule`.+    """+    pass++  def done(self):+    """Returns True if there is no pending functions to be executed."""+    pass++  def local_results(self, future_list):+    """Get concrete values of the future list.+    +    Poisoned future objects will give `None`.+    """+    pass+```+++#### Custom Training Loop++To construct a custom training loop, users need to+++*   use `strategy.experimental_distribute_datasets_from_function` to create one dataset per worker. At this point, we recommend against using `strategy.experimental_distribute_dataset`.+*   create models under `strategy.scope` so variables will be assigned to parameter servers.+*   likewise, create a Keras metric object under `strategy.scope`. 
We expect the metric variables to be stored on parameter servers. Each worker, within their `replica_fn`, updates the metric states.+*   use `strategy.schedule` to schedule the `replica_fn` on one remote worker. This `replica_fn` should take an iterator and perform forward and backward computation. This `strategy.schedule` will only schedule this `replica_fn` and returns one or several `Future` objects immediately.  

A Future object possesses a field that may (once the function has returned a result) or may not (when the function is only scheduled but not yet executed) hold a concrete value. It's supposed to be used in conjunction with `local_results`, which fetches the concrete value of a Future.
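
In other words, something shaped roughly like this (a bare-bones illustration, not the actual implementation; the method names here are made up):

```python
import threading

class Future(object):
  """Illustrative placeholder for a value produced by a scheduled function."""

  def __init__(self):
    self._value = None
    self._error = None
    self._done = threading.Event()

  def _set_result(self, value):  # called by the worker thread on success
    self._value = value
    self._done.set()

  def _set_error(self, error):  # called when the function (or its worker) fails
    self._error = error
    self._done.set()

  def fetch(self):
    """What local_results() would do per Future: block until the function has
    run, then return the concrete value, or None if the Future was poisoned."""
    self._done.wait()
    return None if self._error is not None else self._value
```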

yuefengz

comment created time in 2 months

Pull request review commenttensorflow/community

Single-Client Parameter Server Training

+# Single-client Parameter Server Training++| Status        | Proposed                                             |+:-------------- |:---------------------------------------------------- |+| **Author(s)** | Yuefeng Zhou (yuefengz@google.com), Rick Chao (rchao@google.com) |+| **Sponsor**   | Priya Gupta (priyag@google.com)                      |+| **Updated**   | 2018-03-06                                           |++## Background++Parameter server training is a very commonly used distributed training architecture. It is especially relevant to models with large embeddings, for training on large clusters of machines with CPUs, or when scalability is preferred over determinism. Its high-level idea is to create variables on parameter servers and in each step let workers take different training input, pull variable values from parameter servers, compute gradients and send them to parameter servers.+++### Distribution Strategy++Distribution Strategy (`tf.distribute.Strategy`) is a library that aims to allow users to write simple model code and scale up their models automatically with decent out-of-box performance. We will design parameter server training under the umbrella of `tf.distribute.Strategy` in TensorFlow 2 with a single-client architecture, in contrast to the multi-client approach which is traditionally done in TensorFlow distributed training such as `tf.estimator.Estimator` or `tf.distribute.experimental.MultiWorkerMirroredStrategy`.++Distribution Strategy’s [custom training loop](https://www.tensorflow.org/tutorials/distribute/custom_training) (CTL) API has been popular among users who want  more control in writing their training loops. User community of this API is large. We would like to focus on supporting CTL API first and later abstract out a commonly used pattern for Keras `compile`/`fit` API.+++### Single-Client Distributed Training++We believe that a single-client architecture can provide a simpler programming model than multi-client setup. A single source of truth can avoid bugs due to inconsistencies in multi-client setup. Furthermore, a single source of control can enable more determinism. In extreme cases, it can launch long-running tasks and turn into multi-client effectively.+++## Goal++The goal of this project is to support multi-worker asynchronous training with `ParameterServerStrategy` and CTL API, and in the long term also Keras `model.fit()`. In the first stage of this project, we focus more on design ideas rather than the APIs.++The goal of this document is to discuss the high-level design and challenges of various pieces needed to support single-client parameter server training. Detailed designs for some pieces may be directed to other documents.+++## Overview+++### Programming Model++With a single-client architecture, the programming model will be different than the multi-client architecture. All workers and parameter servers are standard TensorFlow servers, and the user-defined program will run on the client only. Generally, no matter what high-level APIs users will use, the workflow for running a step function distributedly with single-client approach includes the following steps:+++1. Connect to all remote workers and parameter servers. +2. Create variables on parameter servers and hold references to them.+3. Create datasets and iterators on workers.+4. Create the replica function that takes an iterator as input, trace it and register it on all workers.  Note: a function may create variables as well. 
If not specified, they will be created on parameter servers as well.+5. Dispatch the step function on one available worker.+6. Repeat 5 until the end of epoch.+7. Repeat 5 - 6 until the stop criteria is reached.+++### Interfaces++One of our goals is to make `ParameterServerStrategy`’s API consistent with other strategies so that users can easily switch between them. This may be challenging due to the fundamental difference between synchronous and asynchronous training. Therefore, we try to use most of the Distribution Strategy’ APIs but occasionally we will create APIs that still make sense in other strategies as well.++**Note: all proposed APIs in this document are tentative and only valid for our first version. We will revisit them once we get enough feedback from users and the community.**+++#### Constraints++Function is first-class citizen. Users should only schedule functions instead of running individual ops, in addition to creating variables.++Users can occasionally run individual ops on the client, only for reporting purposes such as printing a metric’s value.+++#### Schedule/Join Primitives++The `strategy.run` API was initially developed for synchronous training. We propose a new pair of primitives to+++*   hide the details of load-balancing, fault tolerance and dynamic scheduling+*   expose the non-blocking semantics to users.++To enable scheduling a function on any worker, we recommend users create the same dataset, but may shuffle differently, on all workers via `strategy.experimental_distribute_datasets_from_function` API.+++```python+class ParameterServerStrategyV2:+  +  def schedule(self, replica_fn, args=(), kwargs=()):+    """Schedule the `replica_fn` on all replicas in a sync group (a worker).+   +    Schedule the `replica_fn` on all replicas in a sync group that is available,    +    returns a future of PerReplica immediately if `function` has return values.++    It implements at-least-once semantics for function execution. If a worker+    fails, it will try to reschedule the function on another replica group or throw     +    an exception to users. So this method assumes that function execution can be +    out of order and function inputs are shared between sync groups. ++    We don't support the cases where `args` or `kwargs` are bound to a specific+    sync group. We will consider supporting them in the future.++    If there are barriers in `replica_fn`, it is users' responsibility to make+    sure they won't cause deadlock.+    """+    pass+  +  def join(self):+    """Wait until all scheduled functions are finished.++    Raises an error if any of the functions fails to execute. In this case,+    there is no guarantee that non-failing functions will complete.++    When join() is being called, it is not allowed to call `schedule`.+    """+    pass++  def done(self):+    """Returns True if there is no pending functions to be executed."""+    pass++  def local_results(self, future_list):+    """Get concrete values of the future list.+    +    Poisoned future objects will give `None`.+    """+    pass+```+++#### Custom Training Loop++To construct a custom training loop, users need to+++*   use `strategy.experimental_distribute_datasets_from_function` to create one dataset per worker. At this point, we recommend against using `strategy.experimental_distribute_dataset`.+*   create models under `strategy.scope` so variables will be assigned to parameter servers.

Yes. I think this statement is also pointing out that one effect of creating the model is variable creation, which is why it needs to happen under `strategy.scope`.
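
For example (a small sketch, assuming `strategy` is the `ParameterServerStrategyV2` instance from the RFC example; the variables only come into existence when the layer is built):

```python
import tensorflow as tf

with strategy.scope():  # `strategy` is the ParameterServerStrategyV2 from above
  # No variables exist yet; they are created lazily when the layer is built.
  dense = tf.keras.layers.Dense(10)
  dense.build(input_shape=(None, 32))  # kernel and bias variables created here

# Under the proposed strategy, its variable creator places these variables on
# parameter server devices (round-robin across parameter servers).
for v in dense.variables:
  print(v.name, v.device)
```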

yuefengz

comment created time in 2 months

Pull request review commenttensorflow/community

Single-Client Parameter Server Training

+# Single-client Parameter Server Training++| Status        | Proposed                                             |+:-------------- |:---------------------------------------------------- |+| **Author(s)** | Yuefeng Zhou (yuefengz@google.com), Rick Chao (rchao@google.com) |+| **Sponsor**   | Priya Gupta (priyag@google.com)                      |+| **Updated**   | 2018-03-06                                           |++## Background++Parameter server training is a very commonly used distributed training architecture. It is especially relevant to models with large embeddings, for training on large clusters of machines with CPUs, or when scalability is preferred over determinism. Its high-level idea is to create variables on parameter servers and in each step let workers take different training input, pull variable values from parameter servers, compute gradients and send them to parameter servers.+++### Distribution Strategy++Distribution Strategy (`tf.distribute.Strategy`) is a library that aims to allow users to write simple model code and scale up their models automatically with decent out-of-box performance. We will design parameter server training under the umbrella of `tf.distribute.Strategy` in TensorFlow 2 with a single-client architecture, in contrast to the multi-client approach which is traditionally done in TensorFlow distributed training such as `tf.estimator.Estimator` or `tf.distribute.experimental.MultiWorkerMirroredStrategy`.++Distribution Strategy’s [custom training loop](https://www.tensorflow.org/tutorials/distribute/custom_training) (CTL) API has been popular among users who want  more control in writing their training loops. User community of this API is large. We would like to focus on supporting CTL API first and later abstract out a commonly used pattern for Keras `compile`/`fit` API.+++### Single-Client Distributed Training++We believe that a single-client architecture can provide a simpler programming model than multi-client setup. A single source of truth can avoid bugs due to inconsistencies in multi-client setup. Furthermore, a single source of control can enable more determinism. In extreme cases, it can launch long-running tasks and turn into multi-client effectively.+++## Goal++The goal of this project is to support multi-worker asynchronous training with `ParameterServerStrategy` and CTL API, and in the long term also Keras `model.fit()`. In the first stage of this project, we focus more on design ideas rather than the APIs.++The goal of this document is to discuss the high-level design and challenges of various pieces needed to support single-client parameter server training. Detailed designs for some pieces may be directed to other documents.+++## Overview+++### Programming Model++With a single-client architecture, the programming model will be different than the multi-client architecture. All workers and parameter servers are standard TensorFlow servers, and the user-defined program will run on the client only. Generally, no matter what high-level APIs users will use, the workflow for running a step function distributedly with single-client approach includes the following steps:+++1. Connect to all remote workers and parameter servers. +2. Create variables on parameter servers and hold references to them.+3. Create datasets and iterators on workers.+4. Create the replica function that takes an iterator as input, trace it and register it on all workers.  Note: a function may create variables as well. 
If not specified, they will be created on parameter servers as well.+5. Dispatch the step function on one available worker.+6. Repeat 5 until the end of epoch.+7. Repeat 5 - 6 until the stop criteria is reached.+++### Interfaces++One of our goals is to make `ParameterServerStrategy`’s API consistent with other strategies so that users can easily switch between them. This may be challenging due to the fundamental difference between synchronous and asynchronous training. Therefore, we try to use most of the Distribution Strategy’ APIs but occasionally we will create APIs that still make sense in other strategies as well.++**Note: all proposed APIs in this document are tentative and only valid for our first version. We will revisit them once we get enough feedback from users and the community.**+++#### Constraints++Function is first-class citizen. Users should only schedule functions instead of running individual ops, in addition to creating variables.++Users can occasionally run individual ops on the client, only for reporting purposes such as printing a metric’s value.+++#### Schedule/Join Primitives++The `strategy.run` API was initially developed for synchronous training. We propose a new pair of primitives to+++*   hide the details of load-balancing, fault tolerance and dynamic scheduling+*   expose the non-blocking semantics to users.++To enable scheduling a function on any worker, we recommend users create the same dataset, but may shuffle differently, on all workers via `strategy.experimental_distribute_datasets_from_function` API.+++```python+class ParameterServerStrategyV2:+  +  def schedule(self, replica_fn, args=(), kwargs=()):+    """Schedule the `replica_fn` on all replicas in a sync group (a worker).+   +    Schedule the `replica_fn` on all replicas in a sync group that is available,    +    returns a future of PerReplica immediately if `function` has return values.++    It implements at-least-once semantics for function execution. If a worker+    fails, it will try to reschedule the function on another replica group or throw     +    an exception to users. So this method assumes that function execution can be +    out of order and function inputs are shared between sync groups. ++    We don't support the cases where `args` or `kwargs` are bound to a specific+    sync group. We will consider supporting them in the future.++    If there are barriers in `replica_fn`, it is users' responsibility to make+    sure they won't cause deadlock.+    """+    pass+  +  def join(self):+    """Wait until all scheduled functions are finished.++    Raises an error if any of the functions fails to execute. In this case,+    there is no guarantee that non-failing functions will complete.

Based on recent discussion, we plan to revise this so that join() returns once both of these conditions are satisfied: 1) no functions are left in the queue to be scheduled, and 2) no function is being executed on remote workers. Before that happens, it blocks (for now without a timeout).

Also, we may design it in a way that if a worker fails to execute a function, the function goes back into the queue for other workers to execute. With this, there would not be the complication of the user needing to handle an error (since the user may not know what to do, other than retry, anyway).
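
Roughly like this (an illustrative sketch only, not the proposed implementation): a shared queue of closures, one thread per remote worker that puts a closure back when its worker fails, and a join() that blocks until nothing is queued or in flight.

```python
import queue
import threading

class ClosureQueue(object):
  """Illustrative only: tracks scheduled closures so that join() can block
  until everything has finished, re-queueing closures whose worker failed."""

  def __init__(self):
    self._queue = queue.Queue()
    self._pending = 0  # closures scheduled but not yet finished successfully
    self._cv = threading.Condition()

  def schedule(self, closure):
    with self._cv:
      self._pending += 1
    self._queue.put(closure)

  def worker_loop(self, run_on_worker):
    """Run by one thread per remote worker; `run_on_worker` executes a closure
    on that worker and raises if the worker fails (stand-in for the real RPC)."""
    while True:
      closure = self._queue.get()
      try:
        run_on_worker(closure)
      except Exception:  # pylint: disable=broad-except
        self._queue.put(closure)  # hand the closure back for another worker
        continue
      with self._cv:
        self._pending -= 1
        self._cv.notify_all()

  def join(self):
    """Returns only when no closure is queued and none is executing remotely."""
    with self._cv:
      self._cv.wait_for(lambda: self._pending == 0)
```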

yuefengz

comment created time in 2 months

Pull request review commenttensorflow/community

Single-Client Parameter Server Training

+4. Create the replica function that takes an iterator as input, trace it and register it on all workers. Note: a function may create variables as well. If not specified, they will be created on parameter servers as well.

To the second question: since only the client is responsible for creating variables, they don't need to be synced. This is one of the benefits of the single-client design.
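A minimal sketch of that flow from the client's side, written against the proposed `ParameterServerStrategyV2` and a user-provided `cluster_resolver` (tentative API names taken from this RFC, not an existing TensorFlow symbol): the variable is created once by the client and placed on a parameter server, so there is no per-worker copy to reconcile.

```python
import tensorflow as tf

# Proposed API from the RFC; names are tentative and may change.
strategy = ParameterServerStrategyV2(cluster_resolver)

with strategy.scope():
  # Created exactly once, by the client. The strategy places the variable on
  # a parameter server; workers read and update that single copy remotely,
  # so there is nothing to synchronize between workers.
  step_counter = tf.Variable(0, dtype=tf.int64)
```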

yuefengz

comment created time in 2 months

Pull request review comment tensorflow/community

Single-Client Parameter Server Training

+#### Custom Training Loop
+
+To construct a custom training loop, users need to
+
+*   use `strategy.experimental_distribute_datasets_from_function` to create one dataset per worker. At this point, we recommend against using `strategy.experimental_distribute_dataset`.
+*   create models under `strategy.scope` so variables will be assigned to parameter servers.
+*   likewise, create a Keras metric object under `strategy.scope`. We expect the metric variables to be stored on parameter servers. Each worker, within its `replica_fn`, updates the metric states.
+*   use `strategy.schedule` to schedule the `replica_fn` on one remote worker. This `replica_fn` should take an iterator and perform forward and backward computation. `strategy.schedule` only schedules the `replica_fn` and returns one or several `Future` objects immediately.
+*   use `strategy.local_results` to get concrete values of the results returned by `strategy.schedule`. This may be a blocking call if the result is not yet ready. Any failure that cannot be handled will be ignored and, as a result, some of the results may be `None`.
+*   call `strategy.join` to wait until all scheduled functions are executed.
+
+# Connect to remote servers with a user-provided `ClusterResolver` object.
+strategy = ParameterServerStrategyV2(cluster_resolver)
+
+dataset_fn = ...  # a function that returns a dataset
+
+# Clone the dataset on all workers, shuffled with different seeds.
+distributed_dataset = strategy.experimental_distribute_datasets_from_function(

I think the API should follow the base class's contract to the extent possible: 1) it needs to return a `tf.data.Dataset`, and 2) the Dataset should have a per-replica batch size. As yuefengz@ said, verification should be done.
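A minimal sketch of a `dataset_fn` that satisfies both points, assuming it is passed a `tf.distribute.InputContext` the way `experimental_distribute_datasets_from_function` does in existing strategies; whether this strategy passes the same argument is an open detail of the proposal, and `GLOBAL_BATCH_SIZE` plus the synthetic data are placeholders.

```python
import tensorflow as tf

GLOBAL_BATCH_SIZE = 64  # placeholder value for illustration

def dataset_fn(input_context):
  # Synthetic data standing in for a real input pipeline.
  features = tf.random.uniform((1024, 10))
  labels = tf.random.uniform((1024,), maxval=2, dtype=tf.int64)
  dataset = tf.data.Dataset.from_tensor_slices((features, labels))
  # Shuffle differently on each worker, as the proposal recommends.
  dataset = dataset.shuffle(1024, seed=input_context.input_pipeline_id)
  # 1) return a tf.data.Dataset, 2) batched with the per-replica batch size.
  per_replica_batch_size = input_context.get_per_replica_batch_size(
      GLOBAL_BATCH_SIZE)
  return dataset.batch(per_replica_batch_size).repeat()
```

Verification could then simply assert that the object returned by `dataset_fn` is a `tf.data.Dataset` before it is cloned onto the workers.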

yuefengz

comment created time in 2 months

issue comment tensorflow/tensorflow

[TF 2.0] tf.estimator.ProfilerHook... is not compatible with eager execution

Discussed with @reedwm and we think the best solution is to un-export ProfilerHook from TF2. Will send out a fix for this.
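For reference, a minimal sketch of profiling that does work under eager execution in TF2, using the `tf.profiler.experimental` API; this is only one possible substitute and not necessarily the fix referred to above. The log directory and the toy step function are placeholders.

```python
import tensorflow as tf

@tf.function
def train_step():
  # Stand-in computation for a real training step.
  return tf.reduce_sum(tf.random.uniform((100, 100)))

# Capture a trace of a few steps and write it where TensorBoard's profiler
# plugin can read it.
tf.profiler.experimental.start('/tmp/tf_profile')  # assumed log directory
for _ in range(5):
  train_step()
tf.profiler.experimental.stop()
```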

lendle

comment created time in 3 months

PR opened yuefengz/community

Fix extra period.

Fix extra period.

+1 -1

0 comment

1 changed file

pr created time in 3 months

push event rchao/community

Rick Chao

commit sha 55c142a14589678bc8df5811d0e70130badd2f70

Fix extra period.

view details

push time in 3 months

fork rchao/community

Stores documents used by the TensorFlow developer community

fork in 3 months
