profile
viewpoint

shivaram/matrix-bench 6

Single machine matrix benchmarks to compare various implementations

tomerk/spark 2

Mirror of Apache Spark

mbalazin/cse599c-17sp-projects 1

Final class projects for CSE599c-17sp Big Data Management Systems

tomerk/models 1

Models and examples built with TensorFlow

shivaram/keystone 0

The biggest, baddest pipelines around.

shivaram/spark-ml 0

proposal for the new interfaces

tomerk/addons 0

Useful extra functionality for TensorFlow 2.0 maintained by SIG-addons

tomerk/baselines 0

OpenAI Baselines: high-quality implementations of reinforcement learning algorithms

tomerk/benchmark 0

Large scale query engine benchmark

tomerk/calculating-spaceweather-keywords 0

calculate space-weather keywords (using python)

issue commenttensorflow/tensorflow

Optimizer clipvalue and clipnorm not working in Tensorflow 2.0

@karlhjm no, we still disable it in 2.2 w/ distribution strategies enabled.

@zaccharieramzi yes, happy to elaborate: There's two possible places to clip when you have distribution strategies enabled:

  • before gradients get aggregated (usually wrong)
  • after gradients get aggregated (usually right & what people expect)

We want it working w/ the second case (clipping after gradients are aggregated). The issue is the optimizers are written with clipping happening in the code before aggregation does.

We looked into changing this, but it would have required either:

  1. api changes that break existing users of optimizer apply_gradients/other non-minimize methods
  2. changing the signatures of methods optimizer implementers need to implement, breaking existing custom optimizers

So rather than:

  • quietly doing clipping in the wrong place
  • increasing churn & breaking existing users or existing custom optimizers just for this individual feature

We instead decided to leave this disabled for now. We'll roll support for this into a larger optimizer refactoring that solves a larger set of issues. (RFC for that is at https://github.com/tensorflow/community/pull/234)

bloc97

comment created time in 6 days

issue commenttensorflow/tensorflow

Dataset iterating different behavior in TF 2.1 and 2.2

@omalleyt12 is this one of the known regressions that have fixes for 2.3 that didn't quite make it in to 2.2? Or is this something to add to the list?

sondracek

comment created time in 11 days

issue closedtensorflow/tensorflow

ImageAugmentation using tf.keras.preprocessing.image.ImageDataGenerator and tf.datasets: model.fit() is running infinitely

What I need help with / What I was wondering I am facing issue while running the fit() function in TensorFlow(v 2.2.0-rc4) with augmented images(using ImageDataGenerator) passed as a dataset. The fit() function is running infinitely without stopping.

What I've tried so far I tried it with the default code which was shared in Tensorflow documentation.

Please find the code snippet below:

import tensorflow as tf from tensorflow.keras.preprocessing.image import ImageDataGenerator from tensorflow.keras.models import Sequential, Model from tensorflow.keras.layers import Dense, Dropout, Flatten from tensorflow.keras.layers import Conv2D, MaxPooling2D from tensorflow.keras.layers import Input, Dense

flowers = tf.keras.utils.get_file( 'flower_photos', 'https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz', untar=True)

img_gen = tf.keras.preprocessing.image.ImageDataGenerator(rescale=1./255, rotation_range=20)

images, labels = next(img_gen.flow_from_directory(flowers))

print(images.dtype, images.shape) print(labels.dtype, labels.shape)

train_data_gen = img_gen.flow_from_directory( batch_size=32, directory=flowers, shuffle=True, target_size=(256, 256), class_mode='categorical')

ds = tf.data.Dataset.from_generator(lambda: train_data_gen, output_types=(tf.float32, tf.float32), output_shapes=([32, 256, 256, 3], [32, 5]) )

ds = ds.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)

it = iter(ds) batch = next(it) print(batch)

def create_model(): model = Sequential() model.add(Conv2D(32, (3, 3), activation='relu', input_shape=images[0].shape)) model.add(Conv2D(32, (3, 3), activation='relu')) model.add(MaxPooling2D(pool_size=(2, 2))) model.add(Dropout(0.5)) model.add(Conv2D(64, (3, 3), activation='relu')) model.add(Conv2D(64, (3, 3), activation='relu')) model.add(MaxPooling2D(pool_size=(2, 2))) model.add(Dropout(0.5)) model.add(Flatten()) model.add(Dense(64, activation='relu')) model.add(Dropout(0.5)) model.add(Dense(5, activation='softmax')) return model

model = create_model() model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=["accuracy"]) model.fit(ds, verbose=1, batch_size= 32, epochs =1)

This last line of code - fit() is running infinitly without stopping. I had also tried passing steps_per_epoch = total_no_of_train_records/batch_size.

It would be nice if... I would like you to confirm whethere this is a bug in the tensorflow datasets package and in which release will this be fixed.

Environment information

  • System: Google colaborator
  • Python version: v3.6.9
  • `tensorflow version: v2.2.0-rc4

closed time in 14 days

Thomas-K-John

issue commenttensorflow/tensorflow

ImageAugmentation using tf.keras.preprocessing.image.ImageDataGenerator and tf.datasets: model.fit() is running infinitely

@driedler: We made a roughly equivalent change to the keras_preprocessing pip package, but users have to manually upgrade it to get that effect.

We've also submitted this: https://github.com/tensorflow/tensorflow/commit/b53ed4d560aaeb7a92185f4fbf2562e5e274456a

which will affect the next version of tensorflow. We're unlikely to patch this into the existing release, so if you want the fix right away go ahead and update keras_preprocessing.

We'll go ahead and close this issue for now.

Thomas-K-John

comment created time in 14 days

issue commenttensorflow/tensorflow

ImageAugmentation using tf.keras.preprocessing.image.ImageDataGenerator and tf.datasets: model.fit() is running infinitely

The steps argument for evaluate and predict is called steps, not steps_per_epoch. You could go ahead and set that. For validation the argument is validation_steps.

Alternatively, you could put .take(num_steps) at the end of your dataset to force it to have a fixed size.

If we patch the release to handle the case where the generator is directly passed to fit/evaluate/predict it should deal with all of these cases, so you shouldn't need to file any other bugs. We've submitted the code fix already in the TF nightlies, it's just a question of whether it'll go in 2.3 or be patched into 2.2.

Thomas-K-John

comment created time in 20 days

issue commenttensorflow/tensorflow

ImageAugmentation using tf.keras.preprocessing.image.ImageDataGenerator and tf.datasets: model.fit() is running infinitely

So two things because It's also come to my attention that apparently the ImageDataGenerator isn't always supposed to loop forever.


  1. When you use tf.data from_generator and specify a shape, the generator must always yield objects of that shape. In your call you're specifying a fixed batch size as part of the shape:
ds = tf.data.Dataset.from_generator(lambda: train_data_gen,
output_types=(tf.float32, tf.float32),
output_shapes=([32, 256, 256, 3],
[32, 5])
)

When it sees a partial batch of 22 on the last step the shapes do not match so it errors. You can fix this by leaving the batch size unspecified (as None) in from_generator:

ds = tf.data.Dataset.from_generator(lambda: train_data_gen,
output_types=(tf.float32, tf.float32),
output_shapes=([None, 256, 256, 3],
[None, 5])
)

  1. Apparently there is a setting where the ImageDataGenerator isn't supposed to loop forever and shouldn't require steps_per_epoch: If you pass the result of flow_from_directory directly to Keras fit without converting it to a dataset yourself. In this specific setting the len information attached to the ImageDataGenerator sequences has historically been used as an implicit steps_per_epoch.

(This does not happen if you manually loop over the generator w/ a for loop, as shown in the documentation of the ImageDataGenerator). This isn't the case for your example code because when you convert it to a dataset manually that cardinality information gets lost.

However it seems we did introduce a regression in that setting at some point. (Though we're not positive which exact version of TF it was introduced in). It did make its way into the TF 2.0 release, and we're currently exploring whether to do a patch release to fix it.

Thomas-K-John

comment created time in 21 days

issue commenttensorflow/tensorflow

ImageAugmentation using tf.keras.preprocessing.image.ImageDataGenerator and tf.datasets: model.fit() is running infinitely

The ImageDataGenerator returns an infinite number of values, so the epoch would never end unless you specify steps_per_epoch. The examples in ImageDataGenerator briefly mention this, but I think the documentation should be clearer about this. So I'll go ahead and update that. https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/image/ImageDataGenerator


I'm not sure why setting steps_per_epoch didn't work for you? When I set a value for steps_per_epoch in this colab it ends once it has completed that many steps. Perhaps it's just being slow? There are ~114 steps in one epoch over the dataset (3670 training images / 32 batch size). 114 * 7 sec / step (in this colab) is close to 15 minutes of runtime.

Thomas-K-John

comment created time in 21 days

create barnchtomerk/tf_structured_sparsity

branch : master

created branch time in 24 days

created repositorytomerk/tf_structured_sparsity

created time in 24 days

issue commenttensorflow/addons

Packaging learning rate schedules with addons?

A combination of a github search and a search through our internal codebases.

Github produces enough search results that my estimates could be wrong, but for noisylinearcosinedecay and linearcosinedecay I seemed to just find forks of tensorflow: https://github.com/search?l=Python&q=%22keras.experimental.noisylinearcosinedecay%22&type=Code https://github.com/search?l=Python&q=%22keras.experimental.linearcosinedecay%22&type=Code (other than maybe < 5 usages each)

However, github does show code results for CosineDecay that aren't just forks of tensorflow core: https://github.com/search?l=Python&q=%22keras.experimental.cosinedecay%22&type=Code (After 2-3 pages it just returns forks of tensorflow)

tomerk

comment created time in a month

issue openedtensorflow/addons

Packaging learning rate schedules with addons?

We have several experimental learning rate schedules in TensorFlow/Keras currently:

https://www.tensorflow.org/api_docs/python/tf/keras/experimental/CosineDecay tf.experimental.CosineDecay tf.experimental.CosineDecayRestarts tf.experimental.LinearCosineDecay tf.experimental.NoisyLinearCosineDecay

These have been marked experimental for several releases, but have seen little to no usage (0 for linearcosine/noisylinearcosine, and only a handful of usages for CosineDecay/CosineDecayRestarts)

So, we're thinking of deprecating these in the next TF release, and dropping them entirely in the release after that.

Would Tensorflow-addons be interested in packaging learning rate schedules that have some users but not quite enough to include as part of tensorflow core? E.g. CosineDecay / cosine decay w/ restarts in this case. As I've found zero users of LinearCosineDecay/NoisyLinearCosineDecay it's not clear that they should be included in addons either.

created time in a month

issue commenttensorflow/tensorflow

tf.keras custom layer does not use "compute_output_shape"

This is a difference between tf.keras and the older OSS keras releases.

As an immediate stop-gap you should be able to manually set the output shape on the tensor and be okay, e.g.

        y = tf.sqrt(tf.abs(y))[:, :, :self.num_freqs]
        y.set_shape(self.compute_output_shape(x.shape))
        return y

Currently tf.keras uses compute_output_shape to set the output shape only when layers are dynamic and can only be run eagerly.

In the medium-term we need to figure out whether it makes sense for Keras to automatically set the output shape to the result of compute_output_shape whenever compute_output_shape is implemented, rather than just for dynamic layers.

kbrose

comment created time in a month

issue commenttensorflow/tensorflow

DistributionStrategy and Keras models: support for sample_weight_mode

Hi djo-koconi@, due to their degree of required technical complexity and relatively niche use cases, target_tensors have actually been removed entirely in recent versions of tensorflow unless you're running legacy code with eager disabled + compat.v1 apis

The recommended way to achieve the same effect as target_tensors is to override your model's training step to pass the target into the model at training time.

This thread from fchollet@ demonstrates what overriding train_step looks like (in this case for training GANs, but the idea remains the same): https://mobile.twitter.com/fchollet/status/1250622989541838848

We are also working on guides that will go more into detail in how to go about overriding different components of the training/evaluation loops to handle unique training workflows.

pawarrick

comment created time in a month

Pull request review commenttensorflow/community

RFC: Easily Customizable Optimizer.minimize

+# Easily Customizable `Optimizer.minimize`+++| Status        | Proposed       |+:-------------- |:---------------------------------------------------- |+| **RFC #**     | [234](https://github.com/tensorflow/community/pull/234) |+| **Author(s)** | [omalleyt12@](https://github.com/omalleyt12) |+| **Sponsor**   | apassos@, fchollet@, karmel@                 |+| **Updated**   | 2020-04-20                                           |++## Objective++Create an `Optimizer` API that gives `Optimizer` subclasses full control of gradient updates. The API should ensure `Optimizer`s can be accessed via a unified API, and will not leak abstractions. Training loops should not be required to know the internal details of how the `Optimizer` chooses to:++* Scale losses and gradients++* Aggregate gradients++* Clip gradients++* etc++We also need to ensure we maintain endpoints with maximum flexibility for those users who do want control over these items.++By creating this API, it will enable users to write training loops that are interoperable with a wide range of Optimizers.++Specific use cases considered:++* Gradient clipping++* Mixed precision++* `Horovod`++## Background++During backpropagation, there are 6 possible actions that can be taken when starting from a loss Tensor and ending with a Variable update:++(1) Transform the loss++(2) Calculate the gradients++(3) Transform the unaggregated (per-device) gradients++(4) Aggregate the gradients (across devices)++(5) Transform the aggregated gradients++(6) Apply a variable update based on the gradients++We currently have three Optimizer endpoints that start at different points in this process:++* `Optimizer.minimize` - handles 1-6++* `Optimizer.apply_gradients(..., experimental_aggregate_gradients=True)` - handles 4-6++* `Optimizer.apply_gradients(..., experimental_aggregate_gradients=False)` - handles 6++However, there is no easy way for Optimizer subclasses to support custom logic in these steps. This proposal suggests a refactoring of the Optimizer class to achieve these goals.+++## Motivation++This section discusses the experience of supporting mixed-precision and Horovod in Keras’s built-in training logic (hereafter called Model.fit).++Keras now allows users to write custom training logic for their `Model`s via overriding `Model.train_step`: [code](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/keras/engine/training.py#L538). The default implementation of this method is 8 lines long, and fully supports all types of `Model`s, `loss`es, `metric`s, etc that Keras supports. It attempts to serve as a reference that users can copy / paste to start writing their own training logic.++The only remaining pain point is the call to `_minimize` here: [code](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/keras/engine/training.py#L1873). This logic is necessary because details of whether an `Optimizer` needs to transform the loss, clip the gradients, perform custom aggregation, etc have leaked into the main training loop code.++Despite the complexity of `_minimize`, it covers only a small subset of possible optimization logic. Keras continues to receive valid requests to support more custom optimization logic (including adding hooks for different aggregation methods, different methods of loss reduction, etc). To continue expanding support for these items, Keras needs to rely on a unified API that keeps `Optimizer` implementation details from leaking into the main training loop code.++The proposal below shows how this can be accomplished, and the examples section shows how this can be applied to 3 use cases: gradient clipping, mixed precision, and `Horovod`.++### Custom training loops:++The logic above also applies to custom training loops. The design should allow custom training loops to be written so that they work with any `Optimizer`.+++## User Benefit++This design will allow users to write full-featured training loops that work for all `Optimizer`s. This design will allow users to easily perform custom gradient clipping and other transformations.++## Design Proposal++`Optimizer` class:++```python+class Optimizer(object):+  def __init__(self,+               transform_gradients=None,+               aggregate_gradients=all_reduce_sum):+     self.aggregate_gradients_fn = aggregate_gradients+     self.transform_gradients_fns = transform_gradients_fns++  def _transform_loss(self, loss):+    # Can be overridden in subclasses+    return loss++  def _get_gradients(self, loss, variables, tape):+    # Can be overridden to use jacobian, etc.+    return tape.gradient(loss, variables)++  def _transform_unaggregated_gradients(self, grads_and_vars):+    # Can be overridden in subclasses+    return grads_and_vars++  def _aggregate_gradients(self, grads_and_vars):+    # Can still be overridden in subclasses if needed+    if self.aggregate_gradients_fn:+      grads_and_vars = self.aggregate_gradients_fn(+         grads_and_vars)+    return grads_and_vars++  def _transform_gradients(self, grads_and_vars):+    # Can still be overridden in subclasses if needed+    if self.transform_gradients_fns:+      for fn in self.transform_gradients_fns:+        grads_and_vars = fn(grads_and_vars)+    return grads_and_vars+   +  def _apply_updates(self, distribution, grads_and_vars, ...):+    # Calls _resource_apply_{dense | sparse}+    # Variable updating math is still in _resource_apply_{dense | sparse}+  +  def minimize(self, loss, variables, tape=None):+    grads_and_vars = self.compute_gradients(loss, variables, tape)+    self.apply_gradients(grads_and_vars)++  def compute_gradients(+      self,+      loss,+      variables,+      tape=None,+      all_reduce_sum_gradients=False):+    if is_tensor(loss) and not tape:+      raise ValueError('Must provide tape with tensor loss.')+    tape = tape or GradientTape()+    with tape:+      if callable(loss):+        loss = loss()+      loss = self._transform_loss(loss) # A no-op in our built-in optimizers+    gradients = self._get_gradients(loss, variables, tape)+    grads_and_vars = zip(gradients, variables)+    grads_and_vars = self._transform_unaggregated_gradients(grads_and_vars)+    if all_reduce_sum_gradients:+      grads_and_vars = self._aggregate_gradients(grads_and_vars)+      grads_and_vars = self._transform_gradients(grads_and_vars)+    return grads_and_vars++  def apply_gradients(self, grads_and_vars, aggregate=True):+    if aggregate:+      grads_and_vars = self._aggregate_gradients(grads_and_vars)+      grads_and_vars = self._transform_gradients(grads_and_vars)  # No-op by default

Again for this check: shouldn't this transformation happen regardless of aggregation? Are we guarding it because there's the dynamic where aggregate_gradients must be called once to avoid double-aggregation?

omalleyt12

comment created time in a month

Pull request review commenttensorflow/community

RFC: Easily Customizable Optimizer.minimize

+# Easily Customizable `Optimizer.minimize`+++| Status        | Proposed       |+:-------------- |:---------------------------------------------------- |+| **RFC #**     | [234](https://github.com/tensorflow/community/pull/234) |+| **Author(s)** | [omalleyt12@](https://github.com/omalleyt12) |+| **Sponsor**   | apassos@, fchollet@, karmel@                 |+| **Updated**   | 2020-04-20                                           |++## Objective++Create an `Optimizer` API that gives `Optimizer` subclasses full control of gradient updates. The API should ensure `Optimizer`s can be accessed via a unified API, and will not leak abstractions. Training loops should not be required to know the internal details of how the `Optimizer` chooses to:++* Scale losses and gradients++* Aggregate gradients++* Clip gradients++* etc++We also need to ensure we maintain endpoints with maximum flexibility for those users who do want control over these items.++By creating this API, it will enable users to write training loops that are interoperable with a wide range of Optimizers.++Specific use cases considered:++* Gradient clipping++* Mixed precision++* `Horovod`++## Background++During backpropagation, there are 6 possible actions that can be taken when starting from a loss Tensor and ending with a Variable update:++(1) Transform the loss++(2) Calculate the gradients++(3) Transform the unaggregated (per-device) gradients++(4) Aggregate the gradients (across devices)++(5) Transform the aggregated gradients++(6) Apply a variable update based on the gradients++We currently have three Optimizer endpoints that start at different points in this process:++* `Optimizer.minimize` - handles 1-6++* `Optimizer.apply_gradients(..., experimental_aggregate_gradients=True)` - handles 4-6++* `Optimizer.apply_gradients(..., experimental_aggregate_gradients=False)` - handles 6++However, there is no easy way for Optimizer subclasses to support custom logic in these steps. This proposal suggests a refactoring of the Optimizer class to achieve these goals.+++## Motivation++This section discusses the experience of supporting mixed-precision and Horovod in Keras’s built-in training logic (hereafter called Model.fit).++Keras now allows users to write custom training logic for their `Model`s via overriding `Model.train_step`: [code](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/keras/engine/training.py#L538). The default implementation of this method is 8 lines long, and fully supports all types of `Model`s, `loss`es, `metric`s, etc that Keras supports. It attempts to serve as a reference that users can copy / paste to start writing their own training logic.++The only remaining pain point is the call to `_minimize` here: [code](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/keras/engine/training.py#L1873). This logic is necessary because details of whether an `Optimizer` needs to transform the loss, clip the gradients, perform custom aggregation, etc have leaked into the main training loop code.++Despite the complexity of `_minimize`, it covers only a small subset of possible optimization logic. Keras continues to receive valid requests to support more custom optimization logic (including adding hooks for different aggregation methods, different methods of loss reduction, etc). To continue expanding support for these items, Keras needs to rely on a unified API that keeps `Optimizer` implementation details from leaking into the main training loop code.++The proposal below shows how this can be accomplished, and the examples section shows how this can be applied to 3 use cases: gradient clipping, mixed precision, and `Horovod`.++### Custom training loops:++The logic above also applies to custom training loops. The design should allow custom training loops to be written so that they work with any `Optimizer`.+++## User Benefit++This design will allow users to write full-featured training loops that work for all `Optimizer`s. This design will allow users to easily perform custom gradient clipping and other transformations.++## Design Proposal++`Optimizer` class:++```python+class Optimizer(object):+  def __init__(self,+               transform_gradients=None,+               aggregate_gradients=all_reduce_sum):+     self.aggregate_gradients_fn = aggregate_gradients+     self.transform_gradients_fns = transform_gradients_fns++  def _transform_loss(self, loss):+    # Can be overridden in subclasses+    return loss++  def _get_gradients(self, loss, variables, tape):+    # Can be overridden to use jacobian, etc.+    return tape.gradient(loss, variables)++  def _transform_unaggregated_gradients(self, grads_and_vars):+    # Can be overridden in subclasses+    return grads_and_vars++  def _aggregate_gradients(self, grads_and_vars):+    # Can still be overridden in subclasses if needed+    if self.aggregate_gradients_fn:+      grads_and_vars = self.aggregate_gradients_fn(+         grads_and_vars)+    return grads_and_vars++  def _transform_gradients(self, grads_and_vars):+    # Can still be overridden in subclasses if needed+    if self.transform_gradients_fns:+      for fn in self.transform_gradients_fns:+        grads_and_vars = fn(grads_and_vars)+    return grads_and_vars+   +  def _apply_updates(self, distribution, grads_and_vars, ...):+    # Calls _resource_apply_{dense | sparse}+    # Variable updating math is still in _resource_apply_{dense | sparse}+  +  def minimize(self, loss, variables, tape=None):+    grads_and_vars = self.compute_gradients(loss, variables, tape)+    self.apply_gradients(grads_and_vars)++  def compute_gradients(+      self,+      loss,+      variables,+      tape=None,+      all_reduce_sum_gradients=False):+    if is_tensor(loss) and not tape:+      raise ValueError('Must provide tape with tensor loss.')+    tape = tape or GradientTape()+    with tape:+      if callable(loss):+        loss = loss()+      loss = self._transform_loss(loss) # A no-op in our built-in optimizers+    gradients = self._get_gradients(loss, variables, tape)+    grads_and_vars = zip(gradients, variables)+    grads_and_vars = self._transform_unaggregated_gradients(grads_and_vars)+    if all_reduce_sum_gradients:+      grads_and_vars = self._aggregate_gradients(grads_and_vars)+      grads_and_vars = self._transform_gradients(grads_and_vars)

+1

omalleyt12

comment created time in a month

Pull request review commenttensorflow/community

RFC: Easily Customizable Optimizer.minimize

+# Easily Customizable `Optimizer.minimize`+++| Status        | Proposed       |+:-------------- |:---------------------------------------------------- |+| **RFC #**     | [234](https://github.com/tensorflow/community/pull/234) |+| **Author(s)** | [omalleyt12@](https://github.com/omalleyt12) |+| **Sponsor**   | apassos@, fchollet@, karmel@                 |+| **Updated**   | 2020-04-20                                           |++## Objective++Create an `Optimizer` API that gives `Optimizer` subclasses full control of gradient updates. The API should ensure `Optimizer`s can be accessed via a unified API, and will not leak abstractions. Training loops should not be required to know the internal details of how the `Optimizer` chooses to:++* Scale losses and gradients++* Aggregate gradients++* Clip gradients++* etc++We also need to ensure we maintain endpoints with maximum flexibility for those users who do want control over these items.++By creating this API, it will enable users to write training loops that are interoperable with a wide range of Optimizers.++Specific use cases considered:++* Gradient clipping++* Mixed precision++* `Horovod`++## Background++During backpropagation, there are 6 possible actions that can be taken when starting from a loss Tensor and ending with a Variable update:++(1) Transform the loss++(2) Calculate the gradients++(3) Transform the unaggregated (per-device) gradients++(4) Aggregate the gradients (across devices)++(5) Transform the aggregated gradients++(6) Apply a variable update based on the gradients++We currently have three Optimizer endpoints that start at different points in this process:++* `Optimizer.minimize` - handles 1-6++* `Optimizer.apply_gradients(..., experimental_aggregate_gradients=True)` - handles 4-6++* `Optimizer.apply_gradients(..., experimental_aggregate_gradients=False)` - handles 6++However, there is no easy way for Optimizer subclasses to support custom logic in these steps. This proposal suggests a refactoring of the Optimizer class to achieve these goals.+++## Motivation++This section discusses the experience of supporting mixed-precision and Horovod in Keras’s built-in training logic (hereafter called Model.fit).++Keras now allows users to write custom training logic for their `Model`s via overriding `Model.train_step`: [code](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/keras/engine/training.py#L538). The default implementation of this method is 8 lines long, and fully supports all types of `Model`s, `loss`es, `metric`s, etc that Keras supports. It attempts to serve as a reference that users can copy / paste to start writing their own training logic.++The only remaining pain point is the call to `_minimize` here: [code](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/keras/engine/training.py#L1873). This logic is necessary because details of whether an `Optimizer` needs to transform the loss, clip the gradients, perform custom aggregation, etc have leaked into the main training loop code.++Despite the complexity of `_minimize`, it covers only a small subset of possible optimization logic. Keras continues to receive valid requests to support more custom optimization logic (including adding hooks for different aggregation methods, different methods of loss reduction, etc). To continue expanding support for these items, Keras needs to rely on a unified API that keeps `Optimizer` implementation details from leaking into the main training loop code.++The proposal below shows how this can be accomplished, and the examples section shows how this can be applied to 3 use cases: gradient clipping, mixed precision, and `Horovod`.++### Custom training loops:++The logic above also applies to custom training loops. The design should allow custom training loops to be written so that they work with any `Optimizer`.+++## User Benefit++This design will allow users to write full-featured training loops that work for all `Optimizer`s. This design will allow users to easily perform custom gradient clipping and other transformations.++## Design Proposal++`Optimizer` class:++```python+class Optimizer(object):+  def __init__(self,+               transform_gradients=None,+               aggregate_gradients=all_reduce_sum):+     self.aggregate_gradients_fn = aggregate_gradients+     self.transform_gradients_fns = transform_gradients_fns++  def _transform_loss(self, loss):+    # Can be overridden in subclasses+    return loss++  def _get_gradients(self, loss, variables, tape):+    # Can be overridden to use jacobian, etc.+    return tape.gradient(loss, variables)++  def _transform_unaggregated_gradients(self, grads_and_vars):+    # Can be overridden in subclasses+    return grads_and_vars++  def _aggregate_gradients(self, grads_and_vars):+    # Can still be overridden in subclasses if needed+    if self.aggregate_gradients_fn:+      grads_and_vars = self.aggregate_gradients_fn(+         grads_and_vars)+    return grads_and_vars++  def _transform_gradients(self, grads_and_vars):+    # Can still be overridden in subclasses if needed+    if self.transform_gradients_fns:+      for fn in self.transform_gradients_fns:+        grads_and_vars = fn(grads_and_vars)+    return grads_and_vars+   +  def _apply_updates(self, distribution, grads_and_vars, ...):+    # Calls _resource_apply_{dense | sparse}+    # Variable updating math is still in _resource_apply_{dense | sparse}+  +  def minimize(self, loss, variables, tape=None):+    grads_and_vars = self.compute_gradients(loss, variables, tape)+    self.apply_gradients(grads_and_vars)++  def compute_gradients(+      self,+      loss,+      variables,+      tape=None,+      all_reduce_sum_gradients=False):

why is the arg named all_reduce_sum_gradients even though users may have changed the aggregation from the default?

omalleyt12

comment created time in a month

Pull request review commenttensorflow/community

RFC: Easily Customizable Optimizer.minimize

+# Easily Customizable `Optimizer.minimize`+++| Status        | Proposed       |+:-------------- |:---------------------------------------------------- |+| **RFC #**     | [234](https://github.com/tensorflow/community/pull/234) |+| **Author(s)** | [omalleyt12@](https://github.com/omalleyt12) |+| **Sponsor**   | apassos@, fchollet@, karmel@                 |+| **Updated**   | 2020-04-20                                           |++## Objective++Create an `Optimizer` API that gives `Optimizer` subclasses full control of gradient updates. The API should ensure `Optimizer`s can be accessed via a unified API, and will not leak abstractions. Training loops should not be required to know the internal details of how the `Optimizer` chooses to:++* Scale losses and gradients++* Aggregate gradients++* Clip gradients++* etc++We also need to ensure we maintain endpoints with maximum flexibility for those users who do want control over these items.++By creating this API, it will enable users to write training loops that are interoperable with a wide range of Optimizers.++Specific use cases considered:++* Gradient clipping++* Mixed precision++* `Horovod`++## Background++During backpropagation, there are 6 possible actions that can be taken when starting from a loss Tensor and ending with a Variable update:++(1) Transform the loss++(2) Calculate the gradients++(3) Transform the unaggregated (per-device) gradients++(4) Aggregate the gradients (across devices)++(5) Transform the aggregated gradients++(6) Apply a variable update based on the gradients++We currently have three Optimizer endpoints that start at different points in this process:++* `Optimizer.minimize` - handles 1-6++* `Optimizer.apply_gradients(..., experimental_aggregate_gradients=True)` - handles 4-6++* `Optimizer.apply_gradients(..., experimental_aggregate_gradients=False)` - handles 6++However, there is no easy way for Optimizer subclasses to support custom logic in these steps. This proposal suggests a refactoring of the Optimizer class to achieve these goals.+++## Motivation++This section discusses the experience of supporting mixed-precision and Horovod in Keras’s built-in training logic (hereafter called Model.fit).++Keras now allows users to write custom training logic for their `Model`s via overriding `Model.train_step`: [code](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/keras/engine/training.py#L538). The default implementation of this method is 8 lines long, and fully supports all types of `Model`s, `loss`es, `metric`s, etc that Keras supports. It attempts to serve as a reference that users can copy / paste to start writing their own training logic.++The only remaining pain point is the call to `_minimize` here: [code](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/keras/engine/training.py#L1873). This logic is necessary because details of whether an `Optimizer` needs to transform the loss, clip the gradients, perform custom aggregation, etc have leaked into the main training loop code.++Despite the complexity of `_minimize`, it covers only a small subset of possible optimization logic. Keras continues to receive valid requests to support more custom optimization logic (including adding hooks for different aggregation methods, different methods of loss reduction, etc). To continue expanding support for these items, Keras needs to rely on a unified API that keeps `Optimizer` implementation details from leaking into the main training loop code.++The proposal below shows how this can be accomplished, and the examples section shows how this can be applied to 3 use cases: gradient clipping, mixed precision, and `Horovod`.++### Custom training loops:++The logic above also applies to custom training loops. The design should allow custom training loops to be written so that they work with any `Optimizer`.+++## User Benefit++This design will allow users to write full-featured training loops that work for all `Optimizer`s. This design will allow users to easily perform custom gradient clipping and other transformations.++## Design Proposal++`Optimizer` class:++```python+class Optimizer(object):+  def __init__(self,+               transform_gradients=None,+               aggregate_gradients=all_reduce_sum):+     self.aggregate_gradients_fn = aggregate_gradients+     self.transform_gradients_fns = transform_gradients_fns++  def _transform_loss(self, loss):+    # Can be overridden in subclasses

I think there is an interesting naming question around this that pops up elsewhere too: How should we name methods intended for overriding that users shouldn't call directly? We run into this w/ e.g. training step/ make_train_function overriding too.

omalleyt12

comment created time in a month

issue commenttensorflow/addons

TF addons compatibility with TF nightly / TF2.2

Hi @seanpmorgan, just to clarify here: We're trying to do some internal performance testing that I believe requires using a custom op.

So, @nnigania tried used tf nightly with tensorflow addons. If the custom ops are not compatible, does that mean the experiments need to build addons at head rather than pip installing addons?

Or, is there a way to try supressing the warning while still trying to use the custom op instead of the python composite op? (And if it segfaults so be it)

nnigania

comment created time in a month

issue closedtensorflow/tensorflow

tf.keras.layers.BatchNormalization() may not work in tf=2.0 and eager model is disable

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04 in Docker
  • TensorFlow installed from (source or binary): pip install
  • TensorFlow version (use command below): v2.0.0-rc2-26-g64c3d38
  • Python version: 3.5
  • CUDA/cuDNN version: 10.0 / 7
  • GPU model and memory: GTX 1080Ti / 11175MiB

Describe the current behavior

Hi authors and developers,

I am developing our project in tf=2.0.0 and eager_mode is disable.

The main reason is tf=1.x will not be maintained but third party libraries have not been ready for tf=2.0 yet.

This issues is a separate issues from #35050

This potential issue is somethine wrong if users do custom training with level API which includes tf.keras.layers.BatchNormalization() in tf=2.0 and eager model is disable.

I summary the testcaset as the following:

#%%
import tensorflow as tf
tf.compat.v1.disable_eager_execution()
#tf.compat.v1.disable_v2_behavior()

import numpy as np

batch_size = 100

def download_data():

    # get raw data
    (trainX, trainY), (testX, testY) = tf.keras.datasets.cifar10.load_data()
    trainX = trainX.astype(np.float32)
    testX  = testX.astype(np.float32)

    # ont-hot
    trainY = tf.keras.utils.to_categorical(trainY, 10)
    testY  = tf.keras.utils.to_categorical(testY , 10)

    # get validation sets
    training_size = 45000
    validX = trainX[training_size:,:]
    validY = trainY[training_size:,:]

    trainX = trainX[:training_size,:]
    trainY = trainY[:training_size,:]

    return trainX, trainY, validX, validY, testX, testY

def data_pipeline(dataX, dataY):

        dataset = tf.data.Dataset.from_tensor_slices( (dataX, dataY) )
        dataset = dataset.shuffle(batch_size * 8)
        dataset = dataset.repeat()
        dataset = dataset.batch(batch_size)
        dataset = dataset.prefetch(tf.data.experimental.AUTOTUNE)
        return dataset

class custom_model():
    def __init__(self):

        def Acc():
            acc = tf.keras.metrics.categorical_accuracy(label_ref, clf_out)
            return tf.math.reduce_mean(acc)

        def c_loss():
            loss = tf.keras.losses.categorical_crossentropy(label_ref, clf_out)
            loss = tf.math.reduce_mean(loss)
            return loss

        # create model
        clf_input = tf.keras.layers.Input(shape=(32,32,3), name="model/input")
        model = tf.keras.applications.resnet_v2.ResNet50V2(include_top=True, weights=None, input_tensor=clf_input, pooling='max', classes=10)
        #model = tf.keras.applications.vgg16.VGG16(include_top=True, weights=None, input_tensor=clf_input, pooling='max', classes=10)
        model.compile(loss='categorical_crossentropy', optimizer='SGD', metrics=['accuracy'])

        label_ref = tf.keras.layers.Input(shape=(10,) , name='label_ref')
        clf_out = model(clf_input)

        # using tf.keras.optimizers.Nadam would get error
        #optimizer = tf.keras.optimizers.Nadam(lr=0.0005)
        optimizer = tf.compat.v1.train.AdamOptimizer(learning_rate=0.01)
        self.train_op = optimizer.minimize(c_loss(), var_list=[model.trainable_variables])

        self.clf_model = model
        self.clf_input = clf_input
        self.label_ref = label_ref
        self.op_acc = Acc()
        self.c_loss = c_loss()

if __name__ == '__main__':

    # set GPU
    import os
    if os.environ.get("CUDA_VISIBLE_DEVICES") is None:
        os.environ["CUDA_VISIBLE_DEVICES"] = "0"

    # reset tf session
    tf.compat.v1.keras.backend.clear_session()
    gpu_options = tf.compat.v1.GPUOptions(allow_growth=True)
    sess = tf.compat.v1.Session(config=tf.compat.v1.ConfigProto(gpu_options=gpu_options))
    tf.compat.v1.keras.backend.set_session(sess) 

    # prepare data
    trainX, trainY, validX, validY, testX, testY = download_data()
    train_gen = data_pipeline(trainX, trainY)
    valid_gen = data_pipeline(validX, validY)
    test_gen = data_pipeline(testX, testY)

    # build targeted model
    model = tf.keras.applications.resnet_v2.ResNet50V2(include_top=True, weights=None, input_shape=(32,32,3), pooling='max', classes=10)
    #model = tf.keras.applications.vgg16.VGG16(include_top=True, weights=None, input_shape=(32,32,3), pooling=None, classes=10)
    model.compile(loss='categorical_crossentropy', optimizer='SGD', metrics=['accuracy'])

    # fit and evalutate
    model.fit(train_gen,
            steps_per_epoch = trainY.shape[0] // batch_size,
            validation_data = valid_gen,
            validation_steps= validY.shape[0] // batch_size,
            epochs=5,
            verbose=2)
    model.evaluate(testX, testY, verbose=2, batch_size=batch_size)

    # create a new model
    print('Make sure that we create a new model.')
    model = custom_model()
    sess.run(tf.compat.v1.global_variables_initializer())
    model.clf_model.evaluate(testX, testY, verbose=2, batch_size=batch_size)

    # train model
    num_epoch = 5
    total_len = trainY.shape[0] // batch_size
    tf_iter = tf.compat.v1.data.make_initializable_iterator(train_gen)
    tf_next = tf_iter.get_next()
    sess.run(tf_iter.initializer)
    for epoch in range(num_epoch):
        c_loss, acc = 0.0, 0.0
        for ii in range(total_len):
            X, Y = sess.run(tf_next)
            [b_c_loss, b_acc, _] = sess.run([model.c_loss, model.op_acc, model.train_op],
                                                feed_dict={ model.clf_input: X,
                                                            model.label_ref: Y,
                                                            tf.keras.backend.learning_phase(): 1})
            c_loss = c_loss + b_c_loss
            acc = acc + b_acc
        
        c_loss = c_loss / total_len
        acc = acc / total_len
        print('[Training]Epoch: {:d}/{:d} - loss: {:.3f} - acc: {:.3f}'.format(epoch+1, num_epoch, c_loss, acc) )

    print('Show loss and accuracy with keras API')
    model.clf_model.evaluate(trainX, trainY, verbose=2, batch_size=batch_size)
    model.clf_model.evaluate(validX, validY, verbose=2, batch_size=batch_size)
    model.clf_model.evaluate(testX, testY, verbose=2, batch_size=batch_size)

    print('Show loss and accuracy with low level API')
    # evaluate
    num_epoch = 1
    total_len = validY.shape[0] // batch_size
    tf_iter = tf.compat.v1.data.make_initializable_iterator(valid_gen)
    tf_next = tf_iter.get_next()
    sess.run(tf_iter.initializer)
    for epoch in range(num_epoch):
        c_loss_t, acc_t, c_loss_f, acc_f = 0.0, 0.0, 0.0, 0.0
        for ii in range(total_len):
            X, Y = sess.run(tf_next)
            [b_c_loss, b_acc] = sess.run([model.c_loss, model.op_acc],
                                        feed_dict={ model.clf_input: X,
                                                    model.label_ref: Y,
                                                    tf.keras.backend.learning_phase(): 1})
            c_loss_t = c_loss_t + b_c_loss
            acc_t = acc_t + b_acc

            [b_c_loss, b_acc] = sess.run([model.c_loss, model.op_acc],
                                        feed_dict={ model.clf_input: X,
                                                    model.label_ref: Y,
                                                    tf.keras.backend.learning_phase(): 0})
            c_loss_f = c_loss_f + b_c_loss
            acc_f = acc_f + b_acc

        c_loss_t = c_loss_t / total_len
        c_loss_f = c_loss_f / total_len
        acc_t = acc_t / total_len
        acc_f = acc_f / total_len
        print('[Validation][learning_phase=1] Epoch: {:d}/{:d} - loss: {:.3f} - acc: {:.3f}'.format(epoch+1, num_epoch, c_loss_t, acc_t) )
        print('[Validation][learning_phase=0] Epoch: {:d}/{:d} - loss: {:.3f} - acc: {:.3f}'.format(epoch+1, num_epoch, c_loss_f, acc_f) )

    # evaluate
    num_epoch = 1
    total_len = testY.shape[0] // batch_size
    tf_iter = tf.compat.v1.data.make_initializable_iterator(test_gen)
    tf_next = tf_iter.get_next()
    sess.run(tf_iter.initializer)
    for epoch in range(num_epoch):
        c_loss_t, acc_t, c_loss_f, acc_f = 0.0, 0.0, 0.0, 0.0
        for ii in range(total_len):
            X, Y = sess.run(tf_next)
            [b_c_loss, b_acc] = sess.run([model.c_loss, model.op_acc],
                                        feed_dict={ model.clf_input: X,
                                                    model.label_ref: Y,
                                                    tf.keras.backend.learning_phase(): 1})
            c_loss_t = c_loss_t + b_c_loss
            acc_t = acc_t + b_acc

            [b_c_loss, b_acc] = sess.run([model.c_loss, model.op_acc],
                                        feed_dict={ model.clf_input: X,
                                                    model.label_ref: Y,
                                                    tf.keras.backend.learning_phase(): 0})
            c_loss_f = c_loss_f + b_c_loss
            acc_f = acc_f + b_acc

        c_loss_t = c_loss_t / total_len
        c_loss_f = c_loss_f / total_len
        acc_t = acc_t / total_len
        acc_f = acc_f / total_len
        print('[Testing][learning_phase=1] Epoch: {:d}/{:d} - loss: {:.3f} - acc: {:.3f}'.format(epoch+1, num_epoch, c_loss_t, acc_t) )
        print('[Testing][learning_phase=0] Epoch: {:d}/{:d} - loss: {:.3f} - acc: {:.3f}'.format(epoch+1, num_epoch, c_loss_f, acc_f) )

The first part of testing case is training model with high leval API and the result is as expected.

450/450 - 39s - loss: 1.9658 - accuracy: 0.2993 - val_loss: 1.7215 - val_accuracy: 0.3738
Epoch 2/5
450/450 - 28s - loss: 1.5722 - accuracy: 0.4334 - val_loss: 1.5897 - val_accuracy: 0.4152
Epoch 3/5
450/450 - 27s - loss: 1.3876 - accuracy: 0.4993 - val_loss: 1.4867 - val_accuracy: 0.4770
Epoch 4/5
450/450 - 28s - loss: 1.2564 - accuracy: 0.5477 - val_loss: 1.3498 - val_accuracy: 0.5060
Epoch 5/5
450/450 - 27s - loss: 1.1488 - accuracy: 0.5888 - val_loss: 1.3380 - val_accuracy: 0.5232
10000/10000 - 3s - loss: 1.3523 - accuracy: 0.5289

I got a strange loss and the ourput can be seen the following:

Make sure that we create a new model.
10000/10000 - 3s - loss: 10.2004 - accuracy: 0.1048
[Training]Epoch: 1/5 - loss: 2.288 - acc: 0.268
[Training]Epoch: 2/5 - loss: 1.513 - acc: 0.448
[Training]Epoch: 3/5 - loss: 1.285 - acc: 0.537
[Training]Epoch: 4/5 - loss: 1.426 - acc: 0.487
[Training]Epoch: 5/5 - loss: 1.306 - acc: 0.535
Show loss and accuracy with keras API
45000/45000 - 9s - loss: nan - accuracy: 0.1002
5000/5000 - 1s - loss: nan - accuracy: 0.0986
10000/10000 - 2s - loss: nan - accuracy: 0.1000
Show loss and accuracy with low level API
[Validation][learning_phase=1] Epoch: 1/1 - loss: 1.163 - acc: 0.585
[Validation][learning_phase=0] Epoch: 1/1 - loss: nan - acc: 0.099
[Testing][learning_phase=1] Epoch: 1/1 - loss: 1.179 - acc: 0.587
[Testing][learning_phase=0] Epoch: 1/1 - loss: nan - acc: 0.100

Obviously, after training custom model with low level API, the result would be wrong when setting tf.keras.backend.learning_phase(): 0

Also, the result from keras API is wrong too.

tf.keras.backend.learning_phase(): 0 may affect the behavior of tf.keras.layers.BatchNormalization() but I'm not sure whether this is root cause.

I have tried a small custom model without tf.keras.layers.BatchNormalization() for MNIST dataset and the result is normal.

The testcase for MNIST as shown in the following:

import tensorflow as tf
tf.compat.v1.disable_eager_execution()
#tf.compat.v1.disable_v2_behavior()

import numpy as np

batch_size = 100

def download_data():

    # get raw data
    (trainX, trainY), (testX, testY) = tf.keras.datasets.mnist.load_data()
    trainX = trainX.astype(np.float32)
    testX  = testX.astype(np.float32)

    # ont-hot
    trainY = tf.keras.utils.to_categorical(trainY, 10)
    testY  = tf.keras.utils.to_categorical(testY , 10)

    # get validation sets
    training_size = 55000
    validX = trainX[training_size:,:]
    validY = trainY[training_size:,:]

    trainX = trainX[:training_size,:]
    trainY = trainY[:training_size,:]

    # expand dimesion
    trainX = np.expand_dims(trainX, axis=3)
    validX = np.expand_dims(validX, axis=3)
    testX  = np.expand_dims(testX , axis=3)

    return trainX, trainY, validX, validY, testX, testY

def data_pipeline(dataX, dataY):

        dataset = tf.data.Dataset.from_tensor_slices( (dataX, dataY) )
        dataset = dataset.shuffle(batch_size * 8)
        dataset = dataset.repeat()
        dataset = dataset.batch(batch_size)
        dataset = dataset.prefetch(tf.data.experimental.AUTOTUNE)
        return dataset

class custom_model():
    def __init__(self):

        def Acc():
            acc = tf.keras.metrics.categorical_accuracy(label_ref, clf_out)
            return tf.math.reduce_mean(acc)

        def c_loss():
            loss = tf.keras.losses.categorical_crossentropy(label_ref, clf_out)
            loss = tf.math.reduce_mean(loss)
            return loss

        # declare variables
        self.init_op = tf.compat.v1.keras.initializers.he_normal()
        model_layers = [ tf.keras.layers.Conv2D(16, (3, 3), padding="same", activation="relu", kernel_initializer=self.init_op, name="clf/c1"),
                         tf.keras.layers.Conv2D(32, (3, 3), padding="same", activation="relu", kernel_initializer=self.init_op, name="clf/c2"),
                         tf.keras.layers.MaxPooling2D(pool_size=(2, 2), name="clf/p1"),
                         tf.keras.layers.Conv2D(32, (3, 3), padding="same", activation="relu", kernel_initializer=self.init_op, name="clf/c3"),
                         tf.keras.layers.Conv2D(64, (3, 3), padding="same", activation="relu", kernel_initializer=self.init_op, name="clf/c4"),
                         tf.keras.layers.MaxPooling2D(pool_size=(2, 2), name="clf/p2"),
                         tf.keras.layers.Flatten(name="clf/f1"),
                         tf.keras.layers.Dense(256, activation="relu", kernel_initializer=self.init_op, name="clf/d1"),
                         tf.keras.layers.Dense(10 , activation=None  , kernel_initializer=self.init_op, name="clf/d2"),
                         tf.keras.layers.Activation('softmax', name="clf/a1")
                        ]

        # clf_model
        clf_input = tf.keras.layers.Input(shape=(28,28,1 ), name="model/input")
        clf_out   = clf_input
        for ii in model_layers:
            clf_out = ii(clf_out)
        clf_model = tf.keras.models.Model(inputs=clf_input, outputs=clf_out, name='clf_model')
        clf_model.compile(loss='categorical_crossentropy', optimizer='Nadam', metrics=['accuracy'])


        label_ref = tf.keras.layers.Input(shape=(10,) , name='label_ref')
        clf_out = clf_model(clf_input)

        # using tf.keras.optimizers.Nadam would get error
        #optimizer = tf.keras.optimizers.Nadam(lr=0.0005)
        optimizer = tf.compat.v1.train.AdamOptimizer(learning_rate=0.01)
        self.train_op = optimizer.minimize(c_loss(), var_list=[clf_model.trainable_variables])

        self.clf_model = clf_model
        self.clf_input = clf_input
        self.label_ref = label_ref
        self.op_acc = Acc()
        self.c_loss = c_loss()

if __name__ == '__main__':

    # set GPU
    import os
    if os.environ.get("CUDA_VISIBLE_DEVICES") is None:
        os.environ["CUDA_VISIBLE_DEVICES"] = "0"

    # reset tf session
    tf.compat.v1.keras.backend.clear_session()
    gpu_options = tf.compat.v1.GPUOptions(allow_growth=True)
    sess = tf.compat.v1.Session(config=tf.compat.v1.ConfigProto(gpu_options=gpu_options))
    tf.compat.v1.keras.backend.set_session(sess) 

    # prepare data
    trainX, trainY, validX, validY, testX, testY = download_data()
    train_gen = data_pipeline(trainX, trainY)
    valid_gen = data_pipeline(validX, validY)
    test_gen = data_pipeline(testX, testY)

    # create a new model
    print('Make sure that we create a new model.')
    model = custom_model()
    sess.run(tf.compat.v1.global_variables_initializer())
    model.clf_model.evaluate(testX, testY, verbose=2, batch_size=batch_size)

    # train model
    num_epoch = 5
    total_len = trainY.shape[0] // batch_size
    tf_iter = tf.compat.v1.data.make_initializable_iterator(train_gen)
    tf_next = tf_iter.get_next()
    sess.run(tf_iter.initializer)
    for epoch in range(num_epoch):
        c_loss, acc = 0.0, 0.0
        for ii in range(total_len):
            X, Y = sess.run(tf_next)
            [b_c_loss, b_acc, _] = sess.run([model.c_loss, model.op_acc, model.train_op],
                                                feed_dict={ model.clf_input: X,
                                                            model.label_ref: Y,
                                                            tf.keras.backend.learning_phase(): 1})
            c_loss = c_loss + b_c_loss
            acc = acc + b_acc
        
        c_loss = c_loss / total_len
        acc = acc / total_len
        print('[Training]Epoch: {:d}/{:d} - loss: {:.3f} - acc: {:.3f}'.format(epoch+1, num_epoch, c_loss, acc) )

    print('Show loss and accuracy with keras API')
    model.clf_model.evaluate(trainX, trainY, verbose=2, batch_size=batch_size)
    model.clf_model.evaluate(validX, validY, verbose=2, batch_size=batch_size)
    model.clf_model.evaluate(testX, testY, verbose=2, batch_size=batch_size)

    print('Show loss and accuracy with low level API')
    # evaluate
    num_epoch = 1
    total_len = validY.shape[0] // batch_size
    tf_iter = tf.compat.v1.data.make_initializable_iterator(valid_gen)
    tf_next = tf_iter.get_next()
    sess.run(tf_iter.initializer)
    for epoch in range(num_epoch):
        c_loss_t, acc_t, c_loss_f, acc_f = 0.0, 0.0, 0.0, 0.0
        for ii in range(total_len):
            X, Y = sess.run(tf_next)
            [b_c_loss, b_acc] = sess.run([model.c_loss, model.op_acc],
                                        feed_dict={ model.clf_input: X,
                                                    model.label_ref: Y,
                                                    tf.keras.backend.learning_phase(): 1})
            c_loss_t = c_loss_t + b_c_loss
            acc_t = acc_t + b_acc

            [b_c_loss, b_acc] = sess.run([model.c_loss, model.op_acc],
                                        feed_dict={ model.clf_input: X,
                                                    model.label_ref: Y,
                                                    tf.keras.backend.learning_phase(): 0})
            c_loss_f = c_loss_f + b_c_loss
            acc_f = acc_f + b_acc

        c_loss_t = c_loss_t / total_len
        c_loss_f = c_loss_f / total_len
        acc_t = acc_t / total_len
        acc_f = acc_f / total_len
        print('[Validation][learning_phase=1] Epoch: {:d}/{:d} - loss: {:.3f} - acc: {:.3f}'.format(epoch+1, num_epoch, c_loss_t, acc_t) )
        print('[Validation][learning_phase=0] Epoch: {:d}/{:d} - loss: {:.3f} - acc: {:.3f}'.format(epoch+1, num_epoch, c_loss_f, acc_f) )

    # evaluate
    num_epoch = 1
    total_len = testY.shape[0] // batch_size
    tf_iter = tf.compat.v1.data.make_initializable_iterator(test_gen)
    tf_next = tf_iter.get_next()
    sess.run(tf_iter.initializer)
    for epoch in range(num_epoch):
        c_loss_t, acc_t, c_loss_f, acc_f = 0.0, 0.0, 0.0, 0.0
        for ii in range(total_len):
            X, Y = sess.run(tf_next)
            [b_c_loss, b_acc] = sess.run([model.c_loss, model.op_acc],
                                        feed_dict={ model.clf_input: X,
                                                    model.label_ref: Y,
                                                    tf.keras.backend.learning_phase(): 1})
            c_loss_t = c_loss_t + b_c_loss
            acc_t = acc_t + b_acc

            [b_c_loss, b_acc] = sess.run([model.c_loss, model.op_acc],
                                        feed_dict={ model.clf_input: X,
                                                    model.label_ref: Y,
                                                    tf.keras.backend.learning_phase(): 0})
            c_loss_f = c_loss_f + b_c_loss
            acc_f = acc_f + b_acc

        c_loss_t = c_loss_t / total_len
        c_loss_f = c_loss_f / total_len
        acc_t = acc_t / total_len
        acc_f = acc_f / total_len
        print('[Testing][learning_phase=1] Epoch: {:d}/{:d} - loss: {:.3f} - acc: {:.3f}'.format(epoch+1, num_epoch, c_loss_t, acc_t) )
        print('[Testing][learning_phase=0] Epoch: {:d}/{:d} - loss: {:.3f} - acc: {:.3f}'.format(epoch+1, num_epoch, c_loss_f, acc_f) )

Definitely, we got a very normal output:

Make sure that we create a new model.
10000/10000 - 1s - loss: 398.0696 - acc: 0.1151
[Training]Epoch: 1/5 - loss: 11.997 - acc: 0.558
[Training]Epoch: 2/5 - loss: 0.474 - acc: 0.849
[Training]Epoch: 3/5 - loss: 0.282 - acc: 0.914
[Training]Epoch: 4/5 - loss: 0.213 - acc: 0.935
[Training]Epoch: 5/5 - loss: 0.181 - acc: 0.945
Show loss and accuracy with keras API
55000/55000 - 1s - loss: 0.1555 - acc: 0.9535
5000/5000 - 0s - loss: 0.1501 - acc: 0.9584
10000/10000 - 0s - loss: 0.1687 - acc: 0.9539
Show loss and accuracy with low level API
[Validation][learning_phase=1] Epoch: 1/1 - loss: 0.150 - acc: 0.958
[Validation][learning_phase=0] Epoch: 1/1 - loss: 0.150 - acc: 0.958
[Testing][learning_phase=1] Epoch: 1/1 - loss: 0.169 - acc: 0.954
[Testing][learning_phase=0] Epoch: 1/1 - loss: 0.169 - acc: 0.954

Describe the expected behavior

It should work properly.

Code to reproduce the issue

Please see the section of Describe the current behavior

Other info / logs

skip ...

closed time in 2 months

CNOCycle

issue commenttensorflow/tensorflow

tf.keras.layers.BatchNormalization() may not work in tf=2.0 and eager model is disable

(Closing because we don't have the bandwidth to fix this)

CNOCycle

comment created time in 2 months

issue commenttensorflow/tensorflow

tf.keras.layers.BatchNormalization() may not work in tf=2.0 and eager model is disable

Hi @CNOCycle, really sorry about the delay! I've finally had a chance to take a look at this.

As far as I can tell, it's actually totally unrelated from the other referenced issues (about which layers are passed to keras applications). Instead, it seems like internal changes we've made to the implementation of learning_phase over time may have broken some usages of learning phase when constructing tf.compat.v1 sessions and running them.

Unfortunately, this is far enough from how we recommend people use Keras in TF2 that our team doesn't have the bandwidth to fix this. And, if we were to try fixing this it would have to wait until 2.3

You may be able to set learning phase explicitly and have it work by using keras.backend.set_learning_phase/keras.backend.learning_phase_scope (I'm not sure on that front). You may also be able to pass in training explicitly to the layers that use learning phase.

But, our overall recommendation would be: If you can't upgrade to 2.x style code and it's crucial for you to keep using sessions directly with keras layers, you should probably just use TensorFlow 1.x.

CNOCycle

comment created time in 2 months

issue commenttensorflow/tensorflow

Discrepancy in training step accuracy with Tensorflow 2.1

The next 2.2 release candidate will include this commit: https://github.com/tensorflow/tensorflow/commit/410852dbd24899e22f0020f9fdc9757f527dda55 (This fixes a bug where some keras code was accidentally referring to the wrong versions internally)

If you previously had to put layers=tf.keras.layers, this should fix your issue.

kunal4097

comment created time in 2 months

create barnchtomerk/tensorflow

branch : cherrypicks_OSWSP

created branch time in 2 months

PR opened tensorflow/tensorflow

Allow unused keys when a dict is passed to a single-input Functional …

…API model.

Ensure that the key mapping to the name of the Input is used during Model execution.

PiperOrigin-RevId: 301053395 Change-Id: I7f5bfffc3e034b064b3cd4129e07f000df11cb6b

+32 -13

0 comment

2 changed files

pr created time in 2 months

issue commenttensorflow/tensorflow

Porting codebase utilizing tf.slim to TF-2.0

Hi Draxdave@, we're sorry you've found this to be a painful experience! Here's some information that might be useful to you or others who come across this issue:

  1. That link you put is from the TF object detection library. The team that owns it is actively working on releasing the tf 2.x version of the object detection library + training scripts. If you want to use the training scripts for it that have already been released you need to use TF 1.x.

  2. Contrib is not packaged with Tensorflow 2.x. The design discussions around that decision can be found here https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md

  3. Due to it's design, TF slim does not work with eager execution enabled. That said, it has been made it available (with extremely light support) as a TF2 compatible library (as long as you call tf.compat.v1.disable_eager_execution()` in https://github.com/google-research/tf-slim

We would welcome community contributions that make tf-slim work with v2 behavior enabled.

calledbymountains

comment created time in 2 months

issue commenttensorflow/tensorflow

Optimizer clipvalue and clipnorm not working in Tensorflow 2.0

Hi Goldie, someone wants to know if a bug fix in 2.2 can be backported to 2.1.x. What's our policy on minor release versions?

Best, Tomer

On Sun, Mar 22, 2020 at 9:35 AM jlherren notifications@github.com wrote:

Oh man, I've just been debugging my NaN losses for over an hour now. How could this even go unnoticed...

Will this be fixed for 2.1.x as well or is 2.2 the only way to get this working again? I can't disable eager mode because then I lose the cuDNN implementation of LSTMs.

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/tensorflow/tensorflow/issues/33929#issuecomment-602236143, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAFFEBMU4OQKDECN2UCLUQTRIY45FANCNFSM4JICXMIA .

bloc97

comment created time in 2 months

issue commenttensorflow/tensorflow

Optimizer clipvalue and clipnorm not working in Tensorflow 2.0

Yes, norm clipping still worked correctly when eager execution was disabled in 2.0 and 2.1.

bloc97

comment created time in 3 months

IssuesEvent

issue closedtensorflow/addons

Addons Sparsemax gets different results than Contrib sparsemax

System information

  • TensorFlow version and how it was installed (source or binary): Nightly
  • TensorFlow-Addons version and how it was installed (source or binary): git commit 8bacf4d434974b8821a598aa4d1c9534a6581a5e

Describe the bug We've gotten a google-internal bug report that sparsemax implementation in tensorflow addons: https://github.com/tensorflow/addons/blob/8bacf4d434974b8821a598aa4d1c9534a6581a5e/tensorflow_addons/activations/sparsemax.py#L25

Produces different results than the sparsemax implementation that was in contrib: https://github.com/tensorflow/tensorflow/blob/23c218785eac5bfe737eec4f8081fd0ef8e0684d/tensorflow/contrib/sparsemax/python/ops/sparsemax.py#L30

For the user's model, the contrib implementation produced better results than the addons implementation.


Are these differences expected? Or is it possible that there is a bug in the addons sparsemax implementation?

closed time in 3 months

tomerk

issue commenttensorflow/addons

Addons Sparsemax gets different results than Contrib sparsemax

I don't think we've been able to get a reproducible example, so I'll close this issue for now. Will reopen if we get an example.

tomerk

comment created time in 3 months

pull request commenttensorflow/tensorflow

Add custom_objects in the function layer_test

I updated the description because I think the curly braces in the code snippets are causing issues with the merging.

@gbaned how do we restart the merging? Here is the code snippet I removed:

Usage:

testing_utils.layer_test(
    SyncBatchNormalization,
    # keras.layers.BatchNormalization,
    kwargs={
      'momentum': 0.9,
      'epsilon': 0.1,
      'gamma_regularizer': keras.regularizers.l2(0.01),
      'beta_regularizer': keras.regularizers.l2(0.01)
    },
    input_shape=(3, 4, 2),
    custom_objects={'SyncBatchNormalization': SyncBatchNormalization})
MokkeMeguru

comment created time in 3 months

Pull request review commenttensorflow/community

Single-Client Parameter Server Training

+# Single-client Parameter Server Training++| Status        | Proposed                                             |+:-------------- |:---------------------------------------------------- |+| **Author(s)** | Yuefeng Zhou (yuefengz@google.com), Rick Chao (rchao@google.com) |+| **Sponsor**   | Priya Gupta (priyag@google.com)                      |+| **Updated**   | 2018-03-06                                           |++## Background++Parameter server training is a very commonly used distributed training architecture. It is especially relevant to models with large embeddings, for training on large clusters of machines with CPUs, or when scalability is preferred over determinism. Its high-level idea is to create variables on parameter servers and in each step let workers take different training input, pull variable values from parameter servers, compute gradients and send them to parameter servers.+++### Distribution Strategy++Distribution Strategy (`tf.distribute.Strategy`) is a library that aims to allow users to write simple model code and scale up their models automatically with decent out-of-box performance. We will design parameter server training under the umbrella of `tf.distribute.Strategy` in TensorFlow 2 with a single-client architecture, in contrast to the multi-client approach which is traditionally done in TensorFlow distributed training such as `tf.estimator.Estimator` or `tf.distribute.experimental.MultiWorkerMirroredStrategy`.++Distribution Strategy’s [custom training loop](https://www.tensorflow.org/tutorials/distribute/custom_training) (CTL) API has been popular among users who want  more control in writing their training loops. User community of this API is large. We would like to focus on supporting CTL API first and later abstract out a commonly used pattern for Keras `compile`/`fit` API.+++### Single-Client Distributed Training++We believe that a single-client architecture can provide a simpler programming model than multi-client setup. A single source of truth can avoid bugs due to inconsistencies in multi-client setup. Furthermore, a single source of control can enable more determinism. In extreme cases, it can launch long-running tasks and turn into multi-client effectively.+++## Goal++The goal of this project is to support multi-worker asynchronous training with `ParameterServerStrategy` and CTL API, and in the long term also Keras `model.fit()`. In the first stage of this project, we focus more on design ideas rather than the APIs.++The goal of this document is to discuss the high-level design and challenges of various pieces needed to support single-client parameter server training. Detailed designs for some pieces may be directed to other documents.+++## Overview+++### Programming Model++With a single-client architecture, the programming model will be different than the multi-client architecture. All workers and parameter servers are standard TensorFlow servers, and the user-defined program will run on the client only. Generally, no matter what high-level APIs users will use, the workflow for running a step function distributedly with single-client approach includes the following steps:+++1. Connect to all remote workers and parameter servers. +2. Create variables on parameter servers and hold references to them.+3. Create datasets and iterators on workers.+4. Create the replica function that takes an iterator as input, trace it and register it on all workers.  Note: a function may create variables as well. If not specified, they will be created on parameter servers as well.+5. Dispatch the step function on one available worker.+6. Repeat 5 until the end of epoch.+7. Repeat 5 - 6 until the stop criteria is reached.+++### Interfaces++One of our goals is to make `ParameterServerStrategy`’s API consistent with other strategies so that users can easily switch between them. This may be challenging due to the fundamental difference between synchronous and asynchronous training. Therefore, we try to use most of the Distribution Strategy’ APIs but occasionally we will create APIs that still make sense in other strategies as well.++**Note: all proposed APIs in this document are tentative and only valid for our first version. We will revisit them once we get enough feedback from users and the community.**+++#### Constraints++Function is first-class citizen. Users should only schedule functions instead of running individual ops, in addition to creating variables.++Users can occasionally run individual ops on the client, only for reporting purposes such as printing a metric’s value.+++#### Schedule/Join Primitives++The `strategy.run` API was initially developed for synchronous training. We propose a new pair of primitives to+++*   hide the details of load-balancing, fault tolerance and dynamic scheduling+*   expose the non-blocking semantics to users.++To enable scheduling a function on any worker, we recommend users create the same dataset, but may shuffle differently, on all workers via `strategy.experimental_distribute_datasets_from_function` API.+++```python+class ParameterServerStrategyV2:+  +  def schedule(self, replica_fn, args=(), kwargs=()):+    """Schedule the `replica_fn` on all replicas in a sync group (a worker).+   +    Schedule the `replica_fn` on all replicas in a sync group that is available,    +    returns a future of PerReplica immediately if `function` has return values.++    It implements at-least-once semantics for function execution. If a worker+    fails, it will try to reschedule the function on another replica group or throw     +    an exception to users. So this method assumes that function execution can be +    out of order and function inputs are shared between sync groups. ++    We don't support the cases where `args` or `kwargs` are bound to a specific+    sync group. We will consider supporting them in the future.++    If there are barriers in `replica_fn`, it is users' responsibility to make+    sure they won't cause deadlock.+    """+    pass+  +  def join(self):+    """Wait until all scheduled functions are finished.++    Raises an error if any of the functions fails to execute. In this case,+    there is no guarantee that non-failing functions will complete.++    When join() is being called, it is not allowed to call `schedule`.+    """+    pass++  def done(self):+    """Returns True if there is no pending functions to be executed."""+    pass++  def local_results(self, future_list):+    """Get concrete values of the future list.+    +    Poisoned future objects will give `None`.+    """+    pass+```+++#### Custom Training Loop++To construct a custom training loop, users need to+++*   use `strategy.experimental_distribute_datasets_from_function` to create one dataset per worker. At this point, we recommend against using `strategy.experimental_distribute_dataset`.+*   create models under `strategy.scope` so variables will be assigned to parameter servers.+*   likewise, create a Keras metric object under `strategy.scope`. We expect the metric variables to be stored on parameter servers. Each worker, within their `replica_fn`, updates the metric states.+*   use `strategy.schedule` to schedule the `replica_fn` on one remote worker. This `replica_fn` should take an iterator and perform forward and backward computation. This `strategy.schedule` will only schedule this `replica_fn` and returns one or several `Future` objects immediately.  +*   use `strategy.local_results` to get concrete values of results returned by `strategy.schedule`. This may be a blocking call if the result is not yet ready. With any failure that cannot be handled will be ignored and as a result some of the results may be `None`.+*   call `strategy.join` to wait until all scheduled functions are executed.++```python+# Connect to remote servers with a user-provided `ClusterResolver` object.+strategy = ParameterServerStrategyV2(cluster_resolver)++dataset_fn = # a function that returns a dataset++# Clone the dataset on all workers, shuffled with different seeds.+distributed_dataset = strategy.experimental_distribute_datasets_from_function(+      dataset_fn)++with strategy.scope():+  # Create variables on parameter servers in a round-robin fashion.+  model = create_model()+  optimizer = tf.keras.optimizers.Adam()+  accuracy = tf.keras.metrics.CategoricalAccuracy(name="train_accuracy")++  @tf.function+  def replica_fn(iterator):+    x, y = next(iterator)+    with tf.GradientTape() as tape:+      predictions = model(x, table, training=True)+      loss = compute_loss(y, predictions)+    gradients = tape.gradient(loss, model.trainable_variables)+    optimizer.apply_gradients(zip(gradients, model.trainable_variables))+    accuracy.update_state(y, predictions)+    return loss++  for _ in range(num_epoches):+    distributed_iter = iter(distributed_dataset)+    for i in range(steps_per_epoch):+      # strategy.schedule pushes a closure in the scheduling queue and+      # returns a list of future objects immediately.+      loss = strategy.schedule(replica_fn,+                               args=(distributed_iter,))+    strategy.join()+    model.save()  # save checkpoint/summary...+    print ("Loss = %f, accuracy = %f" % (+        strategy.local_results(loss), accuracy.result()))+```+++##### Alternative training loop: fully async++Another option from calling `join` every epoch, users can choose to schedule all steps and then asynchronously print metric values. This option doesn’t require any synchronization in epoch boundaries.+++```python+# … omitted+with strategy.scope():+  # … omitted+  for _ in range(total_steps)):+    strategy.schedule(step_fn, args=(iterators,))++  # Print accuracy value every one minute.+  while not strategy.done():+    print("Current accuracy: %f" % accuracy.result())+    time.sleep(60)+# … omitted+```++#### Error Reporting From `replica_fn`++Because of the non-blocking `schedule`, any exception raised in `replica_fn` wouldn’t be returned to users immediately. Actually an exception may pollute arbitrary number of functions in flight following the culprit function. We will set the error in returned `Future` objects for the culprit function and these polluted functions and we will raise exceptions when `join` is called.++Therefore the best practice for users is to avoid writing any code that may raise in `replica_fn`:++*   use repeated dataset so `OutOfRangeError` will be avoided;+*   avoid using assertion ops or some debugging ops like `tf.debugging.check_numerics`.+++#### Dataset Interface++The traditional training loop of `tf.distribute` passes the `get_next` results of a distributed iterator to `replica_fn`:+++```python+for x in enumerate(distributed_iter):+  loss = strategy.schedule(replica_fn, x, y)+```+++If we do the same thing with the `strategy.schedule` API, there are several challenges.++The first challenge is we don’t know which worker the `get_next` should return to since where the `replica_fn` will be executed will be decided later. Some later-binding mechanism can be explored.++The second challenge is calling `get_next` on an iterator is synchronous. This means that the training loop is not truly asynchronous. It is tricky to make `get_next` asynchronous because the client doesn’t know how many items will be in the iterator and thus doesn’t know how many functions to schedule. +++##### Alternative: passing iterators to `strategy.schedule`++The following training loop is less consistent with other `tf.distribute` examples but is easier to implement in the short term. It requires users to explicitly set a number of steps.+++```python+# … omitted+with strategy.scope():+  # … omitted+  distributed_iter = iter(distributed_dataset)+  for i in range(total_steps):+    strategy.schedule(replica_fn, args=(distributed_iter,))+# … omitted+```+++**We will start with this kind of training loop in our first version.**+++### Fault Tolerance+++#### Task Failure+++##### Worker failure+++###### When scheduling++When a worker fails, our training will continue without this failed worker. Functions scheduled on a failed worker will be rescheduled on other workers. ++For functions that bound to a specific worker, e.g. resource creation function, they will be queued until the worker is back.++When the failed worker is back, we will update the cluster configuration with `context.update_server_def` which would also reset all the states. After resources on the restarted worker are built, we can resume scheduling functions on the worker. +++###### When materialing a `Future` object++It is possible that a function is executed but its corresponding worker fails when users try to consume its output. In this case, we will give users a `None` value and set an error in the `Future` object.++We can mitigate the problem by eagerly materializing function outputs when they are passed to `local_results`. ++We can explore mechanisms to recover these objects in the future. In the short-term, users can choose to write the results to variables on parameter servers, just like a Keras metric.+++##### Parameter server failure++When a parameter server fails, the error will be propagated to the client via workers. Since the latest values of variables on the failed parameter servers are gone, there is no way for the client to recover them. Therefore the training will pause until the failed parameter server is back. The client then needs to clean up other variables on other parameter servers, rebuild all the variables and load variable values from a checkpoint. To trigger this process, the simplest method is to restart the client as well.+++##### Client failure++When a client fails, some scheduled functions will continue to run on workers. No new functions will be scheduled. When the client comes back, it will create variables, load from a checkpoint, schedule functions with a new context id. All the old variables will be garbage-collected when we reset their eager contexts.++#### Resource Management for Workers++When a worker has recovered from failure, we will need to rebuild iterators, worker-local variables, lookup tables and other resources on that worker that don’t need to be read from a checkpoint. This means that the client will have to keep track of these iterators, worker-local variables and other resources.++Keeping track of resources and rebuilding them will be achieved depending how users create their resources:+++*   we will record iterators created via `tf.distribute`’s API; The state of a rebuilt iterator will be lost. We can recover their states as future work.+*   we will capture the creation of worker-local variables via variable creator scopes.+*   in the future we will provide users an API to create worker-local resources. We will capture these resources in the API.++If users create iterators or other resources inside a function but don’t expose them as outputs, we will not rebuild them.+++#### The Unknown of Scheduled Functions++For functions that have been scheduled, it is difficult for the client to know whether they have actually been executed or not when the client detects their corresponding worker failure. Therefore, in addition to inform users of this uncertainty in the case of worker failure, we should do the following to reduce this uncertainty:+++*   keep the number of scheduled but not executed functions small. This may be difficult to achieve since there is not an easy way for the client to know whether a function is executed or not. The only way is to synchronize the executor. Therefore, as a workaround we will have to periodically synchronize the executor to make sure functions are actually executed, before the client schedules more functions. In the long run, we should get acknowledgement from runtime about how many functions have been executed.+*   eagerly fetch the outputs of remote functions once the outputs are passed to `strategy.local_result`. In this way, we can know the status of function execution earlier.+*   recommend users schedule only small functions. Large functions are more expensive to retry.+++### Evaluation++Historically, `tf.estimator.Estimator` uses a dedicated evaluator that periodically loads from a checkpoint, and performs evaluation with evaluation data. However `tf.keras` typically evaluates in an alternating manner after every epoch of training, and this is also the case with `tf.keras` + `MultiWorkerMirroredStrategy`. ++With `ParameterServerStrategyV2`, we will start with a dedicated** evaluator that runs alongside the training cluster**, **aka “sidecar evaluation”**; in this scheme, training client is required to generate checkpoints periodically, and the evaluator reads the latest checkpoint as it becomes available. The evaluation is asynchronous to the training progress. With this we provide the functionality Estimator has been able to with Keras API, which is important to attract updates from Estimator users to TF 2.0.++With our recommendation, users should create a separate evaluation client that runs the same python binary as the training client. This python binary will contain the if-else clause as it bifurcates into two paths:+++```python+if cluster_resolver.task_type == "chief":+  run_training_loop()+elif cluster_resolver.task_type == "evaluator":+  run_evaluation_loop()+```+++Evaluation code:+++```python+def run_evaluation_loop(...):+  """Run the example custom evaluation loop."""+  +  eval_dataset, model, eval_accuracy = ...+  checkpoint = tf.train.Checkpoint(model=model)++  @tf.function+  def eval_fn(eval_dataset):+    for _ in range(eval_steps):+      # evaluation pass+    return eval_accuracy.result()++  while True:

There's definitely room for non-parameter-server-specific utilities here for side car evaluation. Plenty of users (e.g. object detection) want to use dedicated evaluation jobs with mirroredstrategy, tpustrategy, & others.

yuefengz

comment created time in 3 months

Pull request review commenttensorflow/community

Single-Client Parameter Server Training

+# Single-client Parameter Server Training++| Status        | Proposed                                             |+:-------------- |:---------------------------------------------------- |+| **Author(s)** | Yuefeng Zhou (yuefengz@google.com), Rick Chao (rchao@google.com) |+| **Sponsor**   | Priya Gupta (priyag@google.com)                      |+| **Updated**   | 2018-03-06                                           |++## Background++Parameter server training is a very commonly used distributed training architecture. It is especially relevant to models with large embeddings, for training on large clusters of machines with CPUs, or when scalability is preferred over determinism. Its high-level idea is to create variables on parameter servers and in each step let workers take different training input, pull variable values from parameter servers, compute gradients and send them to parameter servers.+++### Distribution Strategy++Distribution Strategy (`tf.distribute.Strategy`) is a library that aims to allow users to write simple model code and scale up their models automatically with decent out-of-box performance. We will design parameter server training under the umbrella of `tf.distribute.Strategy` in TensorFlow 2 with a single-client architecture, in contrast to the multi-client approach which is traditionally done in TensorFlow distributed training such as `tf.estimator.Estimator` or `tf.distribute.experimental.MultiWorkerMirroredStrategy`.++Distribution Strategy’s [custom training loop](https://www.tensorflow.org/tutorials/distribute/custom_training) (CTL) API has been popular among users who want  more control in writing their training loops. User community of this API is large. We would like to focus on supporting CTL API first and later abstract out a commonly used pattern for Keras `compile`/`fit` API.+++### Single-Client Distributed Training++We believe that a single-client architecture can provide a simpler programming model than multi-client setup. A single source of truth can avoid bugs due to inconsistencies in multi-client setup. Furthermore, a single source of control can enable more determinism. In extreme cases, it can launch long-running tasks and turn into multi-client effectively.+++## Goal++The goal of this project is to support multi-worker asynchronous training with `ParameterServerStrategy` and CTL API, and in the long term also Keras `model.fit()`. In the first stage of this project, we focus more on design ideas rather than the APIs.++The goal of this document is to discuss the high-level design and challenges of various pieces needed to support single-client parameter server training. Detailed designs for some pieces may be directed to other documents.+++## Overview+++### Programming Model++With a single-client architecture, the programming model will be different than the multi-client architecture. All workers and parameter servers are standard TensorFlow servers, and the user-defined program will run on the client only. Generally, no matter what high-level APIs users will use, the workflow for running a step function distributedly with single-client approach includes the following steps:+++1. Connect to all remote workers and parameter servers. +2. Create variables on parameter servers and hold references to them.+3. Create datasets and iterators on workers.+4. Create the replica function that takes an iterator as input, trace it and register it on all workers.  Note: a function may create variables as well. If not specified, they will be created on parameter servers as well.+5. Dispatch the step function on one available worker.+6. Repeat 5 until the end of epoch.+7. Repeat 5 - 6 until the stop criteria is reached.+++### Interfaces++One of our goals is to make `ParameterServerStrategy`’s API consistent with other strategies so that users can easily switch between them. This may be challenging due to the fundamental difference between synchronous and asynchronous training. Therefore, we try to use most of the Distribution Strategy’ APIs but occasionally we will create APIs that still make sense in other strategies as well.++**Note: all proposed APIs in this document are tentative and only valid for our first version. We will revisit them once we get enough feedback from users and the community.**+++#### Constraints++Function is first-class citizen. Users should only schedule functions instead of running individual ops, in addition to creating variables.++Users can occasionally run individual ops on the client, only for reporting purposes such as printing a metric’s value.+++#### Schedule/Join Primitives++The `strategy.run` API was initially developed for synchronous training. We propose a new pair of primitives to+++*   hide the details of load-balancing, fault tolerance and dynamic scheduling+*   expose the non-blocking semantics to users.++To enable scheduling a function on any worker, we recommend users create the same dataset, but may shuffle differently, on all workers via `strategy.experimental_distribute_datasets_from_function` API.+++```python+class ParameterServerStrategyV2:+  +  def schedule(self, replica_fn, args=(), kwargs=()):+    """Schedule the `replica_fn` on all replicas in a sync group (a worker).+   +    Schedule the `replica_fn` on all replicas in a sync group that is available,    +    returns a future of PerReplica immediately if `function` has return values.++    It implements at-least-once semantics for function execution. If a worker+    fails, it will try to reschedule the function on another replica group or throw     +    an exception to users. So this method assumes that function execution can be +    out of order and function inputs are shared between sync groups. ++    We don't support the cases where `args` or `kwargs` are bound to a specific+    sync group. We will consider supporting them in the future.++    If there are barriers in `replica_fn`, it is users' responsibility to make+    sure they won't cause deadlock.+    """+    pass+  +  def join(self):+    """Wait until all scheduled functions are finished.++    Raises an error if any of the functions fails to execute. In this case,+    there is no guarantee that non-failing functions will complete.++    When join() is being called, it is not allowed to call `schedule`.+    """+    pass++  def done(self):+    """Returns True if there is no pending functions to be executed."""+    pass++  def local_results(self, future_list):+    """Get concrete values of the future list.+    +    Poisoned future objects will give `None`.+    """+    pass+```+++#### Custom Training Loop++To construct a custom training loop, users need to+++*   use `strategy.experimental_distribute_datasets_from_function` to create one dataset per worker. At this point, we recommend against using `strategy.experimental_distribute_dataset`.+*   create models under `strategy.scope` so variables will be assigned to parameter servers.+*   likewise, create a Keras metric object under `strategy.scope`. We expect the metric variables to be stored on parameter servers. Each worker, within their `replica_fn`, updates the metric states.+*   use `strategy.schedule` to schedule the `replica_fn` on one remote worker. This `replica_fn` should take an iterator and perform forward and backward computation. This `strategy.schedule` will only schedule this `replica_fn` and returns one or several `Future` objects immediately.  +*   use `strategy.local_results` to get concrete values of results returned by `strategy.schedule`. This may be a blocking call if the result is not yet ready. With any failure that cannot be handled will be ignored and as a result some of the results may be `None`.+*   call `strategy.join` to wait until all scheduled functions are executed.++```python+# Connect to remote servers with a user-provided `ClusterResolver` object.+strategy = ParameterServerStrategyV2(cluster_resolver)++dataset_fn = # a function that returns a dataset++# Clone the dataset on all workers, shuffled with different seeds.+distributed_dataset = strategy.experimental_distribute_datasets_from_function(+      dataset_fn)++with strategy.scope():+  # Create variables on parameter servers in a round-robin fashion.+  model = create_model()+  optimizer = tf.keras.optimizers.Adam()+  accuracy = tf.keras.metrics.CategoricalAccuracy(name="train_accuracy")++  @tf.function+  def replica_fn(iterator):+    x, y = next(iterator)+    with tf.GradientTape() as tape:+      predictions = model(x, table, training=True)+      loss = compute_loss(y, predictions)+    gradients = tape.gradient(loss, model.trainable_variables)+    optimizer.apply_gradients(zip(gradients, model.trainable_variables))+    accuracy.update_state(y, predictions)+    return loss++  for _ in range(num_epoches):+    distributed_iter = iter(distributed_dataset)+    for i in range(steps_per_epoch):+      # strategy.schedule pushes a closure in the scheduling queue and+      # returns a list of future objects immediately.+      loss = strategy.schedule(replica_fn,+                               args=(distributed_iter,))+    strategy.join()+    model.save()  # save checkpoint/summary...+    print ("Loss = %f, accuracy = %f" % (+        strategy.local_results(loss), accuracy.result()))+```+++##### Alternative training loop: fully async++Another option from calling `join` every epoch, users can choose to schedule all steps and then asynchronously print metric values. This option doesn’t require any synchronization in epoch boundaries.+++```python+# … omitted+with strategy.scope():+  # … omitted+  for _ in range(total_steps)):+    strategy.schedule(step_fn, args=(iterators,))++  # Print accuracy value every one minute.+  while not strategy.done():+    print("Current accuracy: %f" % accuracy.result())+    time.sleep(60)+# … omitted+```++#### Error Reporting From `replica_fn`++Because of the non-blocking `schedule`, any exception raised in `replica_fn` wouldn’t be returned to users immediately. Actually an exception may pollute arbitrary number of functions in flight following the culprit function. We will set the error in returned `Future` objects for the culprit function and these polluted functions and we will raise exceptions when `join` is called.++Therefore the best practice for users is to avoid writing any code that may raise in `replica_fn`:++*   use repeated dataset so `OutOfRangeError` will be avoided;+*   avoid using assertion ops or some debugging ops like `tf.debugging.check_numerics`.+++#### Dataset Interface++The traditional training loop of `tf.distribute` passes the `get_next` results of a distributed iterator to `replica_fn`:+++```python+for x in enumerate(distributed_iter):+  loss = strategy.schedule(replica_fn, x, y)+```+++If we do the same thing with the `strategy.schedule` API, there are several challenges.++The first challenge is we don’t know which worker the `get_next` should return to since where the `replica_fn` will be executed will be decided later. Some later-binding mechanism can be explored.++The second challenge is calling `get_next` on an iterator is synchronous. This means that the training loop is not truly asynchronous. It is tricky to make `get_next` asynchronous because the client doesn’t know how many items will be in the iterator and thus doesn’t know how many functions to schedule. +++##### Alternative: passing iterators to `strategy.schedule`++The following training loop is less consistent with other `tf.distribute` examples but is easier to implement in the short term. It requires users to explicitly set a number of steps.+++```python+# … omitted+with strategy.scope():+  # … omitted+  distributed_iter = iter(distributed_dataset)+  for i in range(total_steps):+    strategy.schedule(replica_fn, args=(distributed_iter,))+# … omitted+```+++**We will start with this kind of training loop in our first version.**+++### Fault Tolerance+++#### Task Failure+++##### Worker failure+++###### When scheduling++When a worker fails, our training will continue without this failed worker. Functions scheduled on a failed worker will be rescheduled on other workers. ++For functions that bound to a specific worker, e.g. resource creation function, they will be queued until the worker is back.++When the failed worker is back, we will update the cluster configuration with `context.update_server_def` which would also reset all the states. After resources on the restarted worker are built, we can resume scheduling functions on the worker. +++###### When materialing a `Future` object++It is possible that a function is executed but its corresponding worker fails when users try to consume its output. In this case, we will give users a `None` value and set an error in the `Future` object.++We can mitigate the problem by eagerly materializing function outputs when they are passed to `local_results`. ++We can explore mechanisms to recover these objects in the future. In the short-term, users can choose to write the results to variables on parameter servers, just like a Keras metric.+++##### Parameter server failure++When a parameter server fails, the error will be propagated to the client via workers. Since the latest values of variables on the failed parameter servers are gone, there is no way for the client to recover them. Therefore the training will pause until the failed parameter server is back. The client then needs to clean up other variables on other parameter servers, rebuild all the variables and load variable values from a checkpoint. To trigger this process, the simplest method is to restart the client as well.+++##### Client failure++When a client fails, some scheduled functions will continue to run on workers. No new functions will be scheduled. When the client comes back, it will create variables, load from a checkpoint, schedule functions with a new context id. All the old variables will be garbage-collected when we reset their eager contexts.++#### Resource Management for Workers++When a worker has recovered from failure, we will need to rebuild iterators, worker-local variables, lookup tables and other resources on that worker that don’t need to be read from a checkpoint. This means that the client will have to keep track of these iterators, worker-local variables and other resources.++Keeping track of resources and rebuilding them will be achieved depending how users create their resources:+++*   we will record iterators created via `tf.distribute`’s API; The state of a rebuilt iterator will be lost. We can recover their states as future work.+*   we will capture the creation of worker-local variables via variable creator scopes.+*   in the future we will provide users an API to create worker-local resources. We will capture these resources in the API.++If users create iterators or other resources inside a function but don’t expose them as outputs, we will not rebuild them.+++#### The Unknown of Scheduled Functions++For functions that have been scheduled, it is difficult for the client to know whether they have actually been executed or not when the client detects their corresponding worker failure. Therefore, in addition to inform users of this uncertainty in the case of worker failure, we should do the following to reduce this uncertainty:+++*   keep the number of scheduled but not executed functions small. This may be difficult to achieve since there is not an easy way for the client to know whether a function is executed or not. The only way is to synchronize the executor. Therefore, as a workaround we will have to periodically synchronize the executor to make sure functions are actually executed, before the client schedules more functions. In the long run, we should get acknowledgement from runtime about how many functions have been executed.+*   eagerly fetch the outputs of remote functions once the outputs are passed to `strategy.local_result`. In this way, we can know the status of function execution earlier.+*   recommend users schedule only small functions. Large functions are more expensive to retry.+++### Evaluation++Historically, `tf.estimator.Estimator` uses a dedicated evaluator that periodically loads from a checkpoint, and performs evaluation with evaluation data. However `tf.keras` typically evaluates in an alternating manner after every epoch of training, and this is also the case with `tf.keras` + `MultiWorkerMirroredStrategy`. ++With `ParameterServerStrategyV2`, we will start with a dedicated** evaluator that runs alongside the training cluster**, **aka “sidecar evaluation”**; in this scheme, training client is required to generate checkpoints periodically, and the evaluator reads the latest checkpoint as it becomes available. The evaluation is asynchronous to the training progress. With this we provide the functionality Estimator has been able to with Keras API, which is important to attract updates from Estimator users to TF 2.0.++With our recommendation, users should create a separate evaluation client that runs the same python binary as the training client. This python binary will contain the if-else clause as it bifurcates into two paths:+++```python+if cluster_resolver.task_type == "chief":+  run_training_loop()+elif cluster_resolver.task_type == "evaluator":+  run_evaluation_loop()+```+++Evaluation code:+++```python+def run_evaluation_loop(...):+  """Run the example custom evaluation loop."""+  +  eval_dataset, model, eval_accuracy = ...+  checkpoint = tf.train.Checkpoint(model=model)++  @tf.function+  def eval_fn(eval_dataset):+    for _ in range(eval_steps):+      # evaluation pass+    return eval_accuracy.result()++  while True:+    latest_checkpoint = get_new_checkpoint()+    checkpoint.restore(latest_checkpoint)+    eval_result = eval_fn(iterator)  # Users can print, early stop, mark ckpt..+```+++In the evaluation client, the user loads the checkpoints that were periodically saved into the model (by the training client), does evaluation over a full pass of eval dataset, and does whatever they want to do with eval results. Examples include exporting them to files which can be read by the training client for actions (such as reducing learning rate, early stopping, etc.)

So the training job and sidecar evaluation have two-way communication, all via the file system? Does this make sense?

It might since the training client can run asynchronously then somehow signal to workers, but it feels a little unwieldy...

How hard would it be to allow multiple clients/jobs connect to the same parameter server? Then they could communicate via reads/writes directly to the parameter server.

yuefengz

comment created time in 3 months

Pull request review commenttensorflow/community

Single-Client Parameter Server Training

+# Single-client Parameter Server Training++| Status        | Proposed                                             |+:-------------- |:---------------------------------------------------- |+| **Author(s)** | Yuefeng Zhou (yuefengz@google.com), Rick Chao (rchao@google.com) |+| **Sponsor**   | Priya Gupta (priyag@google.com)                      |+| **Updated**   | 2018-03-06                                           |++## Background++Parameter server training is a very commonly used distributed training architecture. It is especially relevant to models with large embeddings, for training on large clusters of machines with CPUs, or when scalability is preferred over determinism. Its high-level idea is to create variables on parameter servers and in each step let workers take different training input, pull variable values from parameter servers, compute gradients and send them to parameter servers.+++### Distribution Strategy++Distribution Strategy (`tf.distribute.Strategy`) is a library that aims to allow users to write simple model code and scale up their models automatically with decent out-of-box performance. We will design parameter server training under the umbrella of `tf.distribute.Strategy` in TensorFlow 2 with a single-client architecture, in contrast to the multi-client approach which is traditionally done in TensorFlow distributed training such as `tf.estimator.Estimator` or `tf.distribute.experimental.MultiWorkerMirroredStrategy`.++Distribution Strategy’s [custom training loop](https://www.tensorflow.org/tutorials/distribute/custom_training) (CTL) API has been popular among users who want  more control in writing their training loops. User community of this API is large. We would like to focus on supporting CTL API first and later abstract out a commonly used pattern for Keras `compile`/`fit` API.+++### Single-Client Distributed Training++We believe that a single-client architecture can provide a simpler programming model than multi-client setup. A single source of truth can avoid bugs due to inconsistencies in multi-client setup. Furthermore, a single source of control can enable more determinism. In extreme cases, it can launch long-running tasks and turn into multi-client effectively.+++## Goal++The goal of this project is to support multi-worker asynchronous training with `ParameterServerStrategy` and CTL API, and in the long term also Keras `model.fit()`. In the first stage of this project, we focus more on design ideas rather than the APIs.++The goal of this document is to discuss the high-level design and challenges of various pieces needed to support single-client parameter server training. Detailed designs for some pieces may be directed to other documents.+++## Overview+++### Programming Model++With a single-client architecture, the programming model will be different than the multi-client architecture. All workers and parameter servers are standard TensorFlow servers, and the user-defined program will run on the client only. Generally, no matter what high-level APIs users will use, the workflow for running a step function distributedly with single-client approach includes the following steps:+++1. Connect to all remote workers and parameter servers. +2. Create variables on parameter servers and hold references to them.+3. Create datasets and iterators on workers.+4. Create the replica function that takes an iterator as input, trace it and register it on all workers.  Note: a function may create variables as well. If not specified, they will be created on parameter servers as well.+5. Dispatch the step function on one available worker.+6. Repeat 5 until the end of epoch.+7. Repeat 5 - 6 until the stop criteria is reached.+++### Interfaces++One of our goals is to make `ParameterServerStrategy`’s API consistent with other strategies so that users can easily switch between them. This may be challenging due to the fundamental difference between synchronous and asynchronous training. Therefore, we try to use most of the Distribution Strategy’ APIs but occasionally we will create APIs that still make sense in other strategies as well.++**Note: all proposed APIs in this document are tentative and only valid for our first version. We will revisit them once we get enough feedback from users and the community.**+++#### Constraints++Function is first-class citizen. Users should only schedule functions instead of running individual ops, in addition to creating variables.++Users can occasionally run individual ops on the client, only for reporting purposes such as printing a metric’s value.+++#### Schedule/Join Primitives++The `strategy.run` API was initially developed for synchronous training. We propose a new pair of primitives to+++*   hide the details of load-balancing, fault tolerance and dynamic scheduling+*   expose the non-blocking semantics to users.++To enable scheduling a function on any worker, we recommend users create the same dataset, but may shuffle differently, on all workers via `strategy.experimental_distribute_datasets_from_function` API.+++```python+class ParameterServerStrategyV2:+  +  def schedule(self, replica_fn, args=(), kwargs=()):+    """Schedule the `replica_fn` on all replicas in a sync group (a worker).+   +    Schedule the `replica_fn` on all replicas in a sync group that is available,    +    returns a future of PerReplica immediately if `function` has return values.++    It implements at-least-once semantics for function execution. If a worker+    fails, it will try to reschedule the function on another replica group or throw     +    an exception to users. So this method assumes that function execution can be +    out of order and function inputs are shared between sync groups. ++    We don't support the cases where `args` or `kwargs` are bound to a specific+    sync group. We will consider supporting them in the future.++    If there are barriers in `replica_fn`, it is users' responsibility to make+    sure they won't cause deadlock.+    """+    pass+  +  def join(self):+    """Wait until all scheduled functions are finished.++    Raises an error if any of the functions fails to execute. In this case,+    there is no guarantee that non-failing functions will complete.++    When join() is being called, it is not allowed to call `schedule`.+    """+    pass++  def done(self):+    """Returns True if there is no pending functions to be executed."""+    pass++  def local_results(self, future_list):+    """Get concrete values of the future list.+    +    Poisoned future objects will give `None`.+    """+    pass+```+++#### Custom Training Loop++To construct a custom training loop, users need to+++*   use `strategy.experimental_distribute_datasets_from_function` to create one dataset per worker. At this point, we recommend against using `strategy.experimental_distribute_dataset`.+*   create models under `strategy.scope` so variables will be assigned to parameter servers.+*   likewise, create a Keras metric object under `strategy.scope`. We expect the metric variables to be stored on parameter servers. Each worker, within their `replica_fn`, updates the metric states.+*   use `strategy.schedule` to schedule the `replica_fn` on one remote worker. This `replica_fn` should take an iterator and perform forward and backward computation. This `strategy.schedule` will only schedule this `replica_fn` and returns one or several `Future` objects immediately.  +*   use `strategy.local_results` to get concrete values of results returned by `strategy.schedule`. This may be a blocking call if the result is not yet ready. With any failure that cannot be handled will be ignored and as a result some of the results may be `None`.+*   call `strategy.join` to wait until all scheduled functions are executed.++```python+# Connect to remote servers with a user-provided `ClusterResolver` object.+strategy = ParameterServerStrategyV2(cluster_resolver)++dataset_fn = # a function that returns a dataset++# Clone the dataset on all workers, shuffled with different seeds.+distributed_dataset = strategy.experimental_distribute_datasets_from_function(+      dataset_fn)++with strategy.scope():+  # Create variables on parameter servers in a round-robin fashion.+  model = create_model()+  optimizer = tf.keras.optimizers.Adam()+  accuracy = tf.keras.metrics.CategoricalAccuracy(name="train_accuracy")++  @tf.function+  def replica_fn(iterator):+    x, y = next(iterator)+    with tf.GradientTape() as tape:+      predictions = model(x, table, training=True)+      loss = compute_loss(y, predictions)+    gradients = tape.gradient(loss, model.trainable_variables)+    optimizer.apply_gradients(zip(gradients, model.trainable_variables))+    accuracy.update_state(y, predictions)+    return loss++  for _ in range(num_epoches):+    distributed_iter = iter(distributed_dataset)+    for i in range(steps_per_epoch):+      # strategy.schedule pushes a closure in the scheduling queue and+      # returns a list of future objects immediately.+      loss = strategy.schedule(replica_fn,+                               args=(distributed_iter,))+    strategy.join()+    model.save()  # save checkpoint/summary...+    print ("Loss = %f, accuracy = %f" % (+        strategy.local_results(loss), accuracy.result()))+```+++##### Alternative training loop: fully async++Another option from calling `join` every epoch, users can choose to schedule all steps and then asynchronously print metric values. This option doesn’t require any synchronization in epoch boundaries.+++```python+# … omitted+with strategy.scope():+  # … omitted+  for _ in range(total_steps)):+    strategy.schedule(step_fn, args=(iterators,))++  # Print accuracy value every one minute.+  while not strategy.done():+    print("Current accuracy: %f" % accuracy.result())+    time.sleep(60)+# … omitted+```++#### Error Reporting From `replica_fn`++Because of the non-blocking `schedule`, any exception raised in `replica_fn` wouldn’t be returned to users immediately. Actually an exception may pollute arbitrary number of functions in flight following the culprit function. We will set the error in returned `Future` objects for the culprit function and these polluted functions and we will raise exceptions when `join` is called.++Therefore the best practice for users is to avoid writing any code that may raise in `replica_fn`:++*   use repeated dataset so `OutOfRangeError` will be avoided;+*   avoid using assertion ops or some debugging ops like `tf.debugging.check_numerics`.+++#### Dataset Interface++The traditional training loop of `tf.distribute` passes the `get_next` results of a distributed iterator to `replica_fn`:+++```python+for x in enumerate(distributed_iter):+  loss = strategy.schedule(replica_fn, x, y)+```+++If we do the same thing with the `strategy.schedule` API, there are several challenges.++The first challenge is we don’t know which worker the `get_next` should return to since where the `replica_fn` will be executed will be decided later. Some later-binding mechanism can be explored.++The second challenge is calling `get_next` on an iterator is synchronous. This means that the training loop is not truly asynchronous. It is tricky to make `get_next` asynchronous because the client doesn’t know how many items will be in the iterator and thus doesn’t know how many functions to schedule. +++##### Alternative: passing iterators to `strategy.schedule`++The following training loop is less consistent with other `tf.distribute` examples but is easier to implement in the short term. It requires users to explicitly set a number of steps.+++```python+# … omitted+with strategy.scope():+  # … omitted+  distributed_iter = iter(distributed_dataset)+  for i in range(total_steps):+    strategy.schedule(replica_fn, args=(distributed_iter,))+# … omitted+```+++**We will start with this kind of training loop in our first version.**+++### Fault Tolerance+++#### Task Failure+++##### Worker failure+++###### When scheduling++When a worker fails, our training will continue without this failed worker. Functions scheduled on a failed worker will be rescheduled on other workers. ++For functions that bound to a specific worker, e.g. resource creation function, they will be queued until the worker is back.++When the failed worker is back, we will update the cluster configuration with `context.update_server_def` which would also reset all the states. After resources on the restarted worker are built, we can resume scheduling functions on the worker. +++###### When materialing a `Future` object++It is possible that a function is executed but its corresponding worker fails when users try to consume its output. In this case, we will give users a `None` value and set an error in the `Future` object.++We can mitigate the problem by eagerly materializing function outputs when they are passed to `local_results`. ++We can explore mechanisms to recover these objects in the future. In the short-term, users can choose to write the results to variables on parameter servers, just like a Keras metric.+++##### Parameter server failure++When a parameter server fails, the error will be propagated to the client via workers. Since the latest values of variables on the failed parameter servers are gone, there is no way for the client to recover them. Therefore the training will pause until the failed parameter server is back. The client then needs to clean up other variables on other parameter servers, rebuild all the variables and load variable values from a checkpoint. To trigger this process, the simplest method is to restart the client as well.+++##### Client failure++When a client fails, some scheduled functions will continue to run on workers. No new functions will be scheduled. When the client comes back, it will create variables, load from a checkpoint, schedule functions with a new context id. All the old variables will be garbage-collected when we reset their eager contexts.++#### Resource Management for Workers++When a worker has recovered from failure, we will need to rebuild iterators, worker-local variables, lookup tables and other resources on that worker that don’t need to be read from a checkpoint. This means that the client will have to keep track of these iterators, worker-local variables and other resources.++Keeping track of resources and rebuilding them will be achieved depending how users create their resources:+++*   we will record iterators created via `tf.distribute`’s API; The state of a rebuilt iterator will be lost. We can recover their states as future work.+*   we will capture the creation of worker-local variables via variable creator scopes.+*   in the future we will provide users an API to create worker-local resources. We will capture these resources in the API.++If users create iterators or other resources inside a function but don’t expose them as outputs, we will not rebuild them.+++#### The Unknown of Scheduled Functions++For functions that have been scheduled, it is difficult for the client to know whether they have actually been executed or not when the client detects their corresponding worker failure. Therefore, in addition to inform users of this uncertainty in the case of worker failure, we should do the following to reduce this uncertainty:+++*   keep the number of scheduled but not executed functions small. This may be difficult to achieve since there is not an easy way for the client to know whether a function is executed or not. The only way is to synchronize the executor. Therefore, as a workaround we will have to periodically synchronize the executor to make sure functions are actually executed, before the client schedules more functions. In the long run, we should get acknowledgement from runtime about how many functions have been executed.+*   eagerly fetch the outputs of remote functions once the outputs are passed to `strategy.local_result`. In this way, we can know the status of function execution earlier.+*   recommend users schedule only small functions. Large functions are more expensive to retry.+++### Evaluation++Historically, `tf.estimator.Estimator` uses a dedicated evaluator that periodically loads from a checkpoint, and performs evaluation with evaluation data. However `tf.keras` typically evaluates in an alternating manner after every epoch of training, and this is also the case with `tf.keras` + `MultiWorkerMirroredStrategy`. ++With `ParameterServerStrategyV2`, we will start with a dedicated** evaluator that runs alongside the training cluster**, **aka “sidecar evaluation”**; in this scheme, training client is required to generate checkpoints periodically, and the evaluator reads the latest checkpoint as it becomes available. The evaluation is asynchronous to the training progress. With this we provide the functionality Estimator has been able to with Keras API, which is important to attract updates from Estimator users to TF 2.0.++With our recommendation, users should create a separate evaluation client that runs the same python binary as the training client. This python binary will contain the if-else clause as it bifurcates into two paths:+++```python+if cluster_resolver.task_type == "chief":+  run_training_loop()+elif cluster_resolver.task_type == "evaluator":+  run_evaluation_loop()+```+++Evaluation code:+++```python+def run_evaluation_loop(...):+  """Run the example custom evaluation loop."""+  +  eval_dataset, model, eval_accuracy = ...+  checkpoint = tf.train.Checkpoint(model=model)++  @tf.function+  def eval_fn(eval_dataset):+    for _ in range(eval_steps):+      # evaluation pass+    return eval_accuracy.result()++  while True:+    latest_checkpoint = get_new_checkpoint()+    checkpoint.restore(latest_checkpoint)+    eval_result = eval_fn(iterator)  # Users can print, early stop, mark ckpt..+```+++In the evaluation client, the user loads the checkpoints that were periodically saved into the model (by the training client), does evaluation over a full pass of eval dataset, and does whatever they want to do with eval results. Examples include exporting them to files which can be read by the training client for actions (such as reducing learning rate, early stopping, etc.)+++## Implementation+++### Low-level Primitives++We can potentially expose them in the future when they are more stable and when we want to allow more advanced use cases.++We will have `Cluster` and `Worker` classes to encapsulate logic related to remote function scheduling.+++```python+class Cluster(object):++  def __init__(self, cluster_resolver, failure_handler=None):+    """Create the cluster instance and connect to the remote cluster."""+    pass++  @property+  def workers(self):+    """Return all available workers."""+    return self._workers++  def schedule(self, function, args=None, kwargs=None):+    """Schedule the function on one worker.++    It adds the function to the global scheduling queue and returns future+    objects immediately.+    """+    pass++  def join(self):+    """Block until all scheduled functions are complete."""+    pass+```+++We will probably merge this `Worker` with executors.+++```python+class Worker(object):++  def __init__(self,+               worker_job_name,+               cluster,+               max_scheduled_functions=100):+    """Create a scheduling queue and a thread that processes the queue."""+    pass++  def schedule(self, function, args=None, kwargs=None):+    """Schedule the function on the worker.++    It adds the function to the scheduling queue. It returns Future object+    immediately unless the scheduling queue is full.+    """+    pass++  def healthy(self):+    """Return a boolean indicating whether the worker is health or not."""+    pass++  def _set_dead(self):+    """Declare the worker is dead and poison all future objects."""+    pass++  def _rebuild_resources(self):+    """Rebuild worker-local resources when it is recovered from failure."""+    pass+```+++As we mentioned the return value of `schedule` will be `Future` objects if `function` has return values. The `Future` works as a container and will be later-binded with states of either success or complete failure. Overall, this `Future` class has the following benefits:++++*   It allows `schedule` method return immediately after pushing functions to its scheduling queue. It allows these methods to return without needing to wait for acknowledgement from workers.+*   It serves as the container for values or errors. It would be binded with a value or an error later. When it is rebuilt, we can replace its underlying value silently.+*   When being passed to `local_result`, we flag it to indicate that this value needs to be fetched eagerly.+*   (Future work) It captures the lineage between functions and return values so that we can rebuild any poisoned objects. ++```python+class Future(object):

So no on_complete asynchronous callbacks on futures? The clients check the value of futures exclusively by joining/polling?

yuefengz

comment created time in 3 months

Pull request review commenttensorflow/community

Single-Client Parameter Server Training

+# Single-client Parameter Server Training++| Status        | Proposed                                             |+:-------------- |:---------------------------------------------------- |+| **Author(s)** | Yuefeng Zhou (yuefengz@google.com), Rick Chao (rchao@google.com) |+| **Sponsor**   | Priya Gupta (priyag@google.com)                      |+| **Updated**   | 2018-03-06                                           |++## Background++Parameter server training is a very commonly used distributed training architecture. It is especially relevant to models with large embeddings, for training on large clusters of machines with CPUs, or when scalability is preferred over determinism. Its high-level idea is to create variables on parameter servers and in each step let workers take different training input, pull variable values from parameter servers, compute gradients and send them to parameter servers.+++### Distribution Strategy++Distribution Strategy (`tf.distribute.Strategy`) is a library that aims to allow users to write simple model code and scale up their models automatically with decent out-of-box performance. We will design parameter server training under the umbrella of `tf.distribute.Strategy` in TensorFlow 2 with a single-client architecture, in contrast to the multi-client approach which is traditionally done in TensorFlow distributed training such as `tf.estimator.Estimator` or `tf.distribute.experimental.MultiWorkerMirroredStrategy`.++Distribution Strategy’s [custom training loop](https://www.tensorflow.org/tutorials/distribute/custom_training) (CTL) API has been popular among users who want  more control in writing their training loops. User community of this API is large. We would like to focus on supporting CTL API first and later abstract out a commonly used pattern for Keras `compile`/`fit` API.+++### Single-Client Distributed Training++We believe that a single-client architecture can provide a simpler programming model than multi-client setup. A single source of truth can avoid bugs due to inconsistencies in multi-client setup. Furthermore, a single source of control can enable more determinism. In extreme cases, it can launch long-running tasks and turn into multi-client effectively.+++## Goal++The goal of this project is to support multi-worker asynchronous training with `ParameterServerStrategy` and CTL API, and in the long term also Keras `model.fit()`. In the first stage of this project, we focus more on design ideas rather than the APIs.++The goal of this document is to discuss the high-level design and challenges of various pieces needed to support single-client parameter server training. Detailed designs for some pieces may be directed to other documents.+++## Overview+++### Programming Model++With a single-client architecture, the programming model will be different than the multi-client architecture. All workers and parameter servers are standard TensorFlow servers, and the user-defined program will run on the client only. Generally, no matter what high-level APIs users will use, the workflow for running a step function distributedly with single-client approach includes the following steps:+++1. Connect to all remote workers and parameter servers. +2. Create variables on parameter servers and hold references to them.+3. Create datasets and iterators on workers.+4. Create the replica function that takes an iterator as input, trace it and register it on all workers.  Note: a function may create variables as well. If not specified, they will be created on parameter servers as well.+5. Dispatch the step function on one available worker.+6. Repeat 5 until the end of epoch.+7. Repeat 5 - 6 until the stop criteria is reached.+++### Interfaces++One of our goals is to make `ParameterServerStrategy`’s API consistent with other strategies so that users can easily switch between them. This may be challenging due to the fundamental difference between synchronous and asynchronous training. Therefore, we try to use most of the Distribution Strategy’ APIs but occasionally we will create APIs that still make sense in other strategies as well.++**Note: all proposed APIs in this document are tentative and only valid for our first version. We will revisit them once we get enough feedback from users and the community.**+++#### Constraints++Function is first-class citizen. Users should only schedule functions instead of running individual ops, in addition to creating variables.++Users can occasionally run individual ops on the client, only for reporting purposes such as printing a metric’s value.+++#### Schedule/Join Primitives++The `strategy.run` API was initially developed for synchronous training. We propose a new pair of primitives to+++*   hide the details of load-balancing, fault tolerance and dynamic scheduling+*   expose the non-blocking semantics to users.++To enable scheduling a function on any worker, we recommend users create the same dataset, but may shuffle differently, on all workers via `strategy.experimental_distribute_datasets_from_function` API.+++```python+class ParameterServerStrategyV2:+  +  def schedule(self, replica_fn, args=(), kwargs=()):+    """Schedule the `replica_fn` on all replicas in a sync group (a worker).+   +    Schedule the `replica_fn` on all replicas in a sync group that is available,    +    returns a future of PerReplica immediately if `function` has return values.++    It implements at-least-once semantics for function execution. If a worker+    fails, it will try to reschedule the function on another replica group or throw     +    an exception to users. So this method assumes that function execution can be +    out of order and function inputs are shared between sync groups. ++    We don't support the cases where `args` or `kwargs` are bound to a specific+    sync group. We will consider supporting them in the future.++    If there are barriers in `replica_fn`, it is users' responsibility to make+    sure they won't cause deadlock.+    """+    pass+  +  def join(self):+    """Wait until all scheduled functions are finished.++    Raises an error if any of the functions fails to execute. In this case,+    there is no guarantee that non-failing functions will complete.++    When join() is being called, it is not allowed to call `schedule`.+    """+    pass++  def done(self):+    """Returns True if there is no pending functions to be executed."""+    pass++  def local_results(self, future_list):+    """Get concrete values of the future list.+    +    Poisoned future objects will give `None`.+    """+    pass+```+++#### Custom Training Loop++To construct a custom training loop, users need to+++*   use `strategy.experimental_distribute_datasets_from_function` to create one dataset per worker. At this point, we recommend against using `strategy.experimental_distribute_dataset`.+*   create models under `strategy.scope` so variables will be assigned to parameter servers.+*   likewise, create a Keras metric object under `strategy.scope`. We expect the metric variables to be stored on parameter servers. Each worker, within their `replica_fn`, updates the metric states.+*   use `strategy.schedule` to schedule the `replica_fn` on one remote worker. This `replica_fn` should take an iterator and perform forward and backward computation. This `strategy.schedule` will only schedule this `replica_fn` and returns one or several `Future` objects immediately.  +*   use `strategy.local_results` to get concrete values of results returned by `strategy.schedule`. This may be a blocking call if the result is not yet ready. With any failure that cannot be handled will be ignored and as a result some of the results may be `None`.+*   call `strategy.join` to wait until all scheduled functions are executed.++```python+# Connect to remote servers with a user-provided `ClusterResolver` object.+strategy = ParameterServerStrategyV2(cluster_resolver)++dataset_fn = # a function that returns a dataset++# Clone the dataset on all workers, shuffled with different seeds.+distributed_dataset = strategy.experimental_distribute_datasets_from_function(+      dataset_fn)++with strategy.scope():+  # Create variables on parameter servers in a round-robin fashion.+  model = create_model()+  optimizer = tf.keras.optimizers.Adam()+  accuracy = tf.keras.metrics.CategoricalAccuracy(name="train_accuracy")++  @tf.function+  def replica_fn(iterator):+    x, y = next(iterator)+    with tf.GradientTape() as tape:+      predictions = model(x, table, training=True)+      loss = compute_loss(y, predictions)+    gradients = tape.gradient(loss, model.trainable_variables)+    optimizer.apply_gradients(zip(gradients, model.trainable_variables))+    accuracy.update_state(y, predictions)+    return loss++  for _ in range(num_epoches):+    distributed_iter = iter(distributed_dataset)+    for i in range(steps_per_epoch):+      # strategy.schedule pushes a closure in the scheduling queue and+      # returns a list of future objects immediately.+      loss = strategy.schedule(replica_fn,+                               args=(distributed_iter,))+    strategy.join()+    model.save()  # save checkpoint/summary...+    print ("Loss = %f, accuracy = %f" % (+        strategy.local_results(loss), accuracy.result()))+```+++##### Alternative training loop: fully async++Another option from calling `join` every epoch, users can choose to schedule all steps and then asynchronously print metric values. This option doesn’t require any synchronization in epoch boundaries.+++```python+# … omitted+with strategy.scope():+  # … omitted+  for _ in range(total_steps)):+    strategy.schedule(step_fn, args=(iterators,))++  # Print accuracy value every one minute.+  while not strategy.done():+    print("Current accuracy: %f" % accuracy.result())+    time.sleep(60)+# … omitted+```++#### Error Reporting From `replica_fn`++Because of the non-blocking `schedule`, any exception raised in `replica_fn` wouldn’t be returned to users immediately. Actually an exception may pollute arbitrary number of functions in flight following the culprit function. We will set the error in returned `Future` objects for the culprit function and these polluted functions and we will raise exceptions when `join` is called.++Therefore the best practice for users is to avoid writing any code that may raise in `replica_fn`:++*   use repeated dataset so `OutOfRangeError` will be avoided;+*   avoid using assertion ops or some debugging ops like `tf.debugging.check_numerics`.+++#### Dataset Interface++The traditional training loop of `tf.distribute` passes the `get_next` results of a distributed iterator to `replica_fn`:+++```python+for x in enumerate(distributed_iter):+  loss = strategy.schedule(replica_fn, x, y)+```+++If we do the same thing with the `strategy.schedule` API, there are several challenges.++The first challenge is we don’t know which worker the `get_next` should return to since where the `replica_fn` will be executed will be decided later. Some later-binding mechanism can be explored.++The second challenge is calling `get_next` on an iterator is synchronous. This means that the training loop is not truly asynchronous. It is tricky to make `get_next` asynchronous because the client doesn’t know how many items will be in the iterator and thus doesn’t know how many functions to schedule. +++##### Alternative: passing iterators to `strategy.schedule`++The following training loop is less consistent with other `tf.distribute` examples but is easier to implement in the short term. It requires users to explicitly set a number of steps.+++```python+# … omitted+with strategy.scope():+  # … omitted+  distributed_iter = iter(distributed_dataset)+  for i in range(total_steps):+    strategy.schedule(replica_fn, args=(distributed_iter,))+# … omitted+```+++**We will start with this kind of training loop in our first version.**+++### Fault Tolerance+++#### Task Failure+++##### Worker failure+++###### When scheduling++When a worker fails, our training will continue without this failed worker. Functions scheduled on a failed worker will be rescheduled on other workers. ++For functions that bound to a specific worker, e.g. resource creation function, they will be queued until the worker is back.++When the failed worker is back, we will update the cluster configuration with `context.update_server_def` which would also reset all the states. After resources on the restarted worker are built, we can resume scheduling functions on the worker. +++###### When materialing a `Future` object++It is possible that a function is executed but its corresponding worker fails when users try to consume its output. In this case, we will give users a `None` value and set an error in the `Future` object.++We can mitigate the problem by eagerly materializing function outputs when they are passed to `local_results`. ++We can explore mechanisms to recover these objects in the future. In the short-term, users can choose to write the results to variables on parameter servers, just like a Keras metric.+++##### Parameter server failure++When a parameter server fails, the error will be propagated to the client via workers. Since the latest values of variables on the failed parameter servers are gone, there is no way for the client to recover them. Therefore the training will pause until the failed parameter server is back. The client then needs to clean up other variables on other parameter servers, rebuild all the variables and load variable values from a checkpoint. To trigger this process, the simplest method is to restart the client as well.+++##### Client failure++When a client fails, some scheduled functions will continue to run on workers. No new functions will be scheduled. When the client comes back, it will create variables, load from a checkpoint, schedule functions with a new context id. All the old variables will be garbage-collected when we reset their eager contexts.++#### Resource Management for Workers++When a worker has recovered from failure, we will need to rebuild iterators, worker-local variables, lookup tables and other resources on that worker that don’t need to be read from a checkpoint. This means that the client will have to keep track of these iterators, worker-local variables and other resources.++Keeping track of resources and rebuilding them will be achieved depending how users create their resources:+++*   we will record iterators created via `tf.distribute`’s API; The state of a rebuilt iterator will be lost. We can recover their states as future work.+*   we will capture the creation of worker-local variables via variable creator scopes.+*   in the future we will provide users an API to create worker-local resources. We will capture these resources in the API.++If users create iterators or other resources inside a function but don’t expose them as outputs, we will not rebuild them.+++#### The Unknown of Scheduled Functions++For functions that have been scheduled, it is difficult for the client to know whether they have actually been executed or not when the client detects their corresponding worker failure. Therefore, in addition to inform users of this uncertainty in the case of worker failure, we should do the following to reduce this uncertainty:+++*   keep the number of scheduled but not executed functions small. This may be difficult to achieve since there is not an easy way for the client to know whether a function is executed or not. The only way is to synchronize the executor. Therefore, as a workaround we will have to periodically synchronize the executor to make sure functions are actually executed, before the client schedules more functions. In the long run, we should get acknowledgement from runtime about how many functions have been executed.+*   eagerly fetch the outputs of remote functions once the outputs are passed to `strategy.local_result`. In this way, we can know the status of function execution earlier.+*   recommend users schedule only small functions. Large functions are more expensive to retry.+++### Evaluation++Historically, `tf.estimator.Estimator` uses a dedicated evaluator that periodically loads from a checkpoint, and performs evaluation with evaluation data. However `tf.keras` typically evaluates in an alternating manner after every epoch of training, and this is also the case with `tf.keras` + `MultiWorkerMirroredStrategy`. ++With `ParameterServerStrategyV2`, we will start with a dedicated** evaluator that runs alongside the training cluster**, **aka “sidecar evaluation”**; in this scheme, training client is required to generate checkpoints periodically, and the evaluator reads the latest checkpoint as it becomes available. The evaluation is asynchronous to the training progress. With this we provide the functionality Estimator has been able to with Keras API, which is important to attract updates from Estimator users to TF 2.0.++With our recommendation, users should create a separate evaluation client that runs the same python binary as the training client. This python binary will contain the if-else clause as it bifurcates into two paths:+++```python+if cluster_resolver.task_type == "chief":+  run_training_loop()+elif cluster_resolver.task_type == "evaluator":+  run_evaluation_loop()+```+++Evaluation code:+++```python+def run_evaluation_loop(...):+  """Run the example custom evaluation loop."""+  +  eval_dataset, model, eval_accuracy = ...+  checkpoint = tf.train.Checkpoint(model=model)++  @tf.function+  def eval_fn(eval_dataset):+    for _ in range(eval_steps):+      # evaluation pass+    return eval_accuracy.result()++  while True:

The common patterns are: Wait until a checkpoint is present Evaluate Check if a new one is present, if not wait a while until it is Finish evaluating when no new checkpoint has been written for a while

With timeouts, meaningful error messages, etc.

yuefengz

comment created time in 3 months

Pull request review commenttensorflow/community

Single-Client Parameter Server Training

+# Single-client Parameter Server Training++| Status        | Proposed                                             |+:-------------- |:---------------------------------------------------- |+| **Author(s)** | Yuefeng Zhou (yuefengz@google.com), Rick Chao (rchao@google.com) |+| **Sponsor**   | Priya Gupta (priyag@google.com)                      |+| **Updated**   | 2018-03-06                                           |++## Background++Parameter server training is a very commonly used distributed training architecture. It is especially relevant to models with large embeddings, for training on large clusters of machines with CPUs, or when scalability is preferred over determinism. Its high-level idea is to create variables on parameter servers and in each step let workers take different training input, pull variable values from parameter servers, compute gradients and send them to parameter servers.+++### Distribution Strategy++Distribution Strategy (`tf.distribute.Strategy`) is a library that aims to allow users to write simple model code and scale up their models automatically with decent out-of-box performance. We will design parameter server training under the umbrella of `tf.distribute.Strategy` in TensorFlow 2 with a single-client architecture, in contrast to the multi-client approach which is traditionally done in TensorFlow distributed training such as `tf.estimator.Estimator` or `tf.distribute.experimental.MultiWorkerMirroredStrategy`.++Distribution Strategy’s [custom training loop](https://www.tensorflow.org/tutorials/distribute/custom_training) (CTL) API has been popular among users who want  more control in writing their training loops. User community of this API is large. We would like to focus on supporting CTL API first and later abstract out a commonly used pattern for Keras `compile`/`fit` API.+++### Single-Client Distributed Training++We believe that a single-client architecture can provide a simpler programming model than multi-client setup. A single source of truth can avoid bugs due to inconsistencies in multi-client setup. Furthermore, a single source of control can enable more determinism. In extreme cases, it can launch long-running tasks and turn into multi-client effectively.+++## Goal++The goal of this project is to support multi-worker asynchronous training with `ParameterServerStrategy` and CTL API, and in the long term also Keras `model.fit()`. In the first stage of this project, we focus more on design ideas rather than the APIs.++The goal of this document is to discuss the high-level design and challenges of various pieces needed to support single-client parameter server training. Detailed designs for some pieces may be directed to other documents.+++## Overview+++### Programming Model++With a single-client architecture, the programming model will be different than the multi-client architecture. All workers and parameter servers are standard TensorFlow servers, and the user-defined program will run on the client only. Generally, no matter what high-level APIs users will use, the workflow for running a step function distributedly with single-client approach includes the following steps:+++1. Connect to all remote workers and parameter servers. +2. Create variables on parameter servers and hold references to them.+3. Create datasets and iterators on workers.+4. Create the replica function that takes an iterator as input, trace it and register it on all workers.  Note: a function may create variables as well. If not specified, they will be created on parameter servers as well.+5. Dispatch the step function on one available worker.+6. Repeat 5 until the end of epoch.+7. Repeat 5 - 6 until the stop criteria is reached.+++### Interfaces++One of our goals is to make `ParameterServerStrategy`’s API consistent with other strategies so that users can easily switch between them. This may be challenging due to the fundamental difference between synchronous and asynchronous training. Therefore, we try to use most of the Distribution Strategy’ APIs but occasionally we will create APIs that still make sense in other strategies as well.++**Note: all proposed APIs in this document are tentative and only valid for our first version. We will revisit them once we get enough feedback from users and the community.**+++#### Constraints++Function is first-class citizen. Users should only schedule functions instead of running individual ops, in addition to creating variables.++Users can occasionally run individual ops on the client, only for reporting purposes such as printing a metric’s value.+++#### Schedule/Join Primitives++The `strategy.run` API was initially developed for synchronous training. We propose a new pair of primitives to+++*   hide the details of load-balancing, fault tolerance and dynamic scheduling+*   expose the non-blocking semantics to users.++To enable scheduling a function on any worker, we recommend users create the same dataset, but may shuffle differently, on all workers via `strategy.experimental_distribute_datasets_from_function` API.+++```python+class ParameterServerStrategyV2:+  +  def schedule(self, replica_fn, args=(), kwargs=()):+    """Schedule the `replica_fn` on all replicas in a sync group (a worker).+   +    Schedule the `replica_fn` on all replicas in a sync group that is available,    +    returns a future of PerReplica immediately if `function` has return values.++    It implements at-least-once semantics for function execution. If a worker+    fails, it will try to reschedule the function on another replica group or throw     +    an exception to users. So this method assumes that function execution can be +    out of order and function inputs are shared between sync groups. ++    We don't support the cases where `args` or `kwargs` are bound to a specific+    sync group. We will consider supporting them in the future.++    If there are barriers in `replica_fn`, it is users' responsibility to make+    sure they won't cause deadlock.+    """+    pass+  +  def join(self):+    """Wait until all scheduled functions are finished.++    Raises an error if any of the functions fails to execute. In this case,+    there is no guarantee that non-failing functions will complete.++    When join() is being called, it is not allowed to call `schedule`.+    """+    pass++  def done(self):+    """Returns True if there is no pending functions to be executed."""+    pass++  def local_results(self, future_list):+    """Get concrete values of the future list.+    +    Poisoned future objects will give `None`.+    """+    pass+```+++#### Custom Training Loop++To construct a custom training loop, users need to+++*   use `strategy.experimental_distribute_datasets_from_function` to create one dataset per worker. At this point, we recommend against using `strategy.experimental_distribute_dataset`.+*   create models under `strategy.scope` so variables will be assigned to parameter servers.+*   likewise, create a Keras metric object under `strategy.scope`. We expect the metric variables to be stored on parameter servers. Each worker, within their `replica_fn`, updates the metric states.+*   use `strategy.schedule` to schedule the `replica_fn` on one remote worker. This `replica_fn` should take an iterator and perform forward and backward computation. This `strategy.schedule` will only schedule this `replica_fn` and returns one or several `Future` objects immediately.  +*   use `strategy.local_results` to get concrete values of results returned by `strategy.schedule`. This may be a blocking call if the result is not yet ready. With any failure that cannot be handled will be ignored and as a result some of the results may be `None`.+*   call `strategy.join` to wait until all scheduled functions are executed.++```python+# Connect to remote servers with a user-provided `ClusterResolver` object.+strategy = ParameterServerStrategyV2(cluster_resolver)++dataset_fn = # a function that returns a dataset++# Clone the dataset on all workers, shuffled with different seeds.+distributed_dataset = strategy.experimental_distribute_datasets_from_function(+      dataset_fn)++with strategy.scope():+  # Create variables on parameter servers in a round-robin fashion.+  model = create_model()+  optimizer = tf.keras.optimizers.Adam()+  accuracy = tf.keras.metrics.CategoricalAccuracy(name="train_accuracy")++  @tf.function+  def replica_fn(iterator):+    x, y = next(iterator)+    with tf.GradientTape() as tape:+      predictions = model(x, table, training=True)+      loss = compute_loss(y, predictions)+    gradients = tape.gradient(loss, model.trainable_variables)+    optimizer.apply_gradients(zip(gradients, model.trainable_variables))+    accuracy.update_state(y, predictions)+    return loss++  for _ in range(num_epoches):+    distributed_iter = iter(distributed_dataset)+    for i in range(steps_per_epoch):+      # strategy.schedule pushes a closure in the scheduling queue and+      # returns a list of future objects immediately.+      loss = strategy.schedule(replica_fn,+                               args=(distributed_iter,))+    strategy.join()+    model.save()  # save checkpoint/summary...+    print ("Loss = %f, accuracy = %f" % (+        strategy.local_results(loss), accuracy.result()))+```+++##### Alternative training loop: fully async++Another option from calling `join` every epoch, users can choose to schedule all steps and then asynchronously print metric values. This option doesn’t require any synchronization in epoch boundaries.+++```python+# … omitted+with strategy.scope():+  # … omitted+  for _ in range(total_steps)):+    strategy.schedule(step_fn, args=(iterators,))++  # Print accuracy value every one minute.+  while not strategy.done():+    print("Current accuracy: %f" % accuracy.result())+    time.sleep(60)+# … omitted+```++#### Error Reporting From `replica_fn`++Because of the non-blocking `schedule`, any exception raised in `replica_fn` wouldn’t be returned to users immediately. Actually an exception may pollute arbitrary number of functions in flight following the culprit function. We will set the error in returned `Future` objects for the culprit function and these polluted functions and we will raise exceptions when `join` is called.++Therefore the best practice for users is to avoid writing any code that may raise in `replica_fn`:++*   use repeated dataset so `OutOfRangeError` will be avoided;+*   avoid using assertion ops or some debugging ops like `tf.debugging.check_numerics`.+++#### Dataset Interface++The traditional training loop of `tf.distribute` passes the `get_next` results of a distributed iterator to `replica_fn`:+++```python+for x in enumerate(distributed_iter):+  loss = strategy.schedule(replica_fn, x, y)+```+++If we do the same thing with the `strategy.schedule` API, there are several challenges.++The first challenge is we don’t know which worker the `get_next` should return to since where the `replica_fn` will be executed will be decided later. Some later-binding mechanism can be explored.++The second challenge is calling `get_next` on an iterator is synchronous. This means that the training loop is not truly asynchronous. It is tricky to make `get_next` asynchronous because the client doesn’t know how many items will be in the iterator and thus doesn’t know how many functions to schedule. +++##### Alternative: passing iterators to `strategy.schedule`++The following training loop is less consistent with other `tf.distribute` examples but is easier to implement in the short term. It requires users to explicitly set a number of steps.+++```python+# … omitted+with strategy.scope():+  # … omitted+  distributed_iter = iter(distributed_dataset)+  for i in range(total_steps):+    strategy.schedule(replica_fn, args=(distributed_iter,))+# … omitted+```+++**We will start with this kind of training loop in our first version.**+++### Fault Tolerance+++#### Task Failure+++##### Worker failure+++###### When scheduling++When a worker fails, our training will continue without this failed worker. Functions scheduled on a failed worker will be rescheduled on other workers. ++For functions that bound to a specific worker, e.g. resource creation function, they will be queued until the worker is back.++When the failed worker is back, we will update the cluster configuration with `context.update_server_def` which would also reset all the states. After resources on the restarted worker are built, we can resume scheduling functions on the worker. +++###### When materialing a `Future` object++It is possible that a function is executed but its corresponding worker fails when users try to consume its output. In this case, we will give users a `None` value and set an error in the `Future` object.++We can mitigate the problem by eagerly materializing function outputs when they are passed to `local_results`. ++We can explore mechanisms to recover these objects in the future. In the short-term, users can choose to write the results to variables on parameter servers, just like a Keras metric.+++##### Parameter server failure++When a parameter server fails, the error will be propagated to the client via workers. Since the latest values of variables on the failed parameter servers are gone, there is no way for the client to recover them. Therefore the training will pause until the failed parameter server is back. The client then needs to clean up other variables on other parameter servers, rebuild all the variables and load variable values from a checkpoint. To trigger this process, the simplest method is to restart the client as well.+++##### Client failure++When a client fails, some scheduled functions will continue to run on workers. No new functions will be scheduled. When the client comes back, it will create variables, load from a checkpoint, schedule functions with a new context id. All the old variables will be garbage-collected when we reset their eager contexts.++#### Resource Management for Workers++When a worker has recovered from failure, we will need to rebuild iterators, worker-local variables, lookup tables and other resources on that worker that don’t need to be read from a checkpoint. This means that the client will have to keep track of these iterators, worker-local variables and other resources.++Keeping track of resources and rebuilding them will be achieved depending how users create their resources:+++*   we will record iterators created via `tf.distribute`’s API; The state of a rebuilt iterator will be lost. We can recover their states as future work.+*   we will capture the creation of worker-local variables via variable creator scopes.+*   in the future we will provide users an API to create worker-local resources. We will capture these resources in the API.++If users create iterators or other resources inside a function but don’t expose them as outputs, we will not rebuild them.+++#### The Unknown of Scheduled Functions++For functions that have been scheduled, it is difficult for the client to know whether they have actually been executed or not when the client detects their corresponding worker failure. Therefore, in addition to inform users of this uncertainty in the case of worker failure, we should do the following to reduce this uncertainty:

Depending on the consistency guarantees we make for the parameter server: Would it be possible (or even make sense) for the parameter server to track which functions have and haven't been completed?

This may let us do things like know exactly which functions need to be re-executed if a parameter server fails and a checkpoint needs to be reloaded.

Then again, if people are training asynchronously I guess it's not necessarily worth trying to provide reproducibility guarantees, or guarantees that each step will be run exactly once.

yuefengz

comment created time in 3 months

Pull request review commenttensorflow/community

Single-Client Parameter Server Training

+# Single-client Parameter Server Training++| Status        | Proposed                                             |+:-------------- |:---------------------------------------------------- |+| **Author(s)** | Yuefeng Zhou (yuefengz@google.com), Rick Chao (rchao@google.com) |+| **Sponsor**   | Priya Gupta (priyag@google.com)                      |+| **Updated**   | 2018-03-06                                           |++## Background++Parameter server training is a very commonly used distributed training architecture. It is especially relevant to models with large embeddings, for training on large clusters of machines with CPUs, or when scalability is preferred over determinism. Its high-level idea is to create variables on parameter servers and in each step let workers take different training input, pull variable values from parameter servers, compute gradients and send them to parameter servers.+++### Distribution Strategy++Distribution Strategy (`tf.distribute.Strategy`) is a library that aims to allow users to write simple model code and scale up their models automatically with decent out-of-box performance. We will design parameter server training under the umbrella of `tf.distribute.Strategy` in TensorFlow 2 with a single-client architecture, in contrast to the multi-client approach which is traditionally done in TensorFlow distributed training such as `tf.estimator.Estimator` or `tf.distribute.experimental.MultiWorkerMirroredStrategy`.++Distribution Strategy’s [custom training loop](https://www.tensorflow.org/tutorials/distribute/custom_training) (CTL) API has been popular among users who want  more control in writing their training loops. User community of this API is large. We would like to focus on supporting CTL API first and later abstract out a commonly used pattern for Keras `compile`/`fit` API.+++### Single-Client Distributed Training++We believe that a single-client architecture can provide a simpler programming model than multi-client setup. A single source of truth can avoid bugs due to inconsistencies in multi-client setup. Furthermore, a single source of control can enable more determinism. In extreme cases, it can launch long-running tasks and turn into multi-client effectively.+++## Goal++The goal of this project is to support multi-worker asynchronous training with `ParameterServerStrategy` and CTL API, and in the long term also Keras `model.fit()`. In the first stage of this project, we focus more on design ideas rather than the APIs.++The goal of this document is to discuss the high-level design and challenges of various pieces needed to support single-client parameter server training. Detailed designs for some pieces may be directed to other documents.+++## Overview+++### Programming Model++With a single-client architecture, the programming model will be different than the multi-client architecture. All workers and parameter servers are standard TensorFlow servers, and the user-defined program will run on the client only. Generally, no matter what high-level APIs users will use, the workflow for running a step function distributedly with single-client approach includes the following steps:+++1. Connect to all remote workers and parameter servers. +2. Create variables on parameter servers and hold references to them.+3. Create datasets and iterators on workers.+4. Create the replica function that takes an iterator as input, trace it and register it on all workers.  Note: a function may create variables as well. If not specified, they will be created on parameter servers as well.+5. Dispatch the step function on one available worker.+6. Repeat 5 until the end of epoch.+7. Repeat 5 - 6 until the stop criteria is reached.+++### Interfaces++One of our goals is to make `ParameterServerStrategy`’s API consistent with other strategies so that users can easily switch between them. This may be challenging due to the fundamental difference between synchronous and asynchronous training. Therefore, we try to use most of the Distribution Strategy’ APIs but occasionally we will create APIs that still make sense in other strategies as well.++**Note: all proposed APIs in this document are tentative and only valid for our first version. We will revisit them once we get enough feedback from users and the community.**+++#### Constraints++Function is first-class citizen. Users should only schedule functions instead of running individual ops, in addition to creating variables.++Users can occasionally run individual ops on the client, only for reporting purposes such as printing a metric’s value.+++#### Schedule/Join Primitives++The `strategy.run` API was initially developed for synchronous training. We propose a new pair of primitives to+++*   hide the details of load-balancing, fault tolerance and dynamic scheduling+*   expose the non-blocking semantics to users.++To enable scheduling a function on any worker, we recommend users create the same dataset, but may shuffle differently, on all workers via `strategy.experimental_distribute_datasets_from_function` API.+++```python+class ParameterServerStrategyV2:+  +  def schedule(self, replica_fn, args=(), kwargs=()):+    """Schedule the `replica_fn` on all replicas in a sync group (a worker).+   +    Schedule the `replica_fn` on all replicas in a sync group that is available,    +    returns a future of PerReplica immediately if `function` has return values.++    It implements at-least-once semantics for function execution. If a worker+    fails, it will try to reschedule the function on another replica group or throw     +    an exception to users. So this method assumes that function execution can be +    out of order and function inputs are shared between sync groups. ++    We don't support the cases where `args` or `kwargs` are bound to a specific+    sync group. We will consider supporting them in the future.++    If there are barriers in `replica_fn`, it is users' responsibility to make+    sure they won't cause deadlock.+    """+    pass+  +  def join(self):+    """Wait until all scheduled functions are finished.++    Raises an error if any of the functions fails to execute. In this case,+    there is no guarantee that non-failing functions will complete.++    When join() is being called, it is not allowed to call `schedule`.+    """+    pass++  def done(self):+    """Returns True if there is no pending functions to be executed."""+    pass++  def local_results(self, future_list):+    """Get concrete values of the future list.+    +    Poisoned future objects will give `None`.+    """+    pass+```+++#### Custom Training Loop++To construct a custom training loop, users need to+++*   use `strategy.experimental_distribute_datasets_from_function` to create one dataset per worker. At this point, we recommend against using `strategy.experimental_distribute_dataset`.+*   create models under `strategy.scope` so variables will be assigned to parameter servers.+*   likewise, create a Keras metric object under `strategy.scope`. We expect the metric variables to be stored on parameter servers. Each worker, within their `replica_fn`, updates the metric states.+*   use `strategy.schedule` to schedule the `replica_fn` on one remote worker. This `replica_fn` should take an iterator and perform forward and backward computation. This `strategy.schedule` will only schedule this `replica_fn` and returns one or several `Future` objects immediately.  +*   use `strategy.local_results` to get concrete values of results returned by `strategy.schedule`. This may be a blocking call if the result is not yet ready. With any failure that cannot be handled will be ignored and as a result some of the results may be `None`.+*   call `strategy.join` to wait until all scheduled functions are executed.++```python+# Connect to remote servers with a user-provided `ClusterResolver` object.+strategy = ParameterServerStrategyV2(cluster_resolver)++dataset_fn = # a function that returns a dataset++# Clone the dataset on all workers, shuffled with different seeds.+distributed_dataset = strategy.experimental_distribute_datasets_from_function(+      dataset_fn)++with strategy.scope():+  # Create variables on parameter servers in a round-robin fashion.+  model = create_model()+  optimizer = tf.keras.optimizers.Adam()+  accuracy = tf.keras.metrics.CategoricalAccuracy(name="train_accuracy")++  @tf.function+  def replica_fn(iterator):+    x, y = next(iterator)+    with tf.GradientTape() as tape:+      predictions = model(x, table, training=True)+      loss = compute_loss(y, predictions)+    gradients = tape.gradient(loss, model.trainable_variables)+    optimizer.apply_gradients(zip(gradients, model.trainable_variables))+    accuracy.update_state(y, predictions)+    return loss++  for _ in range(num_epoches):+    distributed_iter = iter(distributed_dataset)+    for i in range(steps_per_epoch):+      # strategy.schedule pushes a closure in the scheduling queue and+      # returns a list of future objects immediately.+      loss = strategy.schedule(replica_fn,+                               args=(distributed_iter,))+    strategy.join()+    model.save()  # save checkpoint/summary...+    print ("Loss = %f, accuracy = %f" % (+        strategy.local_results(loss), accuracy.result()))+```+++##### Alternative training loop: fully async++Another option from calling `join` every epoch, users can choose to schedule all steps and then asynchronously print metric values. This option doesn’t require any synchronization in epoch boundaries.+++```python+# … omitted+with strategy.scope():+  # … omitted+  for _ in range(total_steps)):+    strategy.schedule(step_fn, args=(iterators,))++  # Print accuracy value every one minute.+  while not strategy.done():+    print("Current accuracy: %f" % accuracy.result())+    time.sleep(60)+# … omitted+```++#### Error Reporting From `replica_fn`++Because of the non-blocking `schedule`, any exception raised in `replica_fn` wouldn’t be returned to users immediately. Actually an exception may pollute arbitrary number of functions in flight following the culprit function. We will set the error in returned `Future` objects for the culprit function and these polluted functions and we will raise exceptions when `join` is called.++Therefore the best practice for users is to avoid writing any code that may raise in `replica_fn`:++*   use repeated dataset so `OutOfRangeError` will be avoided;+*   avoid using assertion ops or some debugging ops like `tf.debugging.check_numerics`.+++#### Dataset Interface++The traditional training loop of `tf.distribute` passes the `get_next` results of a distributed iterator to `replica_fn`:+++```python+for x in enumerate(distributed_iter):+  loss = strategy.schedule(replica_fn, x, y)+```+++If we do the same thing with the `strategy.schedule` API, there are several challenges.++The first challenge is we don’t know which worker the `get_next` should return to since where the `replica_fn` will be executed will be decided later. Some later-binding mechanism can be explored.++The second challenge is calling `get_next` on an iterator is synchronous. This means that the training loop is not truly asynchronous. It is tricky to make `get_next` asynchronous because the client doesn’t know how many items will be in the iterator and thus doesn’t know how many functions to schedule. +++##### Alternative: passing iterators to `strategy.schedule`++The following training loop is less consistent with other `tf.distribute` examples but is easier to implement in the short term. It requires users to explicitly set a number of steps.+++```python+# … omitted+with strategy.scope():+  # … omitted+  distributed_iter = iter(distributed_dataset)+  for i in range(total_steps):+    strategy.schedule(replica_fn, args=(distributed_iter,))+# … omitted+```+++**We will start with this kind of training loop in our first version.**+++### Fault Tolerance+++#### Task Failure+++##### Worker failure+++###### When scheduling++When a worker fails, our training will continue without this failed worker. Functions scheduled on a failed worker will be rescheduled on other workers. ++For functions that bound to a specific worker, e.g. resource creation function, they will be queued until the worker is back.++When the failed worker is back, we will update the cluster configuration with `context.update_server_def` which would also reset all the states. After resources on the restarted worker are built, we can resume scheduling functions on the worker. +++###### When materialing a `Future` object++It is possible that a function is executed but its corresponding worker fails when users try to consume its output. In this case, we will give users a `None` value and set an error in the `Future` object.++We can mitigate the problem by eagerly materializing function outputs when they are passed to `local_results`. ++We can explore mechanisms to recover these objects in the future. In the short-term, users can choose to write the results to variables on parameter servers, just like a Keras metric.+++##### Parameter server failure++When a parameter server fails, the error will be propagated to the client via workers. Since the latest values of variables on the failed parameter servers are gone, there is no way for the client to recover them. Therefore the training will pause until the failed parameter server is back. The client then needs to clean up other variables on other parameter servers, rebuild all the variables and load variable values from a checkpoint. To trigger this process, the simplest method is to restart the client as well.+++##### Client failure++When a client fails, some scheduled functions will continue to run on workers. No new functions will be scheduled. When the client comes back, it will create variables, load from a checkpoint, schedule functions with a new context id. All the old variables will be garbage-collected when we reset their eager contexts.++#### Resource Management for Workers++When a worker has recovered from failure, we will need to rebuild iterators, worker-local variables, lookup tables and other resources on that worker that don’t need to be read from a checkpoint. This means that the client will have to keep track of these iterators, worker-local variables and other resources.++Keeping track of resources and rebuilding them will be achieved depending how users create their resources:+++*   we will record iterators created via `tf.distribute`’s API; The state of a rebuilt iterator will be lost. We can recover their states as future work.+*   we will capture the creation of worker-local variables via variable creator scopes.+*   in the future we will provide users an API to create worker-local resources. We will capture these resources in the API.++If users create iterators or other resources inside a function but don’t expose them as outputs, we will not rebuild them.+++#### The Unknown of Scheduled Functions++For functions that have been scheduled, it is difficult for the client to know whether they have actually been executed or not when the client detects their corresponding worker failure. Therefore, in addition to inform users of this uncertainty in the case of worker failure, we should do the following to reduce this uncertainty:+++*   keep the number of scheduled but not executed functions small. This may be difficult to achieve since there is not an easy way for the client to know whether a function is executed or not. The only way is to synchronize the executor. Therefore, as a workaround we will have to periodically synchronize the executor to make sure functions are actually executed, before the client schedules more functions. In the long run, we should get acknowledgement from runtime about how many functions have been executed.+*   eagerly fetch the outputs of remote functions once the outputs are passed to `strategy.local_result`. In this way, we can know the status of function execution earlier.+*   recommend users schedule only small functions. Large functions are more expensive to retry.+++### Evaluation++Historically, `tf.estimator.Estimator` uses a dedicated evaluator that periodically loads from a checkpoint, and performs evaluation with evaluation data. However `tf.keras` typically evaluates in an alternating manner after every epoch of training, and this is also the case with `tf.keras` + `MultiWorkerMirroredStrategy`. ++With `ParameterServerStrategyV2`, we will start with a dedicated** evaluator that runs alongside the training cluster**, **aka “sidecar evaluation”**; in this scheme, training client is required to generate checkpoints periodically, and the evaluator reads the latest checkpoint as it becomes available. The evaluation is asynchronous to the training progress. With this we provide the functionality Estimator has been able to with Keras API, which is important to attract updates from Estimator users to TF 2.0.

So just to clarify: The parameter server won't be shared by the training & evaluation job. Instead, they'll do what estimator does and communicate via checkpoints & the file system?

yuefengz

comment created time in 3 months

Pull request review commenttensorflow/community

Single-Client Parameter Server Training

+# Single-client Parameter Server Training++| Status        | Proposed                                             |+:-------------- |:---------------------------------------------------- |+| **Author(s)** | Yuefeng Zhou (yuefengz@google.com), Rick Chao (rchao@google.com) |+| **Sponsor**   | Priya Gupta (priyag@google.com)                      |+| **Updated**   | 2018-03-06                                           |++## Background++Parameter server training is a very commonly used distributed training architecture. It is especially relevant to models with large embeddings, for training on large clusters of machines with CPUs, or when scalability is preferred over determinism. Its high-level idea is to create variables on parameter servers and in each step let workers take different training input, pull variable values from parameter servers, compute gradients and send them to parameter servers.+++### Distribution Strategy++Distribution Strategy (`tf.distribute.Strategy`) is a library that aims to allow users to write simple model code and scale up their models automatically with decent out-of-box performance. We will design parameter server training under the umbrella of `tf.distribute.Strategy` in TensorFlow 2 with a single-client architecture, in contrast to the multi-client approach which is traditionally done in TensorFlow distributed training such as `tf.estimator.Estimator` or `tf.distribute.experimental.MultiWorkerMirroredStrategy`.++Distribution Strategy’s [custom training loop](https://www.tensorflow.org/tutorials/distribute/custom_training) (CTL) API has been popular among users who want  more control in writing their training loops. User community of this API is large. We would like to focus on supporting CTL API first and later abstract out a commonly used pattern for Keras `compile`/`fit` API.+++### Single-Client Distributed Training++We believe that a single-client architecture can provide a simpler programming model than multi-client setup. A single source of truth can avoid bugs due to inconsistencies in multi-client setup. Furthermore, a single source of control can enable more determinism. In extreme cases, it can launch long-running tasks and turn into multi-client effectively.+++## Goal++The goal of this project is to support multi-worker asynchronous training with `ParameterServerStrategy` and CTL API, and in the long term also Keras `model.fit()`. In the first stage of this project, we focus more on design ideas rather than the APIs.++The goal of this document is to discuss the high-level design and challenges of various pieces needed to support single-client parameter server training. Detailed designs for some pieces may be directed to other documents.+++## Overview+++### Programming Model++With a single-client architecture, the programming model will be different than the multi-client architecture. All workers and parameter servers are standard TensorFlow servers, and the user-defined program will run on the client only. Generally, no matter what high-level APIs users will use, the workflow for running a step function distributedly with single-client approach includes the following steps:+++1. Connect to all remote workers and parameter servers. +2. Create variables on parameter servers and hold references to them.+3. Create datasets and iterators on workers.+4. Create the replica function that takes an iterator as input, trace it and register it on all workers.  Note: a function may create variables as well. If not specified, they will be created on parameter servers as well.+5. Dispatch the step function on one available worker.+6. Repeat 5 until the end of epoch.+7. Repeat 5 - 6 until the stop criteria is reached.+++### Interfaces++One of our goals is to make `ParameterServerStrategy`’s API consistent with other strategies so that users can easily switch between them. This may be challenging due to the fundamental difference between synchronous and asynchronous training. Therefore, we try to use most of the Distribution Strategy’ APIs but occasionally we will create APIs that still make sense in other strategies as well.++**Note: all proposed APIs in this document are tentative and only valid for our first version. We will revisit them once we get enough feedback from users and the community.**+++#### Constraints++Function is first-class citizen. Users should only schedule functions instead of running individual ops, in addition to creating variables.++Users can occasionally run individual ops on the client, only for reporting purposes such as printing a metric’s value.+++#### Schedule/Join Primitives++The `strategy.run` API was initially developed for synchronous training. We propose a new pair of primitives to+++*   hide the details of load-balancing, fault tolerance and dynamic scheduling+*   expose the non-blocking semantics to users.++To enable scheduling a function on any worker, we recommend users create the same dataset, but may shuffle differently, on all workers via `strategy.experimental_distribute_datasets_from_function` API.+++```python+class ParameterServerStrategyV2:+  +  def schedule(self, replica_fn, args=(), kwargs=()):+    """Schedule the `replica_fn` on all replicas in a sync group (a worker).+   +    Schedule the `replica_fn` on all replicas in a sync group that is available,    +    returns a future of PerReplica immediately if `function` has return values.++    It implements at-least-once semantics for function execution. If a worker+    fails, it will try to reschedule the function on another replica group or throw     +    an exception to users. So this method assumes that function execution can be +    out of order and function inputs are shared between sync groups. ++    We don't support the cases where `args` or `kwargs` are bound to a specific+    sync group. We will consider supporting them in the future.++    If there are barriers in `replica_fn`, it is users' responsibility to make+    sure they won't cause deadlock.+    """+    pass+  +  def join(self):+    """Wait until all scheduled functions are finished.++    Raises an error if any of the functions fails to execute. In this case,+    there is no guarantee that non-failing functions will complete.++    When join() is being called, it is not allowed to call `schedule`.+    """+    pass++  def done(self):+    """Returns True if there is no pending functions to be executed."""+    pass++  def local_results(self, future_list):+    """Get concrete values of the future list.+    +    Poisoned future objects will give `None`.+    """+    pass+```+++#### Custom Training Loop++To construct a custom training loop, users need to+++*   use `strategy.experimental_distribute_datasets_from_function` to create one dataset per worker. At this point, we recommend against using `strategy.experimental_distribute_dataset`.+*   create models under `strategy.scope` so variables will be assigned to parameter servers.+*   likewise, create a Keras metric object under `strategy.scope`. We expect the metric variables to be stored on parameter servers. Each worker, within their `replica_fn`, updates the metric states.+*   use `strategy.schedule` to schedule the `replica_fn` on one remote worker. This `replica_fn` should take an iterator and perform forward and backward computation. This `strategy.schedule` will only schedule this `replica_fn` and returns one or several `Future` objects immediately.  +*   use `strategy.local_results` to get concrete values of results returned by `strategy.schedule`. This may be a blocking call if the result is not yet ready. With any failure that cannot be handled will be ignored and as a result some of the results may be `None`.+*   call `strategy.join` to wait until all scheduled functions are executed.++```python+# Connect to remote servers with a user-provided `ClusterResolver` object.+strategy = ParameterServerStrategyV2(cluster_resolver)++dataset_fn = # a function that returns a dataset++# Clone the dataset on all workers, shuffled with different seeds.+distributed_dataset = strategy.experimental_distribute_datasets_from_function(+      dataset_fn)++with strategy.scope():+  # Create variables on parameter servers in a round-robin fashion.+  model = create_model()+  optimizer = tf.keras.optimizers.Adam()+  accuracy = tf.keras.metrics.CategoricalAccuracy(name="train_accuracy")++  @tf.function+  def replica_fn(iterator):+    x, y = next(iterator)+    with tf.GradientTape() as tape:+      predictions = model(x, table, training=True)+      loss = compute_loss(y, predictions)+    gradients = tape.gradient(loss, model.trainable_variables)+    optimizer.apply_gradients(zip(gradients, model.trainable_variables))+    accuracy.update_state(y, predictions)+    return loss++  for _ in range(num_epoches):+    distributed_iter = iter(distributed_dataset)+    for i in range(steps_per_epoch):+      # strategy.schedule pushes a closure in the scheduling queue and+      # returns a list of future objects immediately.+      loss = strategy.schedule(replica_fn,+                               args=(distributed_iter,))+    strategy.join()+    model.save()  # save checkpoint/summary...+    print ("Loss = %f, accuracy = %f" % (+        strategy.local_results(loss), accuracy.result()))+```+++##### Alternative training loop: fully async++Another option from calling `join` every epoch, users can choose to schedule all steps and then asynchronously print metric values. This option doesn’t require any synchronization in epoch boundaries.+++```python+# … omitted+with strategy.scope():+  # … omitted+  for _ in range(total_steps)):+    strategy.schedule(step_fn, args=(iterators,))++  # Print accuracy value every one minute.+  while not strategy.done():+    print("Current accuracy: %f" % accuracy.result())+    time.sleep(60)+# … omitted+```++#### Error Reporting From `replica_fn`++Because of the non-blocking `schedule`, any exception raised in `replica_fn` wouldn’t be returned to users immediately. Actually an exception may pollute arbitrary number of functions in flight following the culprit function. We will set the error in returned `Future` objects for the culprit function and these polluted functions and we will raise exceptions when `join` is called.++Therefore the best practice for users is to avoid writing any code that may raise in `replica_fn`:++*   use repeated dataset so `OutOfRangeError` will be avoided;+*   avoid using assertion ops or some debugging ops like `tf.debugging.check_numerics`.+++#### Dataset Interface++The traditional training loop of `tf.distribute` passes the `get_next` results of a distributed iterator to `replica_fn`:+++```python+for x in enumerate(distributed_iter):+  loss = strategy.schedule(replica_fn, x, y)+```+++If we do the same thing with the `strategy.schedule` API, there are several challenges.++The first challenge is we don’t know which worker the `get_next` should return to since where the `replica_fn` will be executed will be decided later. Some later-binding mechanism can be explored.++The second challenge is calling `get_next` on an iterator is synchronous. This means that the training loop is not truly asynchronous. It is tricky to make `get_next` asynchronous because the client doesn’t know how many items will be in the iterator and thus doesn’t know how many functions to schedule. +++##### Alternative: passing iterators to `strategy.schedule`++The following training loop is less consistent with other `tf.distribute` examples but is easier to implement in the short term. It requires users to explicitly set a number of steps.+++```python+# … omitted+with strategy.scope():+  # … omitted+  distributed_iter = iter(distributed_dataset)+  for i in range(total_steps):+    strategy.schedule(replica_fn, args=(distributed_iter,))+# … omitted+```+++**We will start with this kind of training loop in our first version.**+++### Fault Tolerance+++#### Task Failure+++##### Worker failure+++###### When scheduling++When a worker fails, our training will continue without this failed worker. Functions scheduled on a failed worker will be rescheduled on other workers. ++For functions that bound to a specific worker, e.g. resource creation function, they will be queued until the worker is back.++When the failed worker is back, we will update the cluster configuration with `context.update_server_def` which would also reset all the states. After resources on the restarted worker are built, we can resume scheduling functions on the worker. +++###### When materialing a `Future` object++It is possible that a function is executed but its corresponding worker fails when users try to consume its output. In this case, we will give users a `None` value and set an error in the `Future` object.++We can mitigate the problem by eagerly materializing function outputs when they are passed to `local_results`. ++We can explore mechanisms to recover these objects in the future. In the short-term, users can choose to write the results to variables on parameter servers, just like a Keras metric.+++##### Parameter server failure++When a parameter server fails, the error will be propagated to the client via workers. Since the latest values of variables on the failed parameter servers are gone, there is no way for the client to recover them. Therefore the training will pause until the failed parameter server is back. The client then needs to clean up other variables on other parameter servers, rebuild all the variables and load variable values from a checkpoint. To trigger this process, the simplest method is to restart the client as well.+++##### Client failure++When a client fails, some scheduled functions will continue to run on workers. No new functions will be scheduled. When the client comes back, it will create variables, load from a checkpoint, schedule functions with a new context id. All the old variables will be garbage-collected when we reset their eager contexts.++#### Resource Management for Workers++When a worker has recovered from failure, we will need to rebuild iterators, worker-local variables, lookup tables and other resources on that worker that don’t need to be read from a checkpoint. This means that the client will have to keep track of these iterators, worker-local variables and other resources.++Keeping track of resources and rebuilding them will be achieved depending how users create their resources:+++*   we will record iterators created via `tf.distribute`’s API; The state of a rebuilt iterator will be lost. We can recover their states as future work.+*   we will capture the creation of worker-local variables via variable creator scopes.+*   in the future we will provide users an API to create worker-local resources. We will capture these resources in the API.++If users create iterators or other resources inside a function but don’t expose them as outputs, we will not rebuild them.+++#### The Unknown of Scheduled Functions++For functions that have been scheduled, it is difficult for the client to know whether they have actually been executed or not when the client detects their corresponding worker failure. Therefore, in addition to inform users of this uncertainty in the case of worker failure, we should do the following to reduce this uncertainty:+++*   keep the number of scheduled but not executed functions small. This may be difficult to achieve since there is not an easy way for the client to know whether a function is executed or not. The only way is to synchronize the executor. Therefore, as a workaround we will have to periodically synchronize the executor to make sure functions are actually executed, before the client schedules more functions. In the long run, we should get acknowledgement from runtime about how many functions have been executed.+*   eagerly fetch the outputs of remote functions once the outputs are passed to `strategy.local_result`. In this way, we can know the status of function execution earlier.+*   recommend users schedule only small functions. Large functions are more expensive to retry.+++### Evaluation++Historically, `tf.estimator.Estimator` uses a dedicated evaluator that periodically loads from a checkpoint, and performs evaluation with evaluation data. However `tf.keras` typically evaluates in an alternating manner after every epoch of training, and this is also the case with `tf.keras` + `MultiWorkerMirroredStrategy`. ++With `ParameterServerStrategyV2`, we will start with a dedicated** evaluator that runs alongside the training cluster**, **aka “sidecar evaluation”**; in this scheme, training client is required to generate checkpoints periodically, and the evaluator reads the latest checkpoint as it becomes available. The evaluation is asynchronous to the training progress. With this we provide the functionality Estimator has been able to with Keras API, which is important to attract updates from Estimator users to TF 2.0.++With our recommendation, users should create a separate evaluation client that runs the same python binary as the training client. This python binary will contain the if-else clause as it bifurcates into two paths:

Given it's two separate jobs, why require it be the same python binary instead of having two separate python binaries? Internally we have pretty good tooling for bringing up one binary as a training job and another as an evaluation job.

yuefengz

comment created time in 3 months

Pull request review commenttensorflow/community

Single-Client Parameter Server Training

+# Single-client Parameter Server Training++| Status        | Proposed                                             |+:-------------- |:---------------------------------------------------- |+| **Author(s)** | Yuefeng Zhou (yuefengz@google.com), Rick Chao (rchao@google.com) |+| **Sponsor**   | Priya Gupta (priyag@google.com)                      |+| **Updated**   | 2018-03-06                                           |++## Background++Parameter server training is a very commonly used distributed training architecture. It is especially relevant to models with large embeddings, for training on large clusters of machines with CPUs, or when scalability is preferred over determinism. Its high-level idea is to create variables on parameter servers and in each step let workers take different training input, pull variable values from parameter servers, compute gradients and send them to parameter servers.+++### Distribution Strategy++Distribution Strategy (`tf.distribute.Strategy`) is a library that aims to allow users to write simple model code and scale up their models automatically with decent out-of-box performance. We will design parameter server training under the umbrella of `tf.distribute.Strategy` in TensorFlow 2 with a single-client architecture, in contrast to the multi-client approach which is traditionally done in TensorFlow distributed training such as `tf.estimator.Estimator` or `tf.distribute.experimental.MultiWorkerMirroredStrategy`.++Distribution Strategy’s [custom training loop](https://www.tensorflow.org/tutorials/distribute/custom_training) (CTL) API has been popular among users who want  more control in writing their training loops. User community of this API is large. We would like to focus on supporting CTL API first and later abstract out a commonly used pattern for Keras `compile`/`fit` API.+++### Single-Client Distributed Training++We believe that a single-client architecture can provide a simpler programming model than multi-client setup. A single source of truth can avoid bugs due to inconsistencies in multi-client setup. Furthermore, a single source of control can enable more determinism. In extreme cases, it can launch long-running tasks and turn into multi-client effectively.+++## Goal++The goal of this project is to support multi-worker asynchronous training with `ParameterServerStrategy` and CTL API, and in the long term also Keras `model.fit()`. In the first stage of this project, we focus more on design ideas rather than the APIs.++The goal of this document is to discuss the high-level design and challenges of various pieces needed to support single-client parameter server training. Detailed designs for some pieces may be directed to other documents.+++## Overview+++### Programming Model++With a single-client architecture, the programming model will be different than the multi-client architecture. All workers and parameter servers are standard TensorFlow servers, and the user-defined program will run on the client only. Generally, no matter what high-level APIs users will use, the workflow for running a step function distributedly with single-client approach includes the following steps:+++1. Connect to all remote workers and parameter servers. +2. Create variables on parameter servers and hold references to them.+3. Create datasets and iterators on workers.+4. Create the replica function that takes an iterator as input, trace it and register it on all workers.  Note: a function may create variables as well. If not specified, they will be created on parameter servers as well.+5. Dispatch the step function on one available worker.+6. Repeat 5 until the end of epoch.+7. Repeat 5 - 6 until the stop criteria is reached.+++### Interfaces++One of our goals is to make `ParameterServerStrategy`’s API consistent with other strategies so that users can easily switch between them. This may be challenging due to the fundamental difference between synchronous and asynchronous training. Therefore, we try to use most of the Distribution Strategy’ APIs but occasionally we will create APIs that still make sense in other strategies as well.++**Note: all proposed APIs in this document are tentative and only valid for our first version. We will revisit them once we get enough feedback from users and the community.**+++#### Constraints++Function is first-class citizen. Users should only schedule functions instead of running individual ops, in addition to creating variables.++Users can occasionally run individual ops on the client, only for reporting purposes such as printing a metric’s value.+++#### Schedule/Join Primitives++The `strategy.run` API was initially developed for synchronous training. We propose a new pair of primitives to+++*   hide the details of load-balancing, fault tolerance and dynamic scheduling+*   expose the non-blocking semantics to users.++To enable scheduling a function on any worker, we recommend users create the same dataset, but may shuffle differently, on all workers via `strategy.experimental_distribute_datasets_from_function` API.+++```python+class ParameterServerStrategyV2:+  +  def schedule(self, replica_fn, args=(), kwargs=()):+    """Schedule the `replica_fn` on all replicas in a sync group (a worker).+   +    Schedule the `replica_fn` on all replicas in a sync group that is available,    +    returns a future of PerReplica immediately if `function` has return values.++    It implements at-least-once semantics for function execution. If a worker+    fails, it will try to reschedule the function on another replica group or throw     +    an exception to users. So this method assumes that function execution can be +    out of order and function inputs are shared between sync groups. ++    We don't support the cases where `args` or `kwargs` are bound to a specific+    sync group. We will consider supporting them in the future.++    If there are barriers in `replica_fn`, it is users' responsibility to make+    sure they won't cause deadlock.+    """+    pass+  +  def join(self):+    """Wait until all scheduled functions are finished.++    Raises an error if any of the functions fails to execute. In this case,+    there is no guarantee that non-failing functions will complete.++    When join() is being called, it is not allowed to call `schedule`.+    """+    pass++  def done(self):+    """Returns True if there is no pending functions to be executed."""+    pass++  def local_results(self, future_list):+    """Get concrete values of the future list.+    +    Poisoned future objects will give `None`.+    """+    pass+```+++#### Custom Training Loop++To construct a custom training loop, users need to+++*   use `strategy.experimental_distribute_datasets_from_function` to create one dataset per worker. At this point, we recommend against using `strategy.experimental_distribute_dataset`.+*   create models under `strategy.scope` so variables will be assigned to parameter servers.+*   likewise, create a Keras metric object under `strategy.scope`. We expect the metric variables to be stored on parameter servers. Each worker, within their `replica_fn`, updates the metric states.+*   use `strategy.schedule` to schedule the `replica_fn` on one remote worker. This `replica_fn` should take an iterator and perform forward and backward computation. This `strategy.schedule` will only schedule this `replica_fn` and returns one or several `Future` objects immediately.  +*   use `strategy.local_results` to get concrete values of results returned by `strategy.schedule`. This may be a blocking call if the result is not yet ready. With any failure that cannot be handled will be ignored and as a result some of the results may be `None`.+*   call `strategy.join` to wait until all scheduled functions are executed.++```python+# Connect to remote servers with a user-provided `ClusterResolver` object.+strategy = ParameterServerStrategyV2(cluster_resolver)++dataset_fn = # a function that returns a dataset++# Clone the dataset on all workers, shuffled with different seeds.+distributed_dataset = strategy.experimental_distribute_datasets_from_function(+      dataset_fn)++with strategy.scope():+  # Create variables on parameter servers in a round-robin fashion.+  model = create_model()+  optimizer = tf.keras.optimizers.Adam()+  accuracy = tf.keras.metrics.CategoricalAccuracy(name="train_accuracy")++  @tf.function+  def replica_fn(iterator):+    x, y = next(iterator)+    with tf.GradientTape() as tape:+      predictions = model(x, table, training=True)+      loss = compute_loss(y, predictions)+    gradients = tape.gradient(loss, model.trainable_variables)+    optimizer.apply_gradients(zip(gradients, model.trainable_variables))+    accuracy.update_state(y, predictions)+    return loss++  for _ in range(num_epoches):+    distributed_iter = iter(distributed_dataset)+    for i in range(steps_per_epoch):+      # strategy.schedule pushes a closure in the scheduling queue and+      # returns a list of future objects immediately.+      loss = strategy.schedule(replica_fn,+                               args=(distributed_iter,))+    strategy.join()+    model.save()  # save checkpoint/summary...+    print ("Loss = %f, accuracy = %f" % (+        strategy.local_results(loss), accuracy.result()))+```+++##### Alternative training loop: fully async++Another option from calling `join` every epoch, users can choose to schedule all steps and then asynchronously print metric values. This option doesn’t require any synchronization in epoch boundaries.+++```python+# … omitted+with strategy.scope():+  # … omitted+  for _ in range(total_steps)):+    strategy.schedule(step_fn, args=(iterators,))++  # Print accuracy value every one minute.+  while not strategy.done():+    print("Current accuracy: %f" % accuracy.result())

What are the consistency/atomicity guarantees we're aiming for?

E.g. computing accuracy may require 2 (or more) variables. Depending on the parameter server updates, the metric could be computed from an invalid state.

yuefengz

comment created time in 3 months

issue commenttensorflow/tensorflow

tf throw error and not may get correct gradient when eager_mode is disable in tf=2.0.0

I'm assuming this has been resolved by the same answer as for https://github.com/tensorflow/tensorflow/issues/35107

Where: If you disable eager execution you should be importing tensorflow.compat.v1 as tf instead of tensorflow as tf, to make sure the APIs & behaviors match?

(Feel free to re-open if that's not the case!)

CNOCycle

comment created time in 3 months

issue commenttensorflow/addons

Addons Sparsemax gets different results than Contrib sparsemax

I've passed along the request for a reproducible example, will follow up if they manage to make one.

tomerk

comment created time in 3 months

issue openedtensorflow/addons

Addons Sparsemax gets different results than Contrib sparsemax

System information

  • TensorFlow version and how it was installed (source or binary): Nightly
  • TensorFlow-Addons version and how it was installed (source or binary): git commit 8bacf4d434974b8821a598aa4d1c9534a6581a5e

Describe the bug We've gotten a google-internal bug report that sparsemax implementation in tensorflow addons: https://github.com/tensorflow/addons/blob/8bacf4d434974b8821a598aa4d1c9534a6581a5e/tensorflow_addons/activations/sparsemax.py#L25

Produces different results than the sparsemax implementation that was in contrib: https://github.com/tensorflow/tensorflow/blob/23c218785eac5bfe737eec4f8081fd0ef8e0684d/tensorflow/contrib/sparsemax/python/ops/sparsemax.py#L30

For the user's model, the contrib implementation produced better results than the addons implementation.


Are these differences expected? Or is it possible that there is a bug in the addons sparsemax implementation?

created time in 3 months

Pull request review commenttensorflow/tensorflow

Add custom_objects in the function layer_test

 def layer_test(layer_cls, kwargs=None, input_shape=None, input_dtype=None,       string or integer values.     adapt_data: Optional data for an 'adapt' call. If None, adapt() will not       be tested for this layer. This is only relevant for PreprocessingLayers.-+    custom_objects: Optional Objects for your custom layer. If you write a +      your layer, you can use this variable.

I think @yhliang2018 meant that you should update this description to instead say:

custom_objects: Optional dictionary mapping name strings to custom objects in the layer class. This is helpful for testing custom layers.

Then we can go ahead and approve this.

MokkeMeguru

comment created time in 3 months

issue closedtensorflow/tensorflow

lower accuracy when model.compile() inside strategy.scope() (TPU)

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 18.04
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: N/A
  • TensorFlow installed from (source or binary): - TensorFlow version (use command below): TF 2.1.0
  • Python version: - Bazel version (if compiling from source): N/A
  • GCC/Compiler version (if compiling from source): N/A
  • CUDA/cuDNN version: - GPU model and memory: N/A

Describe the current behavior

I am running model training in a GCP VM by using TPU v3-8.

When compiling a model outside strategy.scope(), during training, the validation 'sparse_categorical_accuracy' is always about 2 ~ 3 % higher than compiling the model inside strategy.scope().

The log provided below is for sequential models (created along with some pre-trained image classification models). However, I also tried to use subclassed model of tf.keras.models.Model. This time, compiling inside / outside strategy.scope() gives almost the same results, which is equivalent to sequential models with compiling inside strategy.scope() --> That means, a lower accuracy than sequential models with compiling outside strategy.scope().

For subclassed model, I also tried to use custom distributed training loop, which also gives the lower accuracy.

I can provide the logs for using subclassed class, if you think it's helpful. (I need to re-run the test for this part.)

Describe the expected behavior

I expect both compiling inside or outside strategy.scope() having the same performance.

Standalone code to reproduce the issue Provide a reproducible test case that is the bare minimum necessary to generate the problem. If possible, please share a link to Colab/Jupyter/any notebook.

(If necessary, I can provide the whole script. It's some python code for the Kaggle competition https://www.kaggle.com/c/flower-classification-with-tpus.

For compiling outside strategy.scope()

from tensorflow.keras.applications import Xception
from tensorflow.keras.applications import DenseNet201
from tensorflow.keras.applications import ResNet152V2

tpu = tf.distribute.cluster_resolver.TPUClusterResolver(tpu=TPU_WORKER, zone=ZONE, project=PROJECT)  # TPU detection. No parameters necessary if TPU_NAME environment variable is set. On Kaggle this is always the case.
print('Running on TPU ', tpu.master())

tf.config.experimental_connect_to_cluster(tpu)
tf.tpu.experimental.initialize_tpu_system(tpu)
strategy = tf.distribute.experimental.TPUStrategy(tpu)

backend_model_name = "ResNet152V2"

with strategy.scope():

    flower_classifier = tf.keras.Sequential(
        [
            backend_model(weights='imagenet', include_top=False ,input_shape=(*IMAGE_SIZE, 3)),
            tf.keras.layers.GlobalAveragePooling2D(),
            tf.keras.layers.Dense(len(CLASSES), activation='softmax', name='prob_dist')
        ]
    )

flower_classifier.compile(
    optimizer=tf.keras.optimizers.Adam(lr=0.00001),
    loss = 'sparse_categorical_crossentropy',
    metrics=['sparse_categorical_accuracy']
)

history = flower_classifier.fit(
    get_training_dataset(),
    steps_per_epoch=STEPS_PER_EPOCH,
    epochs=epochs,
    validation_data=get_validation_dataset(ordered=True),
    validation_steps=eval_steps
)

For compiling inside strategy.scope()

with strategy.scope():

    flower_classifier = tf.keras.Sequential(
        [
            backend_model(weights='imagenet', include_top=False ,input_shape=(*IMAGE_SIZE, 3)),
            tf.keras.layers.GlobalAveragePooling2D(),
            tf.keras.layers.Dense(len(CLASSES), activation='softmax', name='prob_dist')
        ]
    )

    flower_classifier.compile(
        optimizer=tf.keras.optimizers.Adam(lr=0.00001),
        loss = 'sparse_categorical_crossentropy',
        metrics=['sparse_categorical_accuracy']
    )

For subclassed model, it's like

class Flower_Classifier(tf.keras.models.Model):

def __init__(self, backend_model):

    super(Flower_Classifier, self).__init__()

    self.image_embedding_layer = backend_model(weights='imagenet', include_top=False ,input_shape=(*IMAGE_SIZE, 3))
    self.pooling_layer = tf.keras.layers.GlobalAveragePooling2D()
    self.prob_dist_layer = tf.keras.layers.Dense(len(CLASSES), activation='softmax', name='prob_dist')

def call(self, inputs, training=False):

    embedding = self.image_embedding_layer(inputs, training=training)
    pooling = self.pooling_layer(embedding)
    prob_dist = self.prob_dist_layer(pooling)

    return prob_dist

Other info / logs Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.

I run test on 3 models (DenseNet201, Xception, ResNet152V2). Each is model run 3 times for both compiling inside/outside strategy.scope(). Each run is a training of 30 epochs. The following are the best 10 validation accuracy for each run.

DenseNet201:

    compile inside strategy.scope():

        run 1: val_acc = [0.9094828, 0.90921336, 0.90921336, 0.90894395, 0.90867454, 0.9081358, 0.9059806, 0.90113145, 0.90005386, 0.89897627]
        run 2: val_acc = [0.90894395, 0.90894395, 0.90678877, 0.9040948, 0.9038254, 0.9038254, 0.9016703, 0.9016703, 0.9005927, 0.90005386]
        run 3: val_acc = [0.907597, 0.9073276, 0.9070582, 0.90678877, 0.9065194, 0.904903, 0.9038254, 0.90328664, 0.90005386, 0.8981681]

    compile outside strategy.scope():

        run 1: val_acc = [0.9288793, 0.92672414, 0.92672414, 0.92672414, 0.92456895, 0.92456895, 0.92456895, 0.92456895, 0.92241377, 0.92241377]
        run 2: val_acc = [0.92672414, 0.92456895, 0.92456895, 0.92456895, 0.92241377, 0.92241377, 0.92025864, 0.92025864, 0.91810346, 0.91810346]
        run 3: val_acc = [0.93318963, 0.93318963, 0.9288793, 0.9288793, 0.92672414, 0.92672414, 0.92456895, 0.92241377, 0.92241377, 0.92025864]

ResNet152V2:

    compile inside strategy.scope():

        run 1: val_acc = [0.828125, 0.8270474, 0.8248922, 0.82435346, 0.82408404, 0.82408404, 0.82300645, 0.82300645, 0.8211207, 0.8200431]
        run 2: val_acc = [0.8278556, 0.8273168, 0.8257004, 0.8235453, 0.82327586, 0.82327586, 0.82192886, 0.8213901, 0.8189655, 0.8184267]
        run 3: val_acc = [0.828125, 0.8262392, 0.82543105, 0.82516164, 0.82462287, 0.8224677, 0.8221983, 0.81977373, 0.81869614, 0.8146552]

    compile outside strategy.scope():

        run 1: [0.85775864, 0.85775864, 0.8512931, 0.8491379, 0.8469828, 0.8448276, 0.8426724, 0.8405172, 0.8405172, 0.83836204]
        run 2: [0.8448276, 0.8426724, 0.8405172, 0.83836204, 0.8362069, 0.8340517, 0.8340517, 0.8340517, 0.83189654, 0.8275862]
        run 3: [0.86422414, 0.86206895, 0.86206895, 0.85991377, 0.85775864, 0.85560346, 0.85560346, 0.85560346, 0.85560346, 0.8534483]

Xception:

    compile inside strategy.scope():

        run 1: val_acc = [0.8445582, 0.8418642, 0.83943963, 0.83432114, 0.83162713, 0.83081895, 0.82327586, 0.81869614, 0.8125, 0.80630386]
        run 2: val_acc = [0.8504849, 0.84886855, 0.8448276, 0.8445582, 0.83943963, 0.8356681, 0.8313578, 0.8213901, 0.81977373, 0.8127694]
        run 3: val_acc = [0.8507543, 0.8483297, 0.84428877, 0.83863145, 0.8370151, 0.8292026, 0.82273704, 0.8170797, 0.80953664, 0.8036099]

    compile outside strategy.scope():

        run 1: val_acc = [0.8836207, 0.8771552, 0.8728448, 0.8685345, 0.8663793, 0.86206895, 0.85991377, 0.8491379, 0.83836204, 0.8362069]
        run 2: val_acc = [0.8814655, 0.8771552, 0.875, 0.8728448, 0.8663793, 0.85991377, 0.85560346, 0.8491379, 0.8405172, 0.82974136]
        run 3: val_acc = [0.87068963, 0.86422414, 0.86206895, 0.85775864, 0.8512931, 0.8426724, 0.8426724, 0.8426724, 0.83189654, 0.8125]

closed time in 3 months

chiapas

issue commenttensorflow/tensorflow

lower accuracy when model.compile() inside strategy.scope() (TPU)

Hi @chiapas , it was a mix of issues:

  1. Metrics have to be used in the same strategy they were created in, or it can cause issues with TPUStrategy (because if you do it wrong you'll only get the metric result from one replica instead of the overall metric aggregated across all of your replicas). Nightly should throw an error if it detects this now (on TPUStrategy). It did not use to before.

  2. Even if a model was created in a distribution strategy scope (and captures that distribution strategy), compile wasn't entering the captured scope. Because compile sometimes creates metrics (if you pass in string names), this meant the metrics compile created were not in the model's distribution strategy scope.

We've now made it so that:

  • compile enters the distribution strategy scope the model was created in, allowing the metrics to be created with the correct distribution strategy
  • If you explicitly pass in a pre-constructed metric to compile that was created in the wrong scope, compile will raise an error.
chiapas

comment created time in 3 months

more