
Lower accuracy when model.compile() inside strategy.scope() (TPU)

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 18.04
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: N/A
  • TensorFlow installed from (source or binary):
  • TensorFlow version (use command below): TF 2.1.0
  • Python version:
  • Bazel version (if compiling from source): N/A
  • GCC/Compiler version (if compiling from source): N/A
  • CUDA/cuDNN version: N/A
  • GPU model and memory: N/A

Describe the current behavior

I am running model training in a GCP VM by using TPU v3-8.

When the model is compiled outside strategy.scope(), the validation 'sparse_categorical_accuracy' during training is consistently about 2-3% higher than when the model is compiled inside strategy.scope().

The logs provided below are for Sequential models (built on top of some pre-trained image classification models). I also tried a subclassed tf.keras.models.Model. In that case, compiling inside or outside strategy.scope() gives almost the same results, matching the Sequential models compiled inside strategy.scope(). That is, the accuracy is lower than for Sequential models compiled outside strategy.scope().

For the subclassed model, I also tried a custom distributed training loop, which also gives the lower accuracy; see the sketch below.
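(Illustrative sketch only, not my exact code: it assumes the strategy, the flower_classifier model, and the get_training_dataset() helper from the snippets below; GLOBAL_BATCH_SIZE is a placeholder for the global batch size.)

with strategy.scope():
    # Optimizer, loss, and metric are created under the same strategy
    # they are used with (see the answer below for why this matters).
    optimizer = tf.keras.optimizers.Adam(learning_rate=0.00001)
    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(
        reduction=tf.keras.losses.Reduction.NONE)  # keep per-example losses
    train_acc = tf.keras.metrics.SparseCategoricalAccuracy()

dist_dataset = strategy.experimental_distribute_dataset(get_training_dataset())

@tf.function
def train_step(images, labels):
    def step_fn(images, labels):
        with tf.GradientTape() as tape:
            probs = flower_classifier(images, training=True)
            per_example_loss = loss_fn(labels, probs)
            loss = tf.nn.compute_average_loss(
                per_example_loss, global_batch_size=GLOBAL_BATCH_SIZE)
        grads = tape.gradient(loss, flower_classifier.trainable_variables)
        optimizer.apply_gradients(zip(grads, flower_classifier.trainable_variables))
        train_acc.update_state(labels, probs)
        return loss

    # experimental_run_v2 in TF 2.1; renamed to strategy.run in TF >= 2.2
    per_replica_loss = strategy.experimental_run_v2(step_fn, args=(images, labels))
    return strategy.reduce(tf.distribute.ReduceOp.SUM, per_replica_loss, axis=None)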

I can provide the logs for the subclassed model if you think that would be helpful. (I would need to re-run the test for that part.)

Describe the expected behavior

I expect compiling inside or outside strategy.scope() to give the same performance.

Standalone code to reproduce the issue

Provide a reproducible test case that is the bare minimum necessary to generate the problem. If possible, please share a link to Colab/Jupyter/any notebook.

(If necessary, I can provide the whole script. It's Python code for the Kaggle competition https://www.kaggle.com/c/flower-classification-with-tpus.)

For compiling outside strategy.scope()

import tensorflow as tf

from tensorflow.keras.applications import Xception
from tensorflow.keras.applications import DenseNet201
from tensorflow.keras.applications import ResNet152V2

# TPU detection. No parameters necessary if the TPU_NAME environment
# variable is set. On Kaggle this is always the case.
tpu = tf.distribute.cluster_resolver.TPUClusterResolver(tpu=TPU_WORKER, zone=ZONE, project=PROJECT)
print('Running on TPU ', tpu.master())

tf.config.experimental_connect_to_cluster(tpu)
tf.tpu.experimental.initialize_tpu_system(tpu)
strategy = tf.distribute.experimental.TPUStrategy(tpu)

backend_model = ResNet152V2  # one of Xception, DenseNet201, ResNet152V2

with strategy.scope():

    flower_classifier = tf.keras.Sequential(
        [
            backend_model(weights='imagenet', include_top=False, input_shape=(*IMAGE_SIZE, 3)),
            tf.keras.layers.GlobalAveragePooling2D(),
            tf.keras.layers.Dense(len(CLASSES), activation='softmax', name='prob_dist')
        ]
    )

# Compiled OUTSIDE strategy.scope()
flower_classifier.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.00001),
    loss='sparse_categorical_crossentropy',
    metrics=['sparse_categorical_accuracy']
)

history = flower_classifier.fit(
    get_training_dataset(),
    steps_per_epoch=STEPS_PER_EPOCH,
    epochs=epochs,
    validation_data=get_validation_dataset(ordered=True),
    validation_steps=eval_steps
)
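(get_training_dataset() and get_validation_dataset() come from the competition script and are omitted here. For context, a minimal sketch of the general shape of such a helper; GCS_PATH, BATCH_SIZE, and parse_tfrecord are hypothetical placeholders, not the script's actual names:)

def get_training_dataset():
    # Hypothetical sketch; the real helper lives in the Kaggle script.
    AUTO = tf.data.experimental.AUTOTUNE
    filenames = tf.io.gfile.glob(GCS_PATH + '/train/*.tfrec')
    dataset = tf.data.TFRecordDataset(filenames, num_parallel_reads=AUTO)
    dataset = dataset.map(parse_tfrecord, num_parallel_calls=AUTO)  # decode image + label
    dataset = dataset.repeat().shuffle(2048).batch(BATCH_SIZE)
    return dataset.prefetch(AUTO)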

For compiling inside strategy.scope()

with strategy.scope():

    flower_classifier = tf.keras.Sequential(
        [
            backend_model(weights='imagenet', include_top=False, input_shape=(*IMAGE_SIZE, 3)),
            tf.keras.layers.GlobalAveragePooling2D(),
            tf.keras.layers.Dense(len(CLASSES), activation='softmax', name='prob_dist')
        ]
    )

    # Compiled INSIDE strategy.scope()
    flower_classifier.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=0.00001),
        loss='sparse_categorical_crossentropy',
        metrics=['sparse_categorical_accuracy']
    )

For the subclassed model, it looks like this:

class Flower_Classifier(tf.keras.models.Model):

    def __init__(self, backend_model):
        super(Flower_Classifier, self).__init__()

        self.image_embedding_layer = backend_model(weights='imagenet', include_top=False, input_shape=(*IMAGE_SIZE, 3))
        self.pooling_layer = tf.keras.layers.GlobalAveragePooling2D()
        self.prob_dist_layer = tf.keras.layers.Dense(len(CLASSES), activation='softmax', name='prob_dist')

    def call(self, inputs, training=False):
        embedding = self.image_embedding_layer(inputs, training=training)
        pooling = self.pooling_layer(embedding)
        prob_dist = self.prob_dist_layer(pooling)

        return prob_dist
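The subclassed model is then created and compiled the same way; a minimal sketch, assuming the same strategy and constants as above:

with strategy.scope():
    flower_classifier = Flower_Classifier(ResNet152V2)

    flower_classifier.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=0.00001),
        loss='sparse_categorical_crossentropy',
        metrics=['sparse_categorical_accuracy']
    )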

Other info / logs

Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.

I ran the test on 3 models (DenseNet201, Xception, ResNet152V2). Each model was run 3 times for compiling both inside and outside strategy.scope(), each run being a training of 30 epochs. The following are the best 10 validation accuracies for each run.

DenseNet201:

    compile inside strategy.scope():

        run 1: val_acc = [0.9094828, 0.90921336, 0.90921336, 0.90894395, 0.90867454, 0.9081358, 0.9059806, 0.90113145, 0.90005386, 0.89897627]
        run 2: val_acc = [0.90894395, 0.90894395, 0.90678877, 0.9040948, 0.9038254, 0.9038254, 0.9016703, 0.9016703, 0.9005927, 0.90005386]
        run 3: val_acc = [0.907597, 0.9073276, 0.9070582, 0.90678877, 0.9065194, 0.904903, 0.9038254, 0.90328664, 0.90005386, 0.8981681]

    compile outside strategy.scope():

        run 1: val_acc = [0.9288793, 0.92672414, 0.92672414, 0.92672414, 0.92456895, 0.92456895, 0.92456895, 0.92456895, 0.92241377, 0.92241377]
        run 2: val_acc = [0.92672414, 0.92456895, 0.92456895, 0.92456895, 0.92241377, 0.92241377, 0.92025864, 0.92025864, 0.91810346, 0.91810346]
        run 3: val_acc = [0.93318963, 0.93318963, 0.9288793, 0.9288793, 0.92672414, 0.92672414, 0.92456895, 0.92241377, 0.92241377, 0.92025864]

ResNet152V2:

    compile inside strategy.scope():

        run 1: val_acc = [0.828125, 0.8270474, 0.8248922, 0.82435346, 0.82408404, 0.82408404, 0.82300645, 0.82300645, 0.8211207, 0.8200431]
        run 2: val_acc = [0.8278556, 0.8273168, 0.8257004, 0.8235453, 0.82327586, 0.82327586, 0.82192886, 0.8213901, 0.8189655, 0.8184267]
        run 3: val_acc = [0.828125, 0.8262392, 0.82543105, 0.82516164, 0.82462287, 0.8224677, 0.8221983, 0.81977373, 0.81869614, 0.8146552]

    compile outside strategy.scope():

        run 1: val_acc = [0.85775864, 0.85775864, 0.8512931, 0.8491379, 0.8469828, 0.8448276, 0.8426724, 0.8405172, 0.8405172, 0.83836204]
        run 2: val_acc = [0.8448276, 0.8426724, 0.8405172, 0.83836204, 0.8362069, 0.8340517, 0.8340517, 0.8340517, 0.83189654, 0.8275862]
        run 3: val_acc = [0.86422414, 0.86206895, 0.86206895, 0.85991377, 0.85775864, 0.85560346, 0.85560346, 0.85560346, 0.85560346, 0.8534483]

Xception:

    compile inside strategy.scope():

        run 1: val_acc = [0.8445582, 0.8418642, 0.83943963, 0.83432114, 0.83162713, 0.83081895, 0.82327586, 0.81869614, 0.8125, 0.80630386]
        run 2: val_acc = [0.8504849, 0.84886855, 0.8448276, 0.8445582, 0.83943963, 0.8356681, 0.8313578, 0.8213901, 0.81977373, 0.8127694]
        run 3: val_acc = [0.8507543, 0.8483297, 0.84428877, 0.83863145, 0.8370151, 0.8292026, 0.82273704, 0.8170797, 0.80953664, 0.8036099]

    compile outside strategy.scope():

        run 1: val_acc = [0.8836207, 0.8771552, 0.8728448, 0.8685345, 0.8663793, 0.86206895, 0.85991377, 0.8491379, 0.83836204, 0.8362069]
        run 2: val_acc = [0.8814655, 0.8771552, 0.875, 0.8728448, 0.8663793, 0.85991377, 0.85560346, 0.8491379, 0.8405172, 0.82974136]
        run 3: val_acc = [0.87068963, 0.86422414, 0.86206895, 0.85775864, 0.8512931, 0.8426724, 0.8426724, 0.8426724, 0.83189654, 0.8125]

Answer from tomerk

Hi @chiapas, it was a mix of issues:

  1. Metrics have to be used in the same strategy scope they were created in; otherwise it can cause issues with TPUStrategy (if you get this wrong, you only get the metric result from one replica instead of the overall metric aggregated across all of your replicas). Nightly now throws an error if it detects this (on TPUStrategy); it did not before.

  2. Even if a model was created in a distribution strategy scope (and captures that distribution strategy), compile wasn't entering the captured scope. Because compile sometimes creates metrics (if you pass in string names), this meant the metrics compile created were not in the model's distribution strategy scope.

We've now made it so that:

  • compile enters the distribution strategy scope the model was created in, allowing the metrics to be created with the correct distribution strategy
  • If you explicitly pass in a pre-constructed metric to compile that was created in the wrong scope, compile will raise an error.
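Concretely, after these changes a pattern like the following should be safe; a minimal sketch (CLASSES as in the report above, with the explicit metric object included only to illustrate the scope rule):

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(len(CLASSES), activation='softmax')
    ])
    # Metric objects must be created under the same strategy they are used with.
    acc = tf.keras.metrics.SparseCategoricalAccuracy()

# compile re-enters the strategy scope the model was created in, so metrics
# created from string names also land in the correct scope; a pre-built
# metric from the wrong scope now raises an error instead of silently
# reporting a single replica's result.
model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=[acc]
)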
