Yuefeng Zhou (yuefengz), Google, Mountain View, CA

sql-machine-learning/elasticdl 420

Kubernetes-native Deep Learning Framework

yuefengz/benchmarks 1

Benchmark code

alalei/NightyBird_Server 0

Server program for NightyBird App

yuefengz/community 0

Stores documents used by the TensorFlow developer community

yuefengz/ecosystem 0

Integration of TensorFlow with other open-source frameworks

yuefengz/models 0

Models and examples built with TensorFlow

yuefengz/tensorflow 0

An Open Source Machine Learning Framework for Everyone

issue comment tensorflow/tensorflow

Support synchronous training with parameter servers using Distribution Strategies

We will send out an RFC for PSStrategy soon.


comment created time in 5 days

issue comment tensorflow/tensorflow

MultiWorkerMirroredStrategy Performance is low (2gpu, 2node) X1.3 speed-up

Closing now. Feel free to reopen or file a new issue if you see obvious problems in your profile.


comment created time in 14 days

issue closed tensorflow/tensorflow

MultiWorkerMirroredStrategy Performance is low (2gpu, 2node) X1.3 speed-up

System information

Have I written custom code (as opposed to using a stock example script provided in TensorFlow):
OS Platform and Distribution: Ubuntu 18.04
TensorFlow installed from (source or binary): pip install tensorflow-gpu
TensorFlow version (use command below): 2.0
Python version: 3.6.9
CUDA/cuDNN version: 10/
GPU model and memory: Tesla P4, 8 GB

Describe the current behavior

I run the code described below:

TEST 1: (two machine)

os.environ['TF_CONFIG'] = json.dumps({ 'cluster': { 'worker': ["server1:12345", "server2:12345"] }, 'task': {'type': 'worker', 'index': 0} })

In the other machine

os.environ['TF_CONFIG'] = json.dumps({ 'cluster': { 'worker': ["server1:12345", "server2:12345"] }, 'task': {'type': 'worker', 'index': 1} })

When the script starts processing the first epoch, it crashes.

Describe the expected behavior

15 s/epoch is too slow


TEST 2: (one machine)

os.environ['TF_CONFIG'] = json.dumps({ 'cluster': { 'worker': ["server1:12345"] }, 'task': {'type': 'worker', 'index': 0} })

Describe the expected behavior

5 s/epoch, the same as using strategy = tf.distribute.MirroredStrategy() with one GPU card



import ssl
import os
import json
import argparse
import time

import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds

ssl._create_default_https_context = ssl._create_unverified_context

def configure_cluster(worker_hosts=None, task_index=-1):
    """Set multi-worker cluster spec in TF_CONFIG environment variable.

    Args:
      worker_hosts: comma-separated list of worker ip:port pairs.
      task_index: index of this worker in the cluster.

    Returns:
      Number of workers in the cluster.
    """
    tf_config = json.loads(os.environ.get('TF_CONFIG', '{}'))
    if tf_config:
        num_workers = len(tf_config['cluster'].get('worker', []))
    elif worker_hosts:
        workers = worker_hosts.split(',')
        num_workers = len(workers)
        if num_workers > 1 and task_index < 0:
            raise ValueError('Must specify task_index when number of workers > 1')
        task_index = 0 if num_workers == 1 else task_index
        os.environ['TF_CONFIG'] = json.dumps({
            'cluster': {
                'worker': workers
            },
            'task': {'type': 'worker', 'index': task_index}
        })
    else:
        num_workers = 1
    return num_workers

parser = argparse.ArgumentParser(description='TensorFlow Benchmark')
parser.add_argument('--num-epochs', type=int, default=5, help='number of training epochs')
parser.add_argument('--batch-size-per-replica', type=int, default=32, help='input batch size')
parser.add_argument('--worker-method', type=str, default="NCCL")
parser.add_argument('--worker-hosts', type=str, default="localhost:23456")
parser.add_argument('--worker-index', type=int, default=0)

args = parser.parse_args()

worker_num = configure_cluster(args.worker_hosts, args.worker_index)
batch_size = args.batch_size_per_replica * worker_num
print('Batch Size: %d' % batch_size)

gpus = tf.config.experimental.list_physical_devices('GPU')
print("Physical GPU Devices Num:", len(gpus))
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)

if args.worker_method == "AUTO":
    communication = tf.distribute.experimental.CollectiveCommunication.AUTO
elif args.worker_method == "RING":
    communication = tf.distribute.experimental.CollectiveCommunication.RING
else:
    communication = tf.distribute.experimental.CollectiveCommunication.NCCL

strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy(
    communication=communication)

# logical_gpus = tf.config.experimental.list_logical_devices('GPU')
# print("Logical GPU Devices Num:", len(logical_gpus))

def resize(image, label):
    image = tf.image.resize(image, [128, 128]) / 255.0
    return image, label

# if as_supervised is True, return (image, label) pairs
dataset, info = tfds.load("tf_flowers", split=tfds.Split.TRAIN, with_info=True, as_supervised=True)
dataset = dataset.map(resize).shuffle(1024).repeat().batch(batch_size)

# options = tf.data.Options()
# options.experimental_distribute.auto_shard = False
# dataset = dataset.with_options(options)

def build_and_compile_cnn_model():
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, [3, 3], activation='relu'),
        tf.keras.layers.Conv2D(64, [3, 3], activation='relu'),
        tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(info.features['label'].num_classes, activation='softmax')
    ])
    model.compile(
        loss='sparse_categorical_crossentropy',
        optimizer=tf.keras.optimizers.Adam(),
        metrics=['accuracy'])
    return model

with strategy.scope():
    multi_worker_model = build_and_compile_cnn_model()
print("Now training the distributed model")

class TimeHistory(tf.keras.callbacks.Callback):
    def on_train_begin(self, logs={}):
        self.times = []
        self.totaltime = time.time()

    def on_train_end(self, logs={}):
        self.totaltime = time.time() - self.totaltime

    def on_epoch_begin(self, batch, logs={}):
        self.epoch_time_start = time.time()

    def on_epoch_end(self, batch, logs={}):
        self.times.append(time.time() - self.epoch_time_start)

time_callback = TimeHistory()
steps_per_epoch = 100
print('Running benchmark...')
multi_worker_model.fit(dataset, steps_per_epoch=steps_per_epoch, epochs=args.num_epochs, callbacks=[time_callback])
per_epoch_time = np.mean(time_callback.times[1:])
print("per_epoch_time:", per_epoch_time)
img_sec = batch_size * steps_per_epoch / per_epoch_time
print("Result:  {:.1f} pic/sec".format(img_sec))

In TEST 2: only 1 worker, 440 pic/sec (batch_size = 128)

In TEST 1: 2 workers, 610 pic/sec (batch_size = 128*2) [Expect 440 * 2 = 800+]
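For context, the reported throughputs work out to roughly 1.39x scaling over one worker, or about 69% of ideal 2x. A quick sanity check on the arithmetic, using only the numbers quoted above:

```python
# Throughputs quoted in the issue: TEST 2 (1 worker) and TEST 1 (2 workers).
single_worker = 440.0  # pic/sec with 1 worker
two_workers = 610.0    # pic/sec with 2 workers

speedup = two_workers / single_worker  # observed speedup over 1 worker
efficiency = speedup / 2.0             # fraction of ideal linear (2x) scaling

print(f"speedup: {speedup:.2f}x")       # speedup: 1.39x
print(f"efficiency: {efficiency:.0%}")  # efficiency: 69%
```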

Question 1: with MultiWorkerMirroredStrategy and more than one worker, why is training so slow?


closed time in 14 days


issue comment tensorflow/tensorflow

MultiWorkerMirroredStrategy Performance is low (2gpu, 2node) X1.3 speed-up

There are many reasons why your model can be slow: networking, reading data, thread contention, etc. You can profile your program to see which part is the bottleneck.
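As a generic starting point for that kind of investigation (not TF-specific; the TensorBoard profiler gives a more detailed timeline for TF programs), Python's built-in cProfile can show where wall time goes. The train_step below is a stand-in, not part of the original script:

```python
import cProfile
import io
import pstats

def train_step():
    # Stand-in for one training step; replace with the real step function.
    return sum(i * i for i in range(10000))

profiler = cProfile.Profile()
profiler.enable()
for _ in range(100):
    train_step()
profiler.disable()

# Report the most expensive functions by cumulative time.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
```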


comment created time in 20 days

issue comment tensorflow/tensorflow

Enable SO_REUSEPORT option in tensorflow training server

Skipping the start of standard servers is not supported for MultiWorkerMirroredStrategy; the server is started by the context object. @haoyuz can probably share a workaround, if there is one.


comment created time in 2 months

issue comment tensorflow/tensorflow

Enable SO_REUSEPORT option in tensorflow training server

I am curious why you want to start the servers upfront. Why not let distribution strategies create the standard TF servers?


comment created time in 2 months

PR opened tensorflow/tensorflow

Allow evaluator not in cluster_spec, to be consistent with legacy Estimator

PiperOrigin-RevId: 281833366 Change-Id: Ic580172ba5ec038e246028031ec277b18f31ea56

+32 -3

0 comment

2 changed files

pr created time in 3 months

create branch yuefengz/tensorflow

branch : cherrypicks_88YC0

created branch time in 3 months

fork yuefengz/tensorflow

An Open Source Machine Learning Framework for Everyone

fork in 3 months


started time in 4 months

issue comment tensorflow/tensorflow

Allow device groups / CPU in MirroredStrategy

The model parallelism project is on our radar, and we have been actively working on it.


comment created time in 4 months

issue comment tensorflow/tensorflow

Keras is extremely slow under MirroredStrategy context when keras.layers.BatchNormalization is in use.

@bearsroom Based on what you said, it looks like the allreduces of the variables caused the slowness. I found another solution to temporarily bypass this problem: only save the first replica of the model. I think it may be worthwhile to get a profile to see why these allreduces are slow.


comment created time in 5 months

pull request comment tensorflow/tensorflow

Add gradient accumulation support for distribution strategy

Thanks! I also believe this would fit well as an add-on. cc @facaiy


comment created time in 5 months

pull request comment tensorflow/tensorflow

Add gradient accumulation support for distribution strategy

Thank you for your PR! I believe this is an important feature to have as well.

I have a couple of suggestions, out of concern for keeping our API stable. It looks to me that the logic is not deeply coupled with the optimizer. Is it possible for you to put gradient accumulation in a separate wrapper optimizer class? Also, since we are promoting TF 2.0, it would be more future-proof to make the wrapper a subclass of optimizer v2 and to make it eager-compatible.
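To illustrate the wrapper-optimizer suggestion, here is a framework-agnostic sketch of the pattern (the names AccumulatingOptimizer and SGD, and the apply_gradients interface on plain Python lists, are hypothetical illustrations, not TF API): the wrapper buffers gradients and only delegates to the inner optimizer every accum_steps calls.

```python
class SGD:
    """Toy inner optimizer: plain gradient descent on Python lists."""
    def __init__(self, lr=0.1):
        self.lr = lr

    def apply_gradients(self, grads, params):
        for i, g in enumerate(grads):
            params[i] -= self.lr * g

class AccumulatingOptimizer:
    """Hypothetical wrapper: buffer grads, step every `accum_steps` calls."""
    def __init__(self, inner, accum_steps):
        self.inner = inner
        self.accum_steps = accum_steps
        self._buffer = None
        self._count = 0

    def apply_gradients(self, grads, params):
        if self._buffer is None:
            self._buffer = [0.0] * len(grads)
        for i, g in enumerate(grads):
            self._buffer[i] += g
        self._count += 1
        if self._count == self.accum_steps:
            # Apply the mean of the accumulated gradients, then reset.
            mean = [b / self.accum_steps for b in self._buffer]
            self.inner.apply_gradients(mean, params)
            self._buffer = None
            self._count = 0

params = [1.0]
opt = AccumulatingOptimizer(SGD(lr=0.5), accum_steps=2)
opt.apply_gradients([0.2], params)  # buffered, no update yet
opt.apply_gradients([0.4], params)  # mean grad 0.3 applied: 1.0 - 0.5 * 0.3
print(params)  # [0.85]
```

Because the accumulation logic lives entirely in the wrapper, any inner optimizer with the same apply_gradients interface can be swapped in unchanged, which is the decoupling suggested above.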


comment created time in 5 months

issue comment tensorflow/tensorflow

Low GPU usage of RNN layers under MirroredStrategy

Could you try with the latest nightly to see whether the problem is still there?


comment created time in 6 months

issue comment tensorflow/tensorflow

About `Evaluator` in TF_CONFIG

Another example would look like:

cluster = {'chief': ['host0:2222'],
           'evaluator': ['host6:2222'],
           'ps': ['host1:2222', 'host2:2222'],
           'worker': ['host3:2222', 'host4:2222', 'host5:2222']}
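For reference, each process would typically serialize such a cluster dict into the TF_CONFIG environment variable together with its own task entry (a minimal stdlib-only sketch; the host names are placeholders):

```python
import json
import os

cluster = {'chief': ['host0:2222'],
           'evaluator': ['host6:2222'],
           'ps': ['host1:2222', 'host2:2222'],
           'worker': ['host3:2222', 'host4:2222', 'host5:2222']}

# Every process sets the same cluster spec but its own task type/index.
os.environ['TF_CONFIG'] = json.dumps({
    'cluster': cluster,
    'task': {'type': 'worker', 'index': 1},  # e.g. the second worker
})

parsed = json.loads(os.environ['TF_CONFIG'])
print(parsed['task'])                    # {'type': 'worker', 'index': 1}
print(len(parsed['cluster']['worker']))  # 3
```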

comment created time in 6 months