
dubey/weaver 215

A scalable, fast, consistent graph store

dubey/research_trends 7

Visualization tool for CS research

tfboyd/models 1

Models built with TensorFlow

dubey/benchmarks 0

Benchmark code

dubey/crosstex 0

CrossTeX is a BibTeX replacement, with better citation and bibliographic database support.

dubey/cs6210-f16 0

Course repository for Cornell CS 6210, Fall 2016

dubey/models 0

Models and examples built with TensorFlow

dubey/neo4j-shell-tools 0

A bunch of import/export tools for the neo4j-shell

dubey/protobuf 0

Protocol Buffers - Google's data interchange format

dubey/tensorflow 0

Computation using data flow graphs for scalable machine learning

issue closed tensorflow/tensorflow

performance issue in CollectiveAllReduceStrategy

I tried both CollectiveAllReduceStrategy and MirroredStrategy on the same 8-GPU machine, and CollectiveAllReduceStrategy is consistently ~30% slower than MirroredStrategy, no matter how many GPUs I use.

The tensorflow version I am running is 1.12.0.

I am using the Estimator API and have tried many different models, including ResNet-50, ResNet-34, and VGG.

Is this expected, or do I need some configuration to make CollectiveAllReduceStrategy faster? Thanks.

closed time in a day

ustcyue

issue comment tensorflow/tensorflow

performance issue in CollectiveAllReduceStrategy

Closing due to inactivity; please reopen if you have updates or more questions.

ustcyue

comment created time in a day

issue comment tensorflow/tensorflow

MultiWorkerMirroredStrategy Keras Example Hangs

cc @ckkuang

sarthfrey-db

comment created time in 4 days

issue comment tensorflow/tensorflow

MultiWorkerMirroredStrategy Keras Example Hangs

I'm not sure but I think this could be originating from https://github.com/tensorflow/tensorflow/blob/4c9f77709aa612058434d91984e0b7b7e3ef5774/tensorflow/core/nccl/nccl_manager.cc#L51.

Can you rerun with the NCCL_DEBUG environment variable set to INFO?

sarthfrey-db

comment created time in a month

issue comment tensorflow/tensorflow

XLA drops performance across the nodes

Thanks for the clarification @vilmara.

In general, scaling across nodes may be lower because of the lower-bandwidth interconnect. In this case, since the model is not too large, it may be possible to tune some parameters to get better performance.

For NCCL, the number of rings and the number of threads per socket may be useful knobs. I'm not familiar with Horovod tuning; @byronyi may have more ideas, since I think he's used NCCL over IB.

If you're still stuck, I suggest reaching out to Horovod and/or NCCL owners.
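To make the interconnect point concrete, here is a rough cost-model sketch (the bandwidth figures below are illustrative assumptions, not measurements): an ideal ring all-reduce moves about 2*(N-1)/N of the gradient bytes per worker over the slowest link, so the slowest link dominates cross-node scaling.

```python
def allreduce_seconds(total_bytes, num_workers, link_bytes_per_sec):
    """Ideal ring all-reduce time: each worker sends and receives
    2 * (N - 1) / N of the payload over the slowest link."""
    moved = 2 * (num_workers - 1) / num_workers * total_bytes
    return moved / link_bytes_per_sec

grad_bytes = 100e6  # ~100 MB of gradients, roughly ResNet-50-sized

# Illustrative link speeds (assumptions, not benchmarks):
links = {
    "NVLink (~25 GB/s)": 25e9,
    "100 Gb/s IB (~12.5 GB/s)": 12.5e9,
    "10 GbE (~1.25 GB/s)": 1.25e9,
}
for name, bw in links.items():
    t = allreduce_seconds(grad_bytes, 8, bw)
    print(f"{name}: {t * 1e3:.1f} ms per all-reduce")
```

With these numbers the per-step communication time differs by an order of magnitude between NVLink and plain Ethernet, which is why the topology question matters.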

vilmara

comment created time in a month

issue comment tensorflow/tensorflow

XLA drops performance across the nodes

Hang on: from @vilmara's comments it seems that 4 GPUs means one node and 8 GPUs means "across nodes", but @byronyi is claiming that the 8-GPU run is on a single node.

Can we please clarify the exact topology with 4 and 8 GPU runs?

vilmara

comment created time in a month

issue comment tensorflow/tensorflow

XLA drops performance across the nodes

Can you describe the cluster interconnect? I can imagine things slowing down if you have NVLink within hosts but something much slower, like TCP over Ethernet, across hosts.

vilmara

comment created time in a month

issue comment NVIDIA/nccl

Potential memory leak in graph/paths.cc

Thanks! Yes, it fixes the leak.

sanjoy

comment created time in a month

issue comment NVIDIA/nccl

Potential memory leak in graph/paths.cc

It would be good to get a fix soon, because we would like to include NCCL 2.5.6 in the next TF release.

The issue impacts our tests, which create multiple communicators and run leak checkers automatically. I'd rather not disable the test.

Do you have an ETA for the fix?

sanjoy

comment created time in a month

issue comment tensorflow/tensorflow

Lack of dataset length or cardinality causes `BaseCollectiveExecutor::StartAbort Out of range` issues

Yes, you can ignore the warning. The fix I submitted is available in nightly and should be a part of the next release.
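As background (this is an analogy of mine, not TensorFlow internals verbatim): the OUT_OF_RANGE status plays the same role for tf.data iterators that StopIteration plays for Python iterators, i.e. it is the normal end-of-input signal, which is why the warning is safe to ignore. A minimal pure-Python sketch of the pattern:

```python
def drain(iterator):
    """Consume an iterator until its end-of-sequence signal.

    Python signals exhaustion with StopIteration; TensorFlow's runtime
    signals it with an OUT_OF_RANGE status, which some code paths log
    even though it is the normal way a finite dataset ends."""
    results = []
    while True:
        try:
            results.append(next(iterator))
        except StopIteration:
            # Normal termination, not an error; analogous to the benign
            # "StartAbort Out of range" log message.
            break
    return results

print(drain(iter([1, 2, 3])))  # -> [1, 2, 3]
```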

00krishna

comment created time in a month

issue comment tensorflow/tensorflow

Lack of dataset length or cardinality causes `BaseCollectiveExecutor::StartAbort Out of range` issues

The collective executor warning probably comes from: https://github.com/tensorflow/tensorflow/blob/bb45024ae9d3df0127d1c1056b08f25e60ba601c/tensorflow/core/common_runtime/base_collective_executor.cc#L217 which is called from: https://github.com/tensorflow/tensorflow/blob/bb45024ae9d3df0127d1c1056b08f25e60ba601c/tensorflow/core/common_runtime/executor.cc#L2289

We can lower the logging level if it causes confusion for users.

00krishna

comment created time in 2 months

issue closed tensorflow/tensorflow

how to run test case in ring_reducer_test.cc

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): CentOS Linux release 7.6.1810 (Core)
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: N/A
  • TensorFlow installed from (source or binary): source
  • TensorFlow version: 1.14.0
  • Python version: 3.6.8
  • Installed using virtualenv? pip? conda?: pip
  • Bazel version (if compiling from source): 0.25.0
  • GCC/Compiler version (if compiling from source): 7.3.1
  • CUDA/cuDNN version: CUDA 10.0, cuDNN 7.5.0.56
  • GPU model and memory: Tesla V100, 32 GB

I want to run the test cases in tensorflow/core/common_runtime/ring_reducer_test.cc, so I tried this:

bazel test -c opt --config=cuda //tensorflow/core:ring_reducer_test

but bazel gives me the following error message:

debian ~/collective/tensorflow $ bazel test -c opt --config=cuda //tensorflow/core:ring_reducer_test
WARNING: The following configs were expanded more than once: [cuda]. For repeatable flags, repeats are counted twice and may lead to unexpected behavior.
ERROR: Skipping '//tensorflow/core:ring_reducer_test': no such target '//tensorflow/core:ring_reducer_test': target 'ring_reducer_test' not declared in package 'tensorflow/core' defined by /home/zxy/collective/tensorflow/tensorflow/core/BUILD
ERROR: no such target '//tensorflow/core:ring_reducer_test': target 'ring_reducer_test' not declared in package 'tensorflow/core' defined by /home/zxy/collective/tensorflow/tensorflow/core/BUILD
INFO: Elapsed time: 0.212s
INFO: 0 processes.
FAILED: Build did NOT complete successfully (0 packages loaded)
FAILED: Build did NOT complete successfully (0 packages loaded)

bazel thinks that ring_reducer_test does not exist in /home/zxy/collective/tensorflow/tensorflow/core/BUILD, but I can find it there. Can you help me run the test cases in ring_reducer_test.cc?

closed time in 2 months

Keepmoving-ZXY

issue comment tensorflow/tensorflow

how to run test case in ring_reducer_test.cc

ring_reducer_test is defined in core/BUILD using tf_cc_tests_gpu. If you follow the definition of tf_cc_tests_gpu in tensorflow.bzl, you will eventually arrive at src_to_test_name, which contains the logic for which you're looking.
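For reference, the mapping is simple; this is a pure-Python approximation of src_to_test_name (reconstructed from memory of tensorflow.bzl, so treat it as a sketch rather than the exact macro):

```python
def src_to_test_name(src):
    """Approximate Bazel's src_to_test_name: drop the file extension
    and turn path separators into underscores to form the target name."""
    return src.replace("/", "_").split(".")[0]

print(src_to_test_name("common_runtime/ring_reducer_test.cc"))
# -> common_runtime_ring_reducer_test
```

This is why the buildable target is //tensorflow/core:common_runtime_ring_reducer_test rather than //tensorflow/core:ring_reducer_test.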

Keepmoving-ZXY

comment created time in 2 months

issue comment tensorflow/tensorflow

how to run test case in ring_reducer_test.cc

Can you try bazel test -c opt --config=cuda //tensorflow/core:common_runtime_ring_reducer_test?

Keepmoving-ZXY

comment created time in 2 months

issue comment tensorflow/tensorflow

Run multi-worker with nccl error: NET/IB : collective mismatch error

I don't have access to IB. @hustcat, it would be good to understand whether setting NCCL_IB_DISABLE=1, as @byronyi suggests, fixes the issue. If you also run with TF_CPP_VMODULE="nccl_manager=2", you will get additional logs from TF that can help debug this issue.
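Concretely, the suggested debugging environment looks like this; the variable names come from the discussion above (NCCL_DEBUG=INFO is NCCL's own diagnostic switch), and "train.py" is a placeholder for the actual multi-worker script:

```python
import os
import subprocess

# Debug settings suggested in the thread:
env = dict(os.environ)
env["NCCL_IB_DISABLE"] = "1"              # fall back from InfiniBand to sockets
env["TF_CPP_VMODULE"] = "nccl_manager=2"  # extra TF-side NCCL manager logs
env["NCCL_DEBUG"] = "INFO"                # NCCL's own diagnostics

# "train.py" is a placeholder for the real multi-worker training script:
# subprocess.run(["python", "train.py"], env=env, check=True)
print(sorted(k for k in env if k.startswith(("NCCL", "TF_CPP"))))
```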

hustcat

comment created time in 3 months

Pull request review comment tensorflow/tensorflow

Fix bug in NCCL broadcast wrapper

 void NcclManager::LoopKernelLaunches(NcclStream* nccl_stream) {
         if (p->output) {
           recvbuff = const_cast<char*>(p->output->tensor_data().data());
           num_elements = p->output->NumElements();
+        } else {
+          // Operate in-place if no output (for the src node).
+          recvbuff = const_cast<void*>(sendbuff);

Ah yes, I missed that you are casting sendbuff and not tensor_data.data().

benbarsdell

comment created time in 3 months

Pull request review comment tensorflow/tensorflow

Fix bug in NCCL broadcast wrapper

 void NcclManager::LoopKernelLaunches(NcclStream* nccl_stream) {
         if (p->output) {
           recvbuff = const_cast<char*>(p->output->tensor_data().data());
           num_elements = p->output->NumElements();
+        } else {
+          // Operate in-place if no output (for the src node).
+          recvbuff = const_cast<void*>(sendbuff);

nit: could you change this to const_cast<char*> to be consistent with the rest of this file?

benbarsdell

comment created time in 3 months

Pull request review comment tensorflow/tensorflow

Update collective op to enable polymorphic output shape

 def testCollectiveGatherShapeMismatchAcrossDevices(self):
                                   'Shape mismatch'):
        sess.run([c0, c1], options=run_options)
+
+  @test_util.run_deprecated_v1
+  def testCollectiveGatherPolymorphicShape(self):
+    t0 = [0, 1, 2, 3, 4, 5, 6, 7]
+    t1 = [10, 11, 12, 13, 14, 15, 16, 17]
+    t01 = [0, 1, 2, 3, 4, 5, 6, 7, 10, 11, 12, 13, 14, 15, 16, 17]

nit: can you rename these variables to be more descriptive? Perhaps input0, input1, expected_output0, expected_output1.

pw2393

comment created time in 3 months

issue comment tensorflow/tensorflow

Simultaneous fetching of collective_ops and global_step triggers "Skipping rendezvous re-initialization"

I believe the "skipping rendezvous re-initialization" message was benign and has been fixed in bcb615c42a6215037ed7f1c81316bc0960e76269. Can you try to rerun your program with a nightly build to confirm?

whhu

comment created time in 4 months

PR opened tensorflow/benchmarks

Change tf-nightly-gpu-2.0-preview to tf-nightly-gpu.

https://groups.google.com/a/google.com/d/msg/tensorflow-team/JLh_OgJ9GrY/Rpx-gYisCwAJ

+2 -2

0 comments

1 changed file

pr created time in 5 months

push event dubey/benchmarks

Ayush Dubey

commit sha 837982fa492cb4c9fe9514d2ad1be1aab0cf37de

fix


push time in 5 months

create branch dubey/benchmarks

branch: pip_update

created branch time in 5 months

push event dubey/benchmarks

Ayush Dubey

commit sha 509b9d288937216ca7069f31cfb22aaa7db6a4a7

Run `pip3 instal wheel` before installing other packages. (#401)


Toby Boyd

commit sha 880916403f7745955bb5930db74d07340de37586

add min-cpu-platform (#410)


Toby Boyd

commit sha a6ab409fe241971c9c343ef3f186bfd6ce6d8803

upgrade pip and move to cudnn 7.6 and new tensorrt (#411)


Toby Boyd

commit sha b44a111397be721cad23bd2f493e83aab4b5544c

Upgrade build docker to cuDNN 7.6.0 and remove stale DockerFiles (#412)


Taylor Robie

commit sha 7ca37cbb842bc7ccf2d64babbc677ae24a250408

allow arbitrary pip specs in PerfZero (#413)


Taylor Robie

commit sha 9eaf975e7eeef4d73f75c6a992e272637ffb847f

remove sed logic (#414)


Toby Boyd

commit sha 3e96e00a37efb2bd0147974d938c74659a60c9d2

Add boot_ssd_size to have larger boot disks (#415)


Toby Boyd

commit sha 0a23789ca5e9e7a018357bbf77638fd4006eea6d

Temp cuDNN 7_6_2. Move to main after testing. (#416)


Toby Boyd

commit sha 4bb58d3b9b7014a9bd3f67f028b485e854767db8

Upgrade main dockers to cuDNN 7.6.2 (#417)
* Move main docker to cuDNN 7.6.2
* cuDNN 7.6.2 upgrade


sganeshb

commit sha d4d33cd0dd53051147fde562970223b5742ebda7

Add support to build local .whl files (#421)
* Update setup.py to handle local wheel files
* list current contents in the context.
* Add local_file_support
* local file support
* Add missing $ sign
* Update setup.py
* Cleanup unused arg.
* Add comments
* Add a util function to create an empty file.
* Update utils.py
* Add param for local_tensorflow_pip_spec
* Update setup.py
* Update Dockerfile_ubuntu_1804_tf_v1
* Update Dockerfile_ubuntu_1804_tf_custom_pip
* Update Dockerfile_ubuntu_1804_build
* Update setup.py
* Update setup.py
* Update utils.py
* Update setup.py


sganeshb

commit sha c58c7bf6cd396f906ca20fa2db634332a2bb6086

Skip building docker if --dockerfile_path="" (#426)
* Update perfzero_config.py
* respect --enable_docker_setup
* update lint
* Update setup.py
* Update perfzero_config.py
* Update setup.py
* Update perfzero_config.py
* Update perfzero_config.py
* Update perfzero_config.py


sganeshb

commit sha d576c3b7e8fdfa4e06d3497b6f4d3f512e1cd10c

add --docker_tag as a flag
Also fix inconsistent quote with --extra_pip_specs


sganeshb

commit sha 1ed81ed10fc410adb2a66aa1a519d11a3ba9546b

Allow setting the docker tag from a flag --docker_tag (#427)
* Use FLAGS.docker_tag
* add --docker_tag as a flag


sganeshb

commit sha abb1aec2f2db4ba73fac2e1359227aef59b10258

Add option to load a docker image if provided. (#428)
* Add option to set decompress=false for .gz images.
* Enable loading docker image from --dockerfile_path
  Currently --dockerfile_path points to a Dockerfile to save images. If --dockerfile_path points to <file>.tar.gz, load the docker instead treating it as an image. Also if --dockerfile_path is set to point to gs:// or tensorflow_pip_spec is set to gs:// we need to download and activate gcloud auth.
* Error check for dockerfile_path


push time in 5 months

push event dubey/benchmarks

Ayush Dubey

commit sha 4f8045a434802ae5bff73dddf99c3cf68aa46222

Fix invocation to gcloud beta compute (#392)


Toby Boyd

commit sha 1e3327b66248d084ea184efe583722fd30f1b637

fix install tf from http:// location. (#393)


Toby Boyd

commit sha 8f6567872f89fc139f34ad258aac23164006db5c

cuDNN 7.6 docker. (#394)


Toby Boyd

commit sha 0b0daf11728ed8a4dfc284df9ff6ef4eca18d0b2

Don't fail if nvidia-smi is not found. (#395)


Toby Boyd

commit sha 0080d2ccf009ede5bf53daed2fe7676b3c818c71

Add mock needed by unit tests (#399)


push time in 5 months

issue comment tensorflow/tensorflow

tf.distribute.MirroredStrategy leads to an infinite polling cycle with 4 GPUs

Thanks for posting the update! It may help others who run into a similar issue.

vmarkovtsev

comment created time in 5 months

issue closed tensorflow/tensorflow

tf.distribute.MirroredStrategy leads to an infinite polling cycle with 4 GPUs

System information

A physical tower with 4 GPUs running Ubuntu 18.04 over Kubernetes

  • 256 GB of RAM
  • TensorFlow: tested on tf-nightly-gpu-2.0-preview==2.0.0.dev20190902 to tf-nightly-gpu-2.0-preview==2.0.0.dev20190918
  • Python 3.6.8
  • CUDA 10.0, cuDNN 7.6.3.30 (also tested with cuDNN 7.5.0.56)
  • NVIDIA GTX 1080

<details> <summary>nvidia-smi</summary>
<pre>
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.78       Driver Version: 410.78       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:02:00.0 Off |                  N/A |
| 53%   70C    P2    79W / 250W | 10889MiB / 11178MiB  |     100%     Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:03:00.0 Off |                  N/A |
| 52%   69C    P2    76W / 250W | 10893MiB / 11178MiB  |     100%     Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 108...  Off  | 00000000:82:00.0 Off |                  N/A |
| 48%   65C    P2    78W / 250W | 10889MiB / 11178MiB  |     100%     Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 108...  Off  | 00000000:83:00.0 Off |                  N/A |
| 45%   62C    P2    76W / 250W | 10893MiB / 11178MiB  |     100%     Default |
+-------------------------------+----------------------+----------------------+
</pre>
</details>

Problem

I run the following sample code:

#!/usr/bin/env python3
import sys
import tensorflow as tf


def main():
    batch_size = 12
    features_shape = 372, 558, 3
    labels = 10
    sample = tf.random.uniform(features_shape)

    def with_shape(t, shape):
        t = tf.squeeze(t)
        t.set_shape(shape)
        return t

    ds_train = tf.data.Dataset.from_tensors([sample]).map(lambda s: (s, tf.ones((labels,)))) \
        .repeat().batch(batch_size).map(lambda s, l: (with_shape(s, (batch_size,) + features_shape),
                                                      with_shape(l, (batch_size, labels))))
    ds_val = tf.data.Dataset.from_tensors([sample]).map(lambda s: (s, tf.ones((labels,)))) \
        .repeat().batch(batch_size).take(10).map(
        lambda s, l: (with_shape(s, (batch_size,) + features_shape), with_shape(l, (batch_size, labels))))
    with tf.distribute.MirroredStrategy().scope():
        model = tf.keras.applications.DenseNet121(
            weights=None, input_shape=features_shape, classes=labels)
        model.build((batch_size,) + features_shape)
        model.summary()
        optimizer = tf.keras.optimizers.RMSprop(learning_rate=0.001)
        cross_entropy = tf.keras.losses.CategoricalCrossentropy(label_smoothing=0.1)
        model.compile(optimizer=optimizer, loss=cross_entropy, metrics=["accuracy"])
    model.fit(ds_train, validation_data=ds_val, epochs=1, steps_per_epoch=100)


if __name__ == "__main__":
    sys.exit(main())

It outputs the following log and then hangs for at least 9 hours (at which point I killed it):

<details> <summary>log</summary> <pre> 2019-09-19 11:22:16.548532: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (3): GeForce GTX 1080 Ti, Compute Capability 6.1 2019-09-19 11:22:16.553080: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1632] Found device 0 with properties: name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.62 pciBusID: 0000:02:00.0 2019-09-19 11:22:16.554064: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1632] Found device 1 with properties: name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.62 pciBusID: 0000:03:00.0 2019-09-19 11:22:16.555051: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1632] Found device 2 with properties: name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.62 pciBusID: 0000:82:00.0 2019-09-19 11:22:16.555890: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1632] Found device 3 with properties: name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.62 pciBusID: 0000:83:00.0 2019-09-19 11:22:16.556021: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamiclibrary libcudart.so.10.0 2019-09-19 11:22:16.556046: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamiclibrary libcublas.so.10.0 2019-09-19 11:22:16.556062: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamiclibrary libcufft.so.10.0 2019-09-19 11:22:16.556079: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamiclibrary libcurand.so.10.0 2019-09-19 11:22:16.556095: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamiclibrary libcusolver.so.10.0 2019-09-19 11:22:16.556111: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamiclibrary libcusparse.so.10.0 2019-09-19 11:22:16.556127: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] 
Successfully opened dynamiclibrary libcudnn.so.7 2019-09-19 11:22:16.562745: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1760] Adding visible gpu devices: 0, 1, 2, 3 2019-09-19 11:22:16.562815: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamiclibrary libcudart.so.10.0 2019-09-19 11:22:16.566634: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1173] Device interconnect StreamExecutorwith strength 1 edge matrix: 2019-09-19 11:22:16.566650: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1179] 0 1 2 3 2019-09-19 11:22:16.566657: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1192] 0: N Y N N 2019-09-19 11:22:16.566661: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1192] 1: Y N N N 2019-09-19 11:22:16.566666: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1192] 2: N N N Y 2019-09-19 11:22:16.566670: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1192] 3: N N Y N 2019-09-19 11:22:16.571630: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1318] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10470 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:02:00.0, compute capability: 6.1) 2019-09-19 11:22:16.573706: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1318] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10470 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:03:00.0, compute capability: 6.1) 2019-09-19 11:22:16.575382: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1318] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 10470 MB memory) -> physical GPU (device: 2, name: GeForce GTX 1080 Ti, pci bus id: 0000:82:00.0, compute capability: 6.1) 2019-09-19 11:22:16.576566: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1318] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 10470 MB memory) -> physical GPU 
(device: 3, name: GeForce GTX 1080 Ti, pci bus id: 0000:83:00.0, compute capability: 6.1) WARNING:tensorflow:Entity <function main.<locals>.<lambda> at 0x7fe776f021e0> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, export AUTOGRAPH_VERBOSITY=10) and attach the full output. Cause: expected exactly one node node, found [] 2019-09-19 11:22:17.393146: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1632] Found device 0 with properties: name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.62 pciBusID: 0000:02:00.0 2019-09-19 11:22:17.394380: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1632] Found device 1 with properties: name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.62 pciBusID: 0000:03:00.0 2019-09-19 11:22:17.395221: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1632] Found device 2 with properties: name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.62 pciBusID: 0000:82:00.0 2019-09-19 11:22:17.396088: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1632] Found device 3 with properties: name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.62 pciBusID: 0000:83:00.0 2019-09-19 11:22:17.396168: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamiclibrary libcudart.so.10.0 2019-09-19 11:22:17.396202: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamiclibrary libcublas.so.10.0 2019-09-19 11:22:17.396218: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamiclibrary libcufft.so.10.0 2019-09-19 11:22:17.396233: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamiclibrary libcurand.so.10.0 2019-09-19 11:22:17.396263: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamiclibrary libcusolver.so.10.0 
2019-09-19 11:22:17.396278: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamiclibrary libcusparse.so.10.0 2019-09-19 11:22:17.396293: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamiclibrary libcudnn.so.7 2019-09-19 11:22:17.402450: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1760] Adding visible gpu devices: 0, 1, 2, 3 2019-09-19 11:22:17.402599: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1173] Device interconnect StreamExecutorwith strength 1 edge matrix: 2019-09-19 11:22:17.402611: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1179] 0 1 2 3 2019-09-19 11:22:17.402619: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1192] 0: N Y N N 2019-09-19 11:22:17.402625: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1192] 1: Y N N N 2019-09-19 11:22:17.402631: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1192] 2: N N N Y 2019-09-19 11:22:17.402637: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1192] 3: N N Y N 2019-09-19 11:22:17.407338: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1318] Created TensorFlow device (/device:GPU:0 with 10470 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:02:00.0, compute capability: 6.1) 2019-09-19 11:22:17.408425: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1318] Created TensorFlow device (/device:GPU:1 with 10470 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:03:00.0, compute capability: 6.1) 2019-09-19 11:22:17.409430: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1318] Created TensorFlow device (/device:GPU:2 with 10470 MB memory) -> physical GPU (device: 2, name: GeForce GTX 1080 Ti, pci bus id: 0000:82:00.0, compute capability: 6.1) 2019-09-19 11:22:17.410293: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1318] Created TensorFlow device (/device:GPU:3 with 10470 MB memory) -> physical GPU (device: 3, name: GeForce GTX 1080 
Ti, pci bus id: 0000:83:00.0, compute capability: 6.1) Model: "densenet121"


Layer (type) Output Shape Param # Connected to

input_1 (InputLayer) [(None, 372, 558, 3) 0


zero_padding2d (ZeroPadding2D) (None, 378, 564, 3) 0 input_1[0][0]


conv1/conv (Conv2D) (None, 186, 279, 64) 9408 zero_padding2d[0][0]


conv1/bn (BatchNormalization) (None, 186, 279, 64) 256 conv1/conv[0][0]


conv1/relu (Activation) (None, 186, 279, 64) 0 conv1/bn[0][0]


zero_padding2d_1 (ZeroPadding2D (None, 188, 281, 64) 0 conv1/relu[0][0]


pool1 (MaxPooling2D) (None, 93, 140, 64) 0 zero_padding2d_1[0][0]


conv2_block1_0_bn (BatchNormali (None, 93, 140, 64) 256 pool1[0][0]


conv2_block1_0_relu (Activation (None, 93, 140, 64) 0 conv2_block1_0_bn[0][0]


conv2_block1_1_conv (Conv2D) (None, 93, 140, 128) 8192 conv2_block1_0_relu[0][0]


conv2_block1_1_bn (BatchNormali (None, 93, 140, 128) 512 conv2_block1_1_conv[0][0]


conv2_block1_1_relu (Activation (None, 93, 140, 128) 0 conv2_block1_1_bn[0][0]


conv2_block1_2_conv (Conv2D) (None, 93, 140, 32) 36864 conv2_block1_1_relu[0][0]


conv2_block1_concat (Concatenat (None, 93, 140, 96) 0 pool1[0][0] conv2_block1_2_conv[0][0]


conv2_block2_0_bn (BatchNormali (None, 93, 140, 96) 384 conv2_block1_concat[0][0]


conv2_block2_0_relu (Activation (None, 93, 140, 96) 0 conv2_block2_0_bn[0][0]


conv2_block2_1_conv (Conv2D) (None, 93, 140, 128) 12288 conv2_block2_0_relu[0][0]


conv2_block2_1_bn (BatchNormali (None, 93, 140, 128) 512 conv2_block2_1_conv[0][0]


conv2_block2_1_relu (Activation (None, 93, 140, 128) 0 conv2_block2_1_bn[0][0]


conv2_block2_2_conv (Conv2D) (None, 93, 140, 32) 36864 conv2_block2_1_relu[0][0]


conv2_block2_concat (Concatenat (None, 93, 140, 128) 0 conv2_block1_concat[0][0] conv2_block2_2_conv[0][0]


conv2_block3_0_bn (BatchNormali (None, 93, 140, 128) 512 conv2_block2_concat[0][0]


conv2_block3_0_relu (Activation (None, 93, 140, 128) 0 conv2_block3_0_bn[0][0]


conv2_block3_1_conv (Conv2D) (None, 93, 140, 128) 16384 conv2_block3_0_relu[0][0]


conv2_block3_1_bn (BatchNormali (None, 93, 140, 128) 512 conv2_block3_1_conv[0][0]


conv2_block3_1_relu (Activation (None, 93, 140, 128) 0 conv2_block3_1_bn[0][0]


conv2_block3_2_conv (Conv2D) (None, 93, 140, 32) 36864 conv2_block3_1_relu[0][0]


conv2_block3_concat (Concatenat (None, 93, 140, 160) 0 conv2_block2_concat[0][0] conv2_block3_2_conv[0][0]


conv2_block4_0_bn (BatchNormali (None, 93, 140, 160) 640 conv2_block3_concat[0][0]


conv2_block4_0_relu (Activation (None, 93, 140, 160) 0 conv2_block4_0_bn[0][0]


conv2_block4_1_conv (Conv2D) (None, 93, 140, 128) 20480 conv2_block4_0_relu[0][0]


conv2_block4_1_bn (BatchNormali (None, 93, 140, 128) 512 conv2_block4_1_conv[0][0]


conv2_block4_1_relu (Activation (None, 93, 140, 128) 0 conv2_block4_1_bn[0][0]


conv2_block4_2_conv (Conv2D) (None, 93, 140, 32) 36864 conv2_block4_1_relu[0][0]


conv2_block4_concat (Concatenat (None, 93, 140, 192) 0 conv2_block3_concat[0][0] conv2_block4_2_conv[0][0]


conv2_block5_0_bn (BatchNormali (None, 93, 140, 192) 768 conv2_block4_concat[0][0]


conv2_block5_0_relu (Activation (None, 93, 140, 192) 0 conv2_block5_0_bn[0][0]


conv2_block5_1_conv (Conv2D) (None, 93, 140, 128) 24576 conv2_block5_0_relu[0][0]


conv2_block5_1_bn (BatchNormali (None, 93, 140, 128) 512 conv2_block5_1_conv[0][0]


conv2_block5_1_relu (Activation (None, 93, 140, 128) 0 conv2_block5_1_bn[0][0]


conv2_block5_2_conv (Conv2D) (None, 93, 140, 32) 36864 conv2_block5_1_relu[0][0]


conv2_block5_concat (Concatenat (None, 93, 140, 224) 0 conv2_block4_concat[0][0] conv2_block5_2_conv[0][0]


conv2_block6_0_bn (BatchNormali (None, 93, 140, 224) 896 conv2_block5_concat[0][0]


conv2_block6_0_relu (Activation (None, 93, 140, 224) 0 conv2_block6_0_bn[0][0]


conv2_block6_1_conv (Conv2D) (None, 93, 140, 128) 28672 conv2_block6_0_relu[0][0]


conv2_block6_1_bn (BatchNormali (None, 93, 140, 128) 512 conv2_block6_1_conv[0][0]


conv2_block6_1_relu (Activation (None, 93, 140, 128) 0 conv2_block6_1_bn[0][0]


conv2_block6_2_conv (Conv2D) (None, 93, 140, 32) 36864 conv2_block6_1_relu[0][0]


conv2_block6_concat (Concatenat (None, 93, 140, 256) 0 conv2_block5_concat[0][0] conv2_block6_2_conv[0][0]


pool2_bn (BatchNormalization) (None, 93, 140, 256) 1024 conv2_block6_concat[0][0]


pool2_relu (Activation) (None, 93, 140, 256) 0 pool2_bn[0][0]


pool2_conv (Conv2D) (None, 93, 140, 128) 32768 pool2_relu[0][0]


pool2_pool (AveragePooling2D) (None, 46, 70, 128) 0 pool2_conv[0][0]


conv3_block1_0_bn (BatchNormali (None, 46, 70, 128) 512 pool2_pool[0][0]


conv3_block1_0_relu (Activation (None, 46, 70, 128) 0 conv3_block1_0_bn[0][0]


conv3_block1_1_conv (Conv2D) (None, 46, 70, 128) 16384 conv3_block1_0_relu[0][0]


conv3_block1_1_bn (BatchNormali (None, 46, 70, 128) 512 conv3_block1_1_conv[0][0]


conv3_block1_1_relu (Activation (None, 46, 70, 128) 0 conv3_block1_1_bn[0][0]


conv3_block1_2_conv (Conv2D) (None, 46, 70, 32) 36864 conv3_block1_1_relu[0][0]


conv3_block1_concat (Concatenat (None, 46, 70, 160) 0 pool2_pool[0][0] conv3_block1_2_conv[0][0]


conv3_block2_0_bn (BatchNormali (None, 46, 70, 160) 640 conv3_block1_concat[0][0]


conv3_block2_0_relu (Activation (None, 46, 70, 160) 0 conv3_block2_0_bn[0][0]


conv3_block2_1_conv (Conv2D) (None, 46, 70, 128) 20480 conv3_block2_0_relu[0][0]


conv3_block2_1_bn (BatchNormali (None, 46, 70, 128) 512 conv3_block2_1_conv[0][0]


conv3_block2_1_relu (Activation (None, 46, 70, 128) 0 conv3_block2_1_bn[0][0]


conv3_block2_2_conv (Conv2D) (None, 46, 70, 32) 36864 conv3_block2_1_relu[0][0]


conv3_block2_concat (Concatenat (None, 46, 70, 192) 0 conv3_block1_concat[0][0] conv3_block2_2_conv[0][0]


conv3_block3_0_bn (BatchNormali (None, 46, 70, 192) 768 conv3_block2_concat[0][0]


conv3_block3_0_relu (Activation (None, 46, 70, 192) 0 conv3_block3_0_bn[0][0]


conv3_block3_1_conv (Conv2D) (None, 46, 70, 128) 24576 conv3_block3_0_relu[0][0]


conv3_block3_1_bn (BatchNormali (None, 46, 70, 128) 512 conv3_block3_1_conv[0][0]


conv3_block3_1_relu (Activation (None, 46, 70, 128) 0 conv3_block3_1_bn[0][0]


conv3_block3_2_conv (Conv2D) (None, 46, 70, 32) 36864 conv3_block3_1_relu[0][0]


conv3_block3_concat (Concatenat (None, 46, 70, 224) 0 conv3_block2_concat[0][0] conv3_block3_2_conv[0][0]


conv3_block4_0_bn (BatchNormali (None, 46, 70, 224) 896 conv3_block3_concat[0][0]


conv3_block4_0_relu (Activation (None, 46, 70, 224) 0 conv3_block4_0_bn[0][0]


conv3_block4_1_conv (Conv2D) (None, 46, 70, 128) 28672 conv3_block4_0_relu[0][0]


conv3_block4_1_bn (BatchNormali (None, 46, 70, 128) 512 conv3_block4_1_conv[0][0]


conv3_block4_1_relu (Activation (None, 46, 70, 128) 0 conv3_block4_1_bn[0][0]


conv3_block4_2_conv (Conv2D) (None, 46, 70, 32) 36864 conv3_block4_1_relu[0][0]


conv3_block4_concat (Concatenat (None, 46, 70, 256) 0 conv3_block3_concat[0][0] conv3_block4_2_conv[0][0]


conv3_block5_0_bn (BatchNormali (None, 46, 70, 256) 1024 conv3_block4_concat[0][0]


conv3_block5_0_relu (Activation (None, 46, 70, 256) 0 conv3_block5_0_bn[0][0]


conv3_block5_1_conv (Conv2D) (None, 46, 70, 128) 32768 conv3_block5_0_relu[0][0]


conv3_block5_1_bn (BatchNormali (None, 46, 70, 128) 512 conv3_block5_1_conv[0][0]


conv3_block5_1_relu (Activation (None, 46, 70, 128) 0 conv3_block5_1_bn[0][0]


conv3_block5_2_conv (Conv2D) (None, 46, 70, 32) 36864 conv3_block5_1_relu[0][0]


conv3_block5_concat (Concatenat (None, 46, 70, 288) 0 conv3_block4_concat[0][0] conv3_block5_2_conv[0][0]


conv3_block6_0_bn (BatchNormali (None, 46, 70, 288) 1152 conv3_block5_concat[0][0]


conv3_block6_0_relu (Activation (None, 46, 70, 288) 0 conv3_block6_0_bn[0][0]


conv3_block6_1_conv (Conv2D) (None, 46, 70, 128) 36864 conv3_block6_0_relu[0][0]


conv3_block6_1_bn (BatchNormali (None, 46, 70, 128) 512 conv3_block6_1_conv[0][0]


conv3_block6_1_relu (Activation (None, 46, 70, 128) 0 conv3_block6_1_bn[0][0]


conv3_block6_2_conv (Conv2D) (None, 46, 70, 32) 36864 conv3_block6_1_relu[0][0]


conv3_block6_concat (Concatenat (None, 46, 70, 320) 0 conv3_block5_concat[0][0] conv3_block6_2_conv[0][0]


conv3_block7_0_bn (BatchNormali (None, 46, 70, 320) 1280 conv3_block6_concat[0][0]


conv3_block7_0_relu (Activation (None, 46, 70, 320) 0 conv3_block7_0_bn[0][0]


conv3_block7_1_conv (Conv2D) (None, 46, 70, 128) 40960 conv3_block7_0_relu[0][0]


conv3_block7_1_bn (BatchNormali (None, 46, 70, 128) 512 conv3_block7_1_conv[0][0]


conv3_block7_1_relu (Activation (None, 46, 70, 128) 0 conv3_block7_1_bn[0][0]


conv3_block7_2_conv (Conv2D) (None, 46, 70, 32) 36864 conv3_block7_1_relu[0][0]


conv3_block7_concat (Concatenat (None, 46, 70, 352) 0 conv3_block6_concat[0][0] conv3_block7_2_conv[0][0]


conv3_block8_0_bn (BatchNormali (None, 46, 70, 352) 1408 conv3_block7_concat[0][0]


conv3_block8_0_relu (Activation (None, 46, 70, 352) 0 conv3_block8_0_bn[0][0]


conv3_block8_1_conv (Conv2D) (None, 46, 70, 128) 45056 conv3_block8_0_relu[0][0]


conv3_block8_1_bn (BatchNormali (None, 46, 70, 128) 512 conv3_block8_1_conv[0][0]


conv3_block8_1_relu (Activation (None, 46, 70, 128) 0 conv3_block8_1_bn[0][0]


conv3_block8_2_conv (Conv2D) (None, 46, 70, 32) 36864 conv3_block8_1_relu[0][0]


conv3_block8_concat (Concatenat (None, 46, 70, 384) 0 conv3_block7_concat[0][0] conv3_block8_2_conv[0][0]


conv3_block9_0_bn (BatchNormali (None, 46, 70, 384) 1536 conv3_block8_concat[0][0]


conv3_block9_0_relu (Activation (None, 46, 70, 384) 0 conv3_block9_0_bn[0][0]


conv3_block9_1_conv (Conv2D) (None, 46, 70, 128) 49152 conv3_block9_0_relu[0][0]


conv3_block9_1_bn (BatchNormali (None, 46, 70, 128) 512 conv3_block9_1_conv[0][0]


conv3_block9_1_relu (Activation (None, 46, 70, 128) 0 conv3_block9_1_bn[0][0]


conv3_block9_2_conv (Conv2D) (None, 46, 70, 32) 36864 conv3_block9_1_relu[0][0]


conv3_block9_concat (Concatenat (None, 46, 70, 416) 0 conv3_block8_concat[0][0] conv3_block9_2_conv[0][0]


conv3_block10_0_bn (BatchNormal (None, 46, 70, 416) 1664 conv3_block9_concat[0][0]


conv3_block10_0_relu (Activatio (None, 46, 70, 416) 0 conv3_block10_0_bn[0][0]


conv3_block10_1_conv (Conv2D) (None, 46, 70, 128) 53248 conv3_block10_0_relu[0][0]


conv3_block10_1_bn (BatchNormal (None, 46, 70, 128) 512 conv3_block10_1_conv[0][0]


conv3_block10_1_relu (Activatio (None, 46, 70, 128) 0 conv3_block10_1_bn[0][0]


conv3_block10_2_conv (Conv2D) (None, 46, 70, 32) 36864 conv3_block10_1_relu[0][0]


conv3_block10_concat (Concatena (None, 46, 70, 448) 0 conv3_block9_concat[0][0] conv3_block10_2_conv[0][0]


conv3_block11_0_bn (BatchNormal (None, 46, 70, 448) 1792 conv3_block10_concat[0][0]


conv3_block11_0_relu (Activatio (None, 46, 70, 448) 0 conv3_block11_0_bn[0][0]


conv3_block11_1_conv (Conv2D) (None, 46, 70, 128) 57344 conv3_block11_0_relu[0][0]


conv3_block11_1_bn (BatchNormal (None, 46, 70, 128) 512 conv3_block11_1_conv[0][0]


conv3_block11_1_relu (Activatio (None, 46, 70, 128) 0 conv3_block11_1_bn[0][0]


conv3_block11_2_conv (Conv2D) (None, 46, 70, 32) 36864 conv3_block11_1_relu[0][0]


conv3_block11_concat (Concatena (None, 46, 70, 480) 0 conv3_block10_concat[0][0] conv3_block11_2_conv[0][0]


conv3_block12_0_bn (BatchNormal (None, 46, 70, 480) 1920 conv3_block11_concat[0][0]


conv3_block12_0_relu (Activatio (None, 46, 70, 480) 0 conv3_block12_0_bn[0][0]


conv3_block12_1_conv (Conv2D) (None, 46, 70, 128) 61440 conv3_block12_0_relu[0][0]


conv3_block12_1_bn (BatchNormal (None, 46, 70, 128) 512 conv3_block12_1_conv[0][0]


conv3_block12_1_relu (Activatio (None, 46, 70, 128) 0 conv3_block12_1_bn[0][0]


conv3_block12_2_conv (Conv2D) (None, 46, 70, 32) 36864 conv3_block12_1_relu[0][0]


conv3_block12_concat (Concatena (None, 46, 70, 512) 0 conv3_block11_concat[0][0] conv3_block12_2_conv[0][0]


pool3_bn (BatchNormalization) (None, 46, 70, 512) 2048 conv3_block12_concat[0][0]


pool3_relu (Activation) (None, 46, 70, 512) 0 pool3_bn[0][0]


pool3_conv (Conv2D) (None, 46, 70, 256) 131072 pool3_relu[0][0]


pool3_pool (AveragePooling2D) (None, 23, 35, 256) 0 pool3_conv[0][0]


conv4_block1_0_bn (BatchNormali (None, 23, 35, 256) 1024 pool3_pool[0][0]


conv4_block1_0_relu (Activation (None, 23, 35, 256) 0 conv4_block1_0_bn[0][0]


conv4_block1_1_conv (Conv2D) (None, 23, 35, 128) 32768 conv4_block1_0_relu[0][0]


conv4_block1_1_bn (BatchNormali (None, 23, 35, 128) 512 conv4_block1_1_conv[0][0]


conv4_block1_1_relu (Activation (None, 23, 35, 128) 0 conv4_block1_1_bn[0][0]


conv4_block1_2_conv (Conv2D) (None, 23, 35, 32) 36864 conv4_block1_1_relu[0][0]


conv4_block1_concat (Concatenat (None, 23, 35, 288) 0 pool3_pool[0][0] conv4_block1_2_conv[0][0]


conv4_block2_0_bn (BatchNormali (None, 23, 35, 288) 1152 conv4_block1_concat[0][0]


conv4_block2_0_relu (Activation (None, 23, 35, 288) 0 conv4_block2_0_bn[0][0]


conv4_block2_1_conv (Conv2D) (None, 23, 35, 128) 36864 conv4_block2_0_relu[0][0]


conv4_block2_1_bn (BatchNormali (None, 23, 35, 128) 512 conv4_block2_1_conv[0][0]


conv4_block2_1_relu (Activation (None, 23, 35, 128) 0 conv4_block2_1_bn[0][0]


conv4_block2_2_conv (Conv2D) (None, 23, 35, 32) 36864 conv4_block2_1_relu[0][0]


conv4_block2_concat (Concatenat (None, 23, 35, 320) 0 conv4_block1_concat[0][0] conv4_block2_2_conv[0][0]


conv4_block3_0_bn (BatchNormali (None, 23, 35, 320) 1280 conv4_block2_concat[0][0]


conv4_block3_0_relu (Activation (None, 23, 35, 320) 0 conv4_block3_0_bn[0][0]


conv4_block3_1_conv (Conv2D) (None, 23, 35, 128) 40960 conv4_block3_0_relu[0][0]


conv4_block3_1_bn (BatchNormali (None, 23, 35, 128) 512 conv4_block3_1_conv[0][0]


conv4_block3_1_relu (Activation (None, 23, 35, 128) 0 conv4_block3_1_bn[0][0]


conv4_block3_2_conv (Conv2D) (None, 23, 35, 32) 36864 conv4_block3_1_relu[0][0]


conv4_block3_concat (Concatenat (None, 23, 35, 352) 0 conv4_block2_concat[0][0] conv4_block3_2_conv[0][0]


conv4_block4_0_bn (BatchNormali (None, 23, 35, 352) 1408 conv4_block3_concat[0][0]


conv4_block4_0_relu (Activation (None, 23, 35, 352) 0 conv4_block4_0_bn[0][0]


conv4_block4_1_conv (Conv2D) (None, 23, 35, 128) 45056 conv4_block4_0_relu[0][0]


conv4_block4_1_bn (BatchNormali (None, 23, 35, 128) 512 conv4_block4_1_conv[0][0]


conv4_block4_1_relu (Activation (None, 23, 35, 128) 0 conv4_block4_1_bn[0][0]


conv4_block4_2_conv (Conv2D) (None, 23, 35, 32) 36864 conv4_block4_1_relu[0][0]


conv4_block4_concat (Concatenat (None, 23, 35, 384) 0 conv4_block3_concat[0][0] conv4_block4_2_conv[0][0]


conv4_block5_0_bn (BatchNormali (None, 23, 35, 384) 1536 conv4_block4_concat[0][0]


conv4_block5_0_relu (Activation (None, 23, 35, 384) 0 conv4_block5_0_bn[0][0]


conv4_block5_1_conv (Conv2D) (None, 23, 35, 128) 49152 conv4_block5_0_relu[0][0]


conv4_block5_1_bn (BatchNormali (None, 23, 35, 128) 512 conv4_block5_1_conv[0][0]


conv4_block5_1_relu (Activation (None, 23, 35, 128) 0 conv4_block5_1_bn[0][0]


conv4_block5_2_conv (Conv2D) (None, 23, 35, 32) 36864 conv4_block5_1_relu[0][0]


conv4_block5_concat (Concatenat (None, 23, 35, 416) 0 conv4_block4_concat[0][0] conv4_block5_2_conv[0][0]


conv4_block6_0_bn (BatchNormali (None, 23, 35, 416) 1664 conv4_block5_concat[0][0]


conv4_block6_0_relu (Activation (None, 23, 35, 416) 0 conv4_block6_0_bn[0][0]


conv4_block6_1_conv (Conv2D) (None, 23, 35, 128) 53248 conv4_block6_0_relu[0][0]


conv4_block6_1_bn (BatchNormali (None, 23, 35, 128) 512 conv4_block6_1_conv[0][0]


conv4_block6_1_relu (Activation (None, 23, 35, 128) 0 conv4_block6_1_bn[0][0]


conv4_block6_2_conv (Conv2D) (None, 23, 35, 32) 36864 conv4_block6_1_relu[0][0]


conv4_block6_concat (Concatenat (None, 23, 35, 448) 0 conv4_block5_concat[0][0] conv4_block6_2_conv[0][0]


conv4_block7_0_bn (BatchNormali (None, 23, 35, 448) 1792 conv4_block6_concat[0][0]


conv4_block7_0_relu (Activation (None, 23, 35, 448) 0 conv4_block7_0_bn[0][0]


conv4_block7_1_conv (Conv2D) (None, 23, 35, 128) 57344 conv4_block7_0_relu[0][0]


conv4_block7_1_bn (BatchNormali (None, 23, 35, 128) 512 conv4_block7_1_conv[0][0]


conv4_block7_1_relu (Activation (None, 23, 35, 128) 0 conv4_block7_1_bn[0][0]


conv4_block7_2_conv (Conv2D) (None, 23, 35, 32) 36864 conv4_block7_1_relu[0][0]


conv4_block7_concat (Concatenat (None, 23, 35, 480) 0 conv4_block6_concat[0][0] conv4_block7_2_conv[0][0]


conv4_block8_0_bn (BatchNormali (None, 23, 35, 480) 1920 conv4_block7_concat[0][0]


conv4_block8_0_relu (Activation (None, 23, 35, 480) 0 conv4_block8_0_bn[0][0]


conv4_block8_1_conv (Conv2D) (None, 23, 35, 128) 61440 conv4_block8_0_relu[0][0]


conv4_block8_1_bn (BatchNormali (None, 23, 35, 128) 512 conv4_block8_1_conv[0][0]


conv4_block8_1_relu (Activation (None, 23, 35, 128) 0 conv4_block8_1_bn[0][0]


conv4_block8_2_conv (Conv2D) (None, 23, 35, 32) 36864 conv4_block8_1_relu[0][0]


conv4_block8_concat (Concatenat (None, 23, 35, 512) 0 conv4_block7_concat[0][0] conv4_block8_2_conv[0][0]


conv4_block9_0_bn (BatchNormali (None, 23, 35, 512) 2048 conv4_block8_concat[0][0]


conv4_block9_0_relu (Activation (None, 23, 35, 512) 0 conv4_block9_0_bn[0][0]


conv4_block9_1_conv (Conv2D) (None, 23, 35, 128) 65536 conv4_block9_0_relu[0][0]


conv4_block9_1_bn (BatchNormali (None, 23, 35, 128) 512 conv4_block9_1_conv[0][0]


conv4_block9_1_relu (Activation (None, 23, 35, 128) 0 conv4_block9_1_bn[0][0]


conv4_block9_2_conv (Conv2D) (None, 23, 35, 32) 36864 conv4_block9_1_relu[0][0]


conv4_block9_concat (Concatenat (None, 23, 35, 544) 0 conv4_block8_concat[0][0] conv4_block9_2_conv[0][0]


conv4_block10_0_bn (BatchNormal (None, 23, 35, 544) 2176 conv4_block9_concat[0][0]


conv4_block10_0_relu (Activatio (None, 23, 35, 544) 0 conv4_block10_0_bn[0][0]


conv4_block10_1_conv (Conv2D) (None, 23, 35, 128) 69632 conv4_block10_0_relu[0][0]


conv4_block10_1_bn (BatchNormal (None, 23, 35, 128) 512 conv4_block10_1_conv[0][0]


conv4_block10_1_relu (Activatio (None, 23, 35, 128) 0 conv4_block10_1_bn[0][0]


conv4_block10_2_conv (Conv2D) (None, 23, 35, 32) 36864 conv4_block10_1_relu[0][0]


conv4_block10_concat (Concatena (None, 23, 35, 576) 0 conv4_block9_concat[0][0] conv4_block10_2_conv[0][0]


conv4_block11_0_bn (BatchNormal (None, 23, 35, 576) 2304 conv4_block10_concat[0][0]


conv4_block11_0_relu (Activatio (None, 23, 35, 576) 0 conv4_block11_0_bn[0][0]


conv4_block11_1_conv (Conv2D) (None, 23, 35, 128) 73728 conv4_block11_0_relu[0][0]


conv4_block11_1_bn (BatchNormal (None, 23, 35, 128) 512 conv4_block11_1_conv[0][0]


conv4_block11_1_relu (Activatio (None, 23, 35, 128) 0 conv4_block11_1_bn[0][0]


conv4_block11_2_conv (Conv2D) (None, 23, 35, 32) 36864 conv4_block11_1_relu[0][0]


conv4_block11_concat (Concatena (None, 23, 35, 608) 0 conv4_block10_concat[0][0] conv4_block11_2_conv[0][0]


conv4_block12_0_bn (BatchNormal (None, 23, 35, 608) 2432 conv4_block11_concat[0][0]


conv4_block12_0_relu (Activatio (None, 23, 35, 608) 0 conv4_block12_0_bn[0][0]


conv4_block12_1_conv (Conv2D) (None, 23, 35, 128) 77824 conv4_block12_0_relu[0][0]


conv4_block12_1_bn (BatchNormal (None, 23, 35, 128) 512 conv4_block12_1_conv[0][0]


conv4_block12_1_relu (Activatio (None, 23, 35, 128) 0 conv4_block12_1_bn[0][0]


conv4_block12_2_conv (Conv2D) (None, 23, 35, 32) 36864 conv4_block12_1_relu[0][0]


conv4_block12_concat (Concatena (None, 23, 35, 640) 0 conv4_block11_concat[0][0] conv4_block12_2_conv[0][0]


conv4_block13_0_bn (BatchNormal (None, 23, 35, 640) 2560 conv4_block12_concat[0][0]


conv4_block13_0_relu (Activatio (None, 23, 35, 640) 0 conv4_block13_0_bn[0][0]


conv4_block13_1_conv (Conv2D) (None, 23, 35, 128) 81920 conv4_block13_0_relu[0][0]


conv4_block13_1_bn (BatchNormal (None, 23, 35, 128) 512 conv4_block13_1_conv[0][0]


conv4_block13_1_relu (Activatio (None, 23, 35, 128) 0 conv4_block13_1_bn[0][0]


conv4_block13_2_conv (Conv2D) (None, 23, 35, 32) 36864 conv4_block13_1_relu[0][0]


conv4_block13_concat (Concatena (None, 23, 35, 672) 0 conv4_block12_concat[0][0] conv4_block13_2_conv[0][0]


conv4_block14_0_bn (BatchNormal (None, 23, 35, 672) 2688 conv4_block13_concat[0][0]


conv4_block14_0_relu (Activatio (None, 23, 35, 672) 0 conv4_block14_0_bn[0][0]


conv4_block14_1_conv (Conv2D) (None, 23, 35, 128) 86016 conv4_block14_0_relu[0][0]


conv4_block14_1_bn (BatchNormal (None, 23, 35, 128) 512 conv4_block14_1_conv[0][0]


conv4_block14_1_relu (Activatio (None, 23, 35, 128) 0 conv4_block14_1_bn[0][0]


conv4_block14_2_conv (Conv2D) (None, 23, 35, 32) 36864 conv4_block14_1_relu[0][0]


conv4_block14_concat (Concatena (None, 23, 35, 704) 0 conv4_block13_concat[0][0] conv4_block14_2_conv[0][0]


conv4_block15_0_bn (BatchNormal (None, 23, 35, 704) 2816 conv4_block14_concat[0][0]


conv4_block15_0_relu (Activatio (None, 23, 35, 704) 0 conv4_block15_0_bn[0][0]


conv4_block15_1_conv (Conv2D) (None, 23, 35, 128) 90112 conv4_block15_0_relu[0][0]


conv4_block15_1_bn (BatchNormal (None, 23, 35, 128) 512 conv4_block15_1_conv[0][0]


conv4_block15_1_relu (Activatio (None, 23, 35, 128) 0 conv4_block15_1_bn[0][0]


conv4_block15_2_conv (Conv2D) (None, 23, 35, 32) 36864 conv4_block15_1_relu[0][0]


conv4_block15_concat (Concatena (None, 23, 35, 736) 0 conv4_block14_concat[0][0] conv4_block15_2_conv[0][0]


conv4_block16_0_bn (BatchNormal (None, 23, 35, 736) 2944 conv4_block15_concat[0][0]


conv4_block16_0_relu (Activatio (None, 23, 35, 736) 0 conv4_block16_0_bn[0][0]


conv4_block16_1_conv (Conv2D) (None, 23, 35, 128) 94208 conv4_block16_0_relu[0][0]


conv4_block16_1_bn (BatchNormal (None, 23, 35, 128) 512 conv4_block16_1_conv[0][0]


conv4_block16_1_relu (Activatio (None, 23, 35, 128) 0 conv4_block16_1_bn[0][0]


conv4_block16_2_conv (Conv2D) (None, 23, 35, 32) 36864 conv4_block16_1_relu[0][0]


conv4_block16_concat (Concatena (None, 23, 35, 768) 0 conv4_block15_concat[0][0] conv4_block16_2_conv[0][0]


conv4_block17_0_bn (BatchNormal (None, 23, 35, 768) 3072 conv4_block16_concat[0][0]


conv4_block17_0_relu (Activatio (None, 23, 35, 768) 0 conv4_block17_0_bn[0][0]


conv4_block17_1_conv (Conv2D) (None, 23, 35, 128) 98304 conv4_block17_0_relu[0][0]


conv4_block17_1_bn (BatchNormal (None, 23, 35, 128) 512 conv4_block17_1_conv[0][0]


conv4_block17_1_relu (Activatio (None, 23, 35, 128) 0 conv4_block17_1_bn[0][0]


conv4_block17_2_conv (Conv2D) (None, 23, 35, 32) 36864 conv4_block17_1_relu[0][0]


conv4_block17_concat (Concatena (None, 23, 35, 800) 0 conv4_block16_concat[0][0] conv4_block17_2_conv[0][0]


conv4_block18_0_bn (BatchNormal (None, 23, 35, 800) 3200 conv4_block17_concat[0][0]


conv4_block18_0_relu (Activatio (None, 23, 35, 800) 0 conv4_block18_0_bn[0][0]


conv4_block18_1_conv (Conv2D) (None, 23, 35, 128) 102400 conv4_block18_0_relu[0][0]


conv4_block18_1_bn (BatchNormal (None, 23, 35, 128) 512 conv4_block18_1_conv[0][0]


conv4_block18_1_relu (Activatio (None, 23, 35, 128) 0 conv4_block18_1_bn[0][0]


conv4_block18_2_conv (Conv2D) (None, 23, 35, 32) 36864 conv4_block18_1_relu[0][0]


conv4_block18_concat (Concatena (None, 23, 35, 832) 0 conv4_block17_concat[0][0] conv4_block18_2_conv[0][0]


conv4_block19_0_bn (BatchNormal (None, 23, 35, 832) 3328 conv4_block18_concat[0][0]


conv4_block19_0_relu (Activatio (None, 23, 35, 832) 0 conv4_block19_0_bn[0][0]


conv4_block19_1_conv (Conv2D) (None, 23, 35, 128) 106496 conv4_block19_0_relu[0][0]


conv4_block19_1_bn (BatchNormal (None, 23, 35, 128) 512 conv4_block19_1_conv[0][0]


conv4_block19_1_relu (Activatio (None, 23, 35, 128) 0 conv4_block19_1_bn[0][0]


conv4_block19_2_conv (Conv2D) (None, 23, 35, 32) 36864 conv4_block19_1_relu[0][0]


conv4_block19_concat (Concatena (None, 23, 35, 864) 0 conv4_block18_concat[0][0] conv4_block19_2_conv[0][0]


conv4_block20_0_bn (BatchNormal (None, 23, 35, 864) 3456 conv4_block19_concat[0][0]


conv4_block20_0_relu (Activatio (None, 23, 35, 864) 0 conv4_block20_0_bn[0][0]


conv4_block20_1_conv (Conv2D) (None, 23, 35, 128) 110592 conv4_block20_0_relu[0][0]


conv4_block20_1_bn (BatchNormal (None, 23, 35, 128) 512 conv4_block20_1_conv[0][0]


conv4_block20_1_relu (Activatio (None, 23, 35, 128) 0 conv4_block20_1_bn[0][0]


conv4_block20_2_conv (Conv2D) (None, 23, 35, 32) 36864 conv4_block20_1_relu[0][0]


conv4_block20_concat (Concatena (None, 23, 35, 896) 0 conv4_block19_concat[0][0] conv4_block20_2_conv[0][0]


conv4_block21_0_bn (BatchNormal (None, 23, 35, 896) 3584 conv4_block20_concat[0][0]


conv4_block21_0_relu (Activatio (None, 23, 35, 896) 0 conv4_block21_0_bn[0][0]


conv4_block21_1_conv (Conv2D) (None, 23, 35, 128) 114688 conv4_block21_0_relu[0][0]


conv4_block21_1_bn (BatchNormal (None, 23, 35, 128) 512 conv4_block21_1_conv[0][0]


conv4_block21_1_relu (Activatio (None, 23, 35, 128) 0 conv4_block21_1_bn[0][0]


conv4_block21_2_conv (Conv2D) (None, 23, 35, 32) 36864 conv4_block21_1_relu[0][0]


conv4_block21_concat (Concatena (None, 23, 35, 928) 0 conv4_block20_concat[0][0] conv4_block21_2_conv[0][0]


conv4_block22_0_bn (BatchNormal (None, 23, 35, 928) 3712 conv4_block21_concat[0][0]


conv4_block22_0_relu (Activatio (None, 23, 35, 928) 0 conv4_block22_0_bn[0][0]


conv4_block22_1_conv (Conv2D) (None, 23, 35, 128) 118784 conv4_block22_0_relu[0][0]


conv4_block22_1_bn (BatchNormal (None, 23, 35, 128) 512 conv4_block22_1_conv[0][0]


conv4_block22_1_relu (Activatio (None, 23, 35, 128) 0 conv4_block22_1_bn[0][0]


conv4_block22_2_conv (Conv2D) (None, 23, 35, 32) 36864 conv4_block22_1_relu[0][0]


conv4_block22_concat (Concatena (None, 23, 35, 960) 0 conv4_block21_concat[0][0] conv4_block22_2_conv[0][0]


conv4_block23_0_bn (BatchNormal (None, 23, 35, 960) 3840 conv4_block22_concat[0][0]


conv4_block23_0_relu (Activatio (None, 23, 35, 960) 0 conv4_block23_0_bn[0][0]


conv4_block23_1_conv (Conv2D) (None, 23, 35, 128) 122880 conv4_block23_0_relu[0][0]


conv4_block23_1_bn (BatchNormal (None, 23, 35, 128) 512 conv4_block23_1_conv[0][0]


conv4_block23_1_relu (Activatio (None, 23, 35, 128) 0 conv4_block23_1_bn[0][0]


conv4_block23_2_conv (Conv2D) (None, 23, 35, 32) 36864 conv4_block23_1_relu[0][0]


conv4_block23_concat (Concatena (None, 23, 35, 992) 0 conv4_block22_concat[0][0] conv4_block23_2_conv[0][0]


conv4_block24_0_bn (BatchNormal (None, 23, 35, 992) 3968 conv4_block23_concat[0][0]


conv4_block24_0_relu (Activatio (None, 23, 35, 992) 0 conv4_block24_0_bn[0][0]


conv4_block24_1_conv (Conv2D) (None, 23, 35, 128) 126976 conv4_block24_0_relu[0][0]


conv4_block24_1_bn (BatchNormal (None, 23, 35, 128) 512 conv4_block24_1_conv[0][0]


conv4_block24_1_relu (Activatio (None, 23, 35, 128) 0 conv4_block24_1_bn[0][0]


conv4_block24_2_conv (Conv2D) (None, 23, 35, 32) 36864 conv4_block24_1_relu[0][0]


conv4_block24_concat (Concatena (None, 23, 35, 1024) 0 conv4_block23_concat[0][0] conv4_block24_2_conv[0][0]


pool4_bn (BatchNormalization) (None, 23, 35, 1024) 4096 conv4_block24_concat[0][0]


pool4_relu (Activation) (None, 23, 35, 1024) 0 pool4_bn[0][0]


pool4_conv (Conv2D) (None, 23, 35, 512) 524288 pool4_relu[0][0]


pool4_pool (AveragePooling2D) (None, 11, 17, 512) 0 pool4_conv[0][0]


conv5_block1_0_bn (BatchNormali (None, 11, 17, 512) 2048 pool4_pool[0][0]


conv5_block1_0_relu (Activation (None, 11, 17, 512) 0 conv5_block1_0_bn[0][0]


conv5_block1_1_conv (Conv2D) (None, 11, 17, 128) 65536 conv5_block1_0_relu[0][0]


conv5_block1_1_bn (BatchNormali (None, 11, 17, 128) 512 conv5_block1_1_conv[0][0]


conv5_block1_1_relu (Activation (None, 11, 17, 128) 0 conv5_block1_1_bn[0][0]


conv5_block1_2_conv (Conv2D) (None, 11, 17, 32) 36864 conv5_block1_1_relu[0][0]


conv5_block1_concat (Concatenat (None, 11, 17, 544) 0 pool4_pool[0][0] conv5_block1_2_conv[0][0]


conv5_block2_0_bn (BatchNormali (None, 11, 17, 544) 2176 conv5_block1_concat[0][0]


conv5_block2_0_relu (Activation (None, 11, 17, 544) 0 conv5_block2_0_bn[0][0]


conv5_block2_1_conv (Conv2D) (None, 11, 17, 128) 69632 conv5_block2_0_relu[0][0]


conv5_block2_1_bn (BatchNormali (None, 11, 17, 128) 512 conv5_block2_1_conv[0][0]


conv5_block2_1_relu (Activation (None, 11, 17, 128) 0 conv5_block2_1_bn[0][0]


conv5_block2_2_conv (Conv2D) (None, 11, 17, 32) 36864 conv5_block2_1_relu[0][0]


conv5_block2_concat (Concatenat (None, 11, 17, 576) 0 conv5_block1_concat[0][0] conv5_block2_2_conv[0][0]


conv5_block3_0_bn (BatchNormali (None, 11, 17, 576) 2304 conv5_block2_concat[0][0]


conv5_block3_0_relu (Activation (None, 11, 17, 576) 0 conv5_block3_0_bn[0][0]


conv5_block3_1_conv (Conv2D) (None, 11, 17, 128) 73728 conv5_block3_0_relu[0][0]


conv5_block3_1_bn (BatchNormali (None, 11, 17, 128) 512 conv5_block3_1_conv[0][0]


conv5_block3_1_relu (Activation (None, 11, 17, 128) 0 conv5_block3_1_bn[0][0]


conv5_block3_2_conv (Conv2D) (None, 11, 17, 32) 36864 conv5_block3_1_relu[0][0]


conv5_block3_concat (Concatenat (None, 11, 17, 608) 0 conv5_block2_concat[0][0] conv5_block3_2_conv[0][0]


conv5_block4_0_bn (BatchNormali (None, 11, 17, 608) 2432 conv5_block3_concat[0][0]


conv5_block4_0_relu (Activation (None, 11, 17, 608) 0 conv5_block4_0_bn[0][0]


conv5_block4_1_conv (Conv2D) (None, 11, 17, 128) 77824 conv5_block4_0_relu[0][0]


conv5_block4_1_bn (BatchNormali (None, 11, 17, 128) 512 conv5_block4_1_conv[0][0]


conv5_block4_1_relu (Activation (None, 11, 17, 128) 0 conv5_block4_1_bn[0][0]


conv5_block4_2_conv (Conv2D) (None, 11, 17, 32) 36864 conv5_block4_1_relu[0][0]


conv5_block4_concat (Concatenat (None, 11, 17, 640) 0 conv5_block3_concat[0][0] conv5_block4_2_conv[0][0]


conv5_block5_0_bn (BatchNormali (None, 11, 17, 640) 2560 conv5_block4_concat[0][0]


conv5_block5_0_relu (Activation (None, 11, 17, 640) 0 conv5_block5_0_bn[0][0]


conv5_block5_1_conv (Conv2D) (None, 11, 17, 128) 81920 conv5_block5_0_relu[0][0]


conv5_block5_1_bn (BatchNormali (None, 11, 17, 128) 512 conv5_block5_1_conv[0][0]


conv5_block5_1_relu (Activation (None, 11, 17, 128) 0 conv5_block5_1_bn[0][0]


conv5_block5_2_conv (Conv2D) (None, 11, 17, 32) 36864 conv5_block5_1_relu[0][0]


conv5_block5_concat (Concatenat (None, 11, 17, 672) 0 conv5_block4_concat[0][0] conv5_block5_2_conv[0][0]


conv5_block6_0_bn (BatchNormali (None, 11, 17, 672) 2688 conv5_block5_concat[0][0]


conv5_block6_0_relu (Activation (None, 11, 17, 672) 0 conv5_block6_0_bn[0][0]


conv5_block6_1_conv (Conv2D) (None, 11, 17, 128) 86016 conv5_block6_0_relu[0][0]


conv5_block6_1_bn (BatchNormali (None, 11, 17, 128) 512 conv5_block6_1_conv[0][0]


conv5_block6_1_relu (Activation (None, 11, 17, 128) 0 conv5_block6_1_bn[0][0]


conv5_block6_2_conv (Conv2D) (None, 11, 17, 32) 36864 conv5_block6_1_relu[0][0]


conv5_block6_concat (Concatenat (None, 11, 17, 704) 0 conv5_block5_concat[0][0] conv5_block6_2_conv[0][0]


conv5_block7_0_bn (BatchNormali (None, 11, 17, 704) 2816 conv5_block6_concat[0][0]


conv5_block7_0_relu (Activation (None, 11, 17, 704) 0 conv5_block7_0_bn[0][0]


conv5_block7_1_conv (Conv2D) (None, 11, 17, 128) 90112 conv5_block7_0_relu[0][0]


conv5_block7_1_bn (BatchNormali (None, 11, 17, 128) 512 conv5_block7_1_conv[0][0]


conv5_block7_1_relu (Activation (None, 11, 17, 128) 0 conv5_block7_1_bn[0][0]


conv5_block7_2_conv (Conv2D) (None, 11, 17, 32) 36864 conv5_block7_1_relu[0][0]


conv5_block7_concat (Concatenat (None, 11, 17, 736) 0 conv5_block6_concat[0][0] conv5_block7_2_conv[0][0]


conv5_block8_0_bn (BatchNormali (None, 11, 17, 736) 2944 conv5_block7_concat[0][0]


conv5_block8_0_relu (Activation (None, 11, 17, 736) 0 conv5_block8_0_bn[0][0]


conv5_block8_1_conv (Conv2D) (None, 11, 17, 128) 94208 conv5_block8_0_relu[0][0]


conv5_block8_1_bn (BatchNormali (None, 11, 17, 128) 512 conv5_block8_1_conv[0][0]


conv5_block8_1_relu (Activation (None, 11, 17, 128) 0 conv5_block8_1_bn[0][0]


conv5_block8_2_conv (Conv2D) (None, 11, 17, 32) 36864 conv5_block8_1_relu[0][0]


conv5_block8_concat (Concatenat (None, 11, 17, 768) 0 conv5_block7_concat[0][0] conv5_block8_2_conv[0][0]


conv5_block9_0_bn (BatchNormali (None, 11, 17, 768) 3072 conv5_block8_concat[0][0]


conv5_block9_0_relu (Activation (None, 11, 17, 768) 0 conv5_block9_0_bn[0][0]


conv5_block9_1_conv (Conv2D) (None, 11, 17, 128) 98304 conv5_block9_0_relu[0][0]


conv5_block9_1_bn (BatchNormali (None, 11, 17, 128) 512 conv5_block9_1_conv[0][0]


conv5_block9_1_relu (Activation (None, 11, 17, 128) 0 conv5_block9_1_bn[0][0]


conv5_block9_2_conv (Conv2D) (None, 11, 17, 32) 36864 conv5_block9_1_relu[0][0]


conv5_block9_concat (Concatenat (None, 11, 17, 800) 0 conv5_block8_concat[0][0] conv5_block9_2_conv[0][0]


conv5_block10_0_bn (BatchNormal (None, 11, 17, 800) 3200 conv5_block9_concat[0][0]


conv5_block10_0_relu (Activatio (None, 11, 17, 800) 0 conv5_block10_0_bn[0][0]


conv5_block10_1_conv (Conv2D) (None, 11, 17, 128) 102400 conv5_block10_0_relu[0][0]


conv5_block10_1_bn (BatchNormal (None, 11, 17, 128) 512 conv5_block10_1_conv[0][0]


conv5_block10_1_relu (Activatio (None, 11, 17, 128) 0 conv5_block10_1_bn[0][0]


conv5_block10_2_conv (Conv2D) (None, 11, 17, 32) 36864 conv5_block10_1_relu[0][0]


conv5_block10_concat (Concatena (None, 11, 17, 832) 0 conv5_block9_concat[0][0] conv5_block10_2_conv[0][0]


conv5_block11_0_bn (BatchNormal (None, 11, 17, 832) 3328 conv5_block10_concat[0][0]


conv5_block11_0_relu (Activatio (None, 11, 17, 832) 0 conv5_block11_0_bn[0][0]


conv5_block11_1_conv (Conv2D) (None, 11, 17, 128) 106496 conv5_block11_0_relu[0][0]


conv5_block11_1_bn (BatchNormal (None, 11, 17, 128) 512 conv5_block11_1_conv[0][0]


conv5_block11_1_relu (Activatio (None, 11, 17, 128) 0 conv5_block11_1_bn[0][0]


conv5_block11_2_conv (Conv2D) (None, 11, 17, 32) 36864 conv5_block11_1_relu[0][0]


conv5_block11_concat (Concatena (None, 11, 17, 864) 0 conv5_block10_concat[0][0] conv5_block11_2_conv[0][0]


conv5_block12_0_bn (BatchNormal (None, 11, 17, 864) 3456 conv5_block11_concat[0][0]


conv5_block12_0_relu (Activatio (None, 11, 17, 864) 0 conv5_block12_0_bn[0][0]


conv5_block12_1_conv (Conv2D) (None, 11, 17, 128) 110592 conv5_block12_0_relu[0][0]


conv5_block12_1_bn (BatchNormal (None, 11, 17, 128) 512 conv5_block12_1_conv[0][0]


conv5_block12_1_relu (Activatio (None, 11, 17, 128) 0 conv5_block12_1_bn[0][0]


conv5_block12_2_conv (Conv2D) (None, 11, 17, 32) 36864 conv5_block12_1_relu[0][0]


conv5_block12_concat (Concatena (None, 11, 17, 896) 0 conv5_block11_concat[0][0] conv5_block12_2_conv[0][0]


conv5_block13_0_bn (BatchNormal (None, 11, 17, 896) 3584 conv5_block12_concat[0][0]


conv5_block13_0_relu (Activatio (None, 11, 17, 896) 0 conv5_block13_0_bn[0][0]


conv5_block13_1_conv (Conv2D) (None, 11, 17, 128) 114688 conv5_block13_0_relu[0][0]


conv5_block13_1_bn (BatchNormal (None, 11, 17, 128) 512 conv5_block13_1_conv[0][0]


conv5_block13_1_relu (Activatio (None, 11, 17, 128) 0 conv5_block13_1_bn[0][0]


conv5_block13_2_conv (Conv2D) (None, 11, 17, 32) 36864 conv5_block13_1_relu[0][0]


conv5_block13_concat (Concatena (None, 11, 17, 928) 0 conv5_block12_concat[0][0] conv5_block13_2_conv[0][0]


conv5_block14_0_bn (BatchNormal (None, 11, 17, 928) 3712 conv5_block13_concat[0][0]


conv5_block14_0_relu (Activatio (None, 11, 17, 928) 0 conv5_block14_0_bn[0][0]


conv5_block14_1_conv (Conv2D) (None, 11, 17, 128) 118784 conv5_block14_0_relu[0][0]


conv5_block14_1_bn (BatchNormal (None, 11, 17, 128) 512 conv5_block14_1_conv[0][0]


conv5_block14_1_relu (Activatio (None, 11, 17, 128) 0 conv5_block14_1_bn[0][0]


conv5_block14_2_conv (Conv2D) (None, 11, 17, 32) 36864 conv5_block14_1_relu[0][0]


conv5_block14_concat (Concatena (None, 11, 17, 960) 0 conv5_block13_concat[0][0] conv5_block14_2_conv[0][0]


conv5_block15_0_bn (BatchNormal (None, 11, 17, 960) 3840 conv5_block14_concat[0][0]


conv5_block15_0_relu (Activatio (None, 11, 17, 960) 0 conv5_block15_0_bn[0][0]


conv5_block15_1_conv (Conv2D) (None, 11, 17, 128) 122880 conv5_block15_0_relu[0][0]


conv5_block15_1_bn (BatchNormal (None, 11, 17, 128) 512 conv5_block15_1_conv[0][0]


conv5_block15_1_relu (Activatio (None, 11, 17, 128) 0 conv5_block15_1_bn[0][0]


conv5_block15_2_conv (Conv2D) (None, 11, 17, 32) 36864 conv5_block15_1_relu[0][0]


conv5_block15_concat (Concatena (None, 11, 17, 992) 0 conv5_block14_concat[0][0] conv5_block15_2_conv[0][0]


conv5_block16_0_bn (BatchNormal (None, 11, 17, 992) 3968 conv5_block15_concat[0][0]


conv5_block16_0_relu (Activatio (None, 11, 17, 992) 0 conv5_block16_0_bn[0][0]


conv5_block16_1_conv (Conv2D) (None, 11, 17, 128) 126976 conv5_block16_0_relu[0][0]


conv5_block16_1_bn (BatchNormal (None, 11, 17, 128) 512 conv5_block16_1_conv[0][0]


conv5_block16_1_relu (Activatio (None, 11, 17, 128) 0 conv5_block16_1_bn[0][0]


conv5_block16_2_conv (Conv2D) (None, 11, 17, 32) 36864 conv5_block16_1_relu[0][0]


conv5_block16_concat (Concatena (None, 11, 17, 1024) 0 conv5_block15_concat[0][0] conv5_block16_2_conv[0][0]


bn (BatchNormalization) (None, 11, 17, 1024) 4096 conv5_block16_concat[0][0]


relu (Activation) (None, 11, 17, 1024) 0 bn[0][0]


avg_pool (GlobalAveragePooling2 (None, 1024) 0 relu[0][0]


fc1000 (Dense) (None, 10) 10250 avg_pool[0][0]

Total params: 7,047,754
Trainable params: 6,964,106
Non-trainable params: 83,648


Train for 100 steps, validate for 10 steps
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/layers/normalization.py:477: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
2019-09-19 11:25:34.482086: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2019-09-19 11:25:34.711640: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2019-09-19 11:25:35.685779: W tensorflow/stream_executor/gpu/redzone_allocator.cc:312] Not found: ./bin/ptxas not found
Relying on driver to perform ptx compilation. This message will be only logged once. </pre> </details>

If I remove the MirroredStrategy scope, the code runs successfully and does not hang (it still performs the same meaningless training, just on a single device).
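For reference, the failing pattern reduces to building and compiling the Keras model inside the strategy scope and then calling `fit`. The sketch below uses a tiny placeholder model and random stand-in data instead of the DenseNet121 setup shown above; it is an illustration of the pattern, not the reporter's exact script:

```python
import numpy as np
import tensorflow as tf

# Create the strategy; with no visible GPUs this falls back to a single
# CPU replica, with multiple GPUs it mirrors variables across all of them.
strategy = tf.distribute.MirroredStrategy()
print("Number of replicas:", strategy.num_replicas_in_sync)

# Build and compile the model inside the scope, as in the report.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(16, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# Random stand-in data; in the report, training hangs inside fit() on multi-GPU.
x = np.random.rand(64, 8).astype("float32")
y = np.random.randint(0, 10, size=(64,))
model.fit(x, y, epochs=1, batch_size=16, verbose=0)
```

Removing the `with strategy.scope():` block (and dedenting the model construction) corresponds to the single-device case that completes without hanging.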

Investigation

<details> <summary><code>top</code></summary> <pre> 3161 root 20 0 0.112t 0.013t 948384 S 24.0 5.3 181:17.23 python3 </pre> </details>

The output of nvidia-smi is the same as the one quoted in the "System information" section above: all GPUs stay at 100% utilization the whole time.

<details> <summary><code>top -H -p 3161</code> - threads of the running process</summary> <pre>
Threads: 155 total,   0 running, 155 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.9 us,  0.8 sy,  0.0 ni, 97.8 id,  0.0 wa,  0.3 hi,  0.2 si,  0.0 st
KiB Mem : 26408952+total, 99229216 free, 21207464 used, 14365283+buff/cache
KiB Swap:        0 total,        0 free,        0 used. 20145740+avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
 3261 root      20   0  0.112t 0.013t 948360 S  6.3  5.3  42:18.36 python3
 3255 root      20   0  0.112t 0.013t 948360 S  6.0  5.3  41:49.75 python3
 3259 root      20   0  0.112t 0.013t 948360 S  6.0  5.3  42:09.41 python3
 3257 root      20   0  0.112t 0.013t 948360 S  5.6  5.3  42:10.03 python3
 3161 root      20   0  0.112t 0.013t 948360 S  0.0  5.3   2:11.62 python3
 3165 root      20   0  0.112t 0.013t 948360 S  0.0  5.3   0:00.00 python3
 3166 root      20   0  0.112t 0.013t 948360 S  0.0  5.3   0:15.45 python3
 ...
</pre> </details>

<details> <summary><code>bt</code> in <code>gdb --pid 3161</code> - trace of the main thread</summary> <pre> #0 0x00007f26924c5839 in syscall () from /lib/x86_64-linux-gnu/libc.so.6 #1 0x00007f264b30e53b in nsync::nsync_mu_semaphore_p_with_deadline(nsync::nsync_semaphore_s_, timespec) () from /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/pywrap_tensorflow_internal.so #2 0x00007f264b30db59 in nsync::nsync_sem_wait_with_cancel(nsync::waiter, timespec, nsync::nsync_note_s_) () from /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/pywrap_tensorflow_internal.so #3 0x00007f264b30b11b in nsync::nsync_cv_wait_with_deadline_generic(nsync::nsync_cv_s, void*, void ()(void), void ()(void), timespec, nsync::nsync_note_s_) () from /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/pywrap_tensorflow_internal.so #4 0x00007f264b30b5f3 in nsync::nsync_cv_wait_with_deadline(nsync::nsync_cv_s, nsync::nsync_mu_s_, timespec, nsync::nsync_note_s_) () from /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so #5 0x00007f264344f60c in tensorflow::KernelAndDeviceFunc::Run(tensorflow::ScopedStepContainer*, absl::InlinedVector<tensorflow::TensorValue, 4ul, std::allocatortensorflow::TensorValue > const&, std::vector<tensorflow::Tensor, std::allocatortensorflow::Tensor >, tensorflow::NodeExecStats, tensorflow::StepStats*, tensorflow::GraphCollector*, tensorflow::CancellationManager*) () from /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so #6 0x00007f264344fa06 in tensorflow::KernelAndDeviceFunc::Run(absl::InlinedVector<tensorflow::TensorValue, 4ul, std::allocatortensorflow::TensorValue > const&, std::vector<tensorflow::Tensor, std::allocatortensorflow::Tensor >, tensorflow::NodeExecStats, tensorflow::StepStats*, tensorflow::GraphCollector*, tensorflow::CancellationManager*) () from /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so 
#7 0x00007f26434313f6 in tensorflow::EagerKernelExecute(tensorflow::EagerContext*, absl::InlinedVector<tensorflow::TensorHandle*, 4ul, std::allocatortensorflow::TensorHandle* > const&, std::unique_ptr<tensorflow::KernelAndDevice, tensorflow::core::RefCountDeleter> const&, tensorflow::NodeExecStats*, tensorflow::StepStats*, tensorflow::GraphCollector*, tensorflow::CancellationManager*, absl::Spantensorflow::TensorHandle*) () from /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so #8 0x00007f2643431aed in tensorflow::ExecuteNode::Run() () from /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so #9 0x00007f264346ca85 in tensorflow::EagerExecutor::RunItem(std::unique_ptr<tensorflow::EagerExecutor::NodeItem, tensorflow::core::RefCountDeleter>) () from /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so #10 0x00007f264346d18d in tensorflow::EagerExecutor::AddOrExecute(std::unique_ptr<tensorflow::EagerNode, std::default_deletetensorflow::EagerNode >) () from /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so #11 0x00007f264342cd86 in tensorflow::(anonymous namespace)::EagerLocalExecute(tensorflow::EagerOperation*, tensorflow::TensorHandle**, int*) () from /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so #12 0x00007f264342ed00 in tensorflow::EagerExecute(tensorflow::EagerOperation*, tensorflow::TensorHandle**, int*) () ---Type <return> to continue, or q <return> to quit--- from /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so #13 0x00007f26432bc05d in TFE_Execute () from /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so #14 0x00007f264324640c in TFE_Py_ExecuteCancelable(TFE_Context*, char const*, char const*, absl::InlinedVector<TFE_TensorHandle*, 4ul, 
std::allocator<TFE_TensorHandle*> >, _object, TFE_CancellationManager*, absl::InlinedVector<TFE_TensorHandle*, 2ul, std::allocator<TFE_TensorHandle*> >, TF_Status) () from /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so #15 0x00007f2643246941 in TFE_Py_Execute(TFE_Context*, char const*, char const*, absl::InlinedVector<TFE_TensorHandle*,4ul, std::allocator<TFE_TensorHandle*> >, _object, absl::InlinedVector<TFE_TensorHandle*, 2ul, std::allocator<TFE_TensorHandle*> >, TF_Status) () from /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so #16 0x00007f2642ddeb34 in _wrap_TFE_Py_Execute () from /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so #17 0x00000000005097cf in _PyCFunction_FastCallDict (kwargs=<optimized out>, nargs=<optimized out>, args=<optimized out>, func_obj=<built-in method TFE_Py_Execute of module object at remote 0x7f26805d2778>) at ../Objects/methodobject.c:234 #18 _PyCFunction_FastCallKeywords (kwnames=<optimized out>, nargs=<optimized out>, stack=<optimized out>, func=<optimized out>) at ../Objects/methodobject.c:294 #19 call_function.lto_priv () at ../Python/ceval.c:4851 #20 0x000000000050b4a9 in _PyEval_EvalFrameDefault () at ../Python/ceval.c:3335 #21 0x0000000000507125 in PyEval_EvalFrameEx (throwflag=0, f= Frame 0x62d109a8, for file /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/execute.py, line 61,in quick_execute (op_name='__inference_distributed_function_164755', num_outputs=3, inputs=[<tensorflow.python.framework.ops.EagerTensor at remote 0x7f256431f198>, <tensorflow.python.framework.ops.EagerTensor at remote 0x7f256431f2e8>, <tensorflow.python.framework.ops.EagerTensor at remote 0x7f25642d2c18>, <tensorflow.python.framework.ops.EagerTensor at remote 0x7f263badc6d8>, <tensorflow.python.framework.ops.EagerTensor at remote 0x7f260c506cc0>, <tensorflow.python.framework.ops.EagerTensor at remote 
0x7f260c50f8d0>, <tensorflow.python.framework.ops.EagerTensor at remote 0x7f260c506780>, <tensorflow.python.framework.ops.EagerTensor at remote 0x7f260c49d2e8>, <tensorflow.python.framework.ops.EagerTensor at remote 0x7f260c50fc18>, <tensorflow.python.framework.ops.EagerTensor at remote 0x7f260c420d68>, <tensorflow.python.framework.ops.EagerTensor at remote 0x7f260c420630>, <tensorflow.python.frame...(truncated)) at ../Python/ceval.c:754 #22 _PyEval_EvalCodeWithName.lto_priv.1821 () at ../Python/ceval.c:4166 #23 0x0000000000508fa0 in fast_function.lto_priv () at ../Python/ceval.c:4992 #24 0x000000000050999d in call_function.lto_priv () at ../Python/ceval.c:4872 #25 0x000000000050c36e in _PyEval_EvalFrameDefault () at ../Python/ceval.c:3351 #26 0x0000000000507125 in PyEval_EvalFrameEx (throwflag=0, f=Frame 0x71ccbef8, for file /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/function.py, line 4---Type <return> to continue, or q <return> to quit--- 95, in call (self=<_EagerDefinedFunction(name=b'__inference_distributed_function_164755', _function_deleter=<_EagerDefinedFunctionDeleter(name=b'__inference_distributed_function_164755') at remote 0x7f1e0e0df438>, _registered_on_context=True, definition=<FunctionDef at remote 0x7f24bc06bfa8>, signature=<OpDef at remote 0x7f24bc06bef8>, _num_outputs=3, _output_types=[9, 1, 1], _output_shapes=[<TensorShape(_dims=[]) at remote 0x7f2384537a90>, <TensorShape(_dims=[]) at remote 0x7f2384537518>, <TensorShape(_dims=[]) at remote 0x7f2384537e80>], _control_captures=set(), _func_graph_outputs=[<Tensor(_op=<Operation(_graph=<FuncGraph(_lock=<_thread.RLock at remote 0x7f25642c78d0>, _group_lock=<GroupLock(_ready=<Condition(_lock=<_thread.lock at remote 0x7f24c4746288>, acquire=<built-in method acquire of _thread.lock object at remote 0x7f24c4746288>, release=<built-in method release of _thread.lock object at...(truncated)) at ../Python/ceval.c:754 #27 _PyEval_EvalCodeWithName.lto_priv.1821 () at 
../Python/ceval.c:4166 #28 0x0000000000508fa0 in fast_function.lto_priv () at ../Python/ceval.c:4992 #29 0x000000000050999d in call_function.lto_priv () at ../Python/ceval.c:4872 #30 0x000000000050c36e in _PyEval_EvalFrameDefault () at ../Python/ceval.c:3351 #31 0x0000000000507125 in PyEval_EvalFrameEx (throwflag=0, f=Frame 0x71ccb5b8, for file /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/function.py, line 1600, in _call_flat (self=<ConcreteFunction(_arg_keywords=None, _num_positional_args=None, _func_graph=<FuncGraph(_lock=<_thread.RLock at remote 0x7f25642c78d0>, _group_lock=<GroupLock(_ready=<Condition(_lock=<_thread.lock at remote 0x7f24c4746288>, acquire=<built-in method acquire of _thread.lock object at remote 0x7f24c4746288>, release=<built-in method release of _thread.lock object at remote 0x7f24c4746288>, _waiters=<collections.deque at remote 0x7f24e44428d0>) at remote0x7f2384537f60>, _num_groups=2, _group_member_counts=[0, 0]) at remote 0x7f2384537c88>, _nodes_by_id={1: <Operation(_graph=<...>, _inputs_val=(), _id_value=1, _original_op=None, _traceback=<tensorflow_core.python._tf_stack.StackSummary at remote 0x7f23844c6fb8>, _device_code_locations=[<TraceableObject(obj='/job:localhost/replica:0/task:0/device:GPU:0', filename='/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/fr...(truncated)) at ../Python/ceval.c:754 #32 _PyEval_EvalCodeWithName.lto_priv.1821 () at ../Python/ceval.c:4166 #33 0x0000000000508fa0 in fast_function.lto_priv () at ../Python/ceval.c:4992 #34 0x000000000050999d in call_function.lto_priv () at ../Python/ceval.c:4872 #35 0x000000000050b4a9 in _PyEval_EvalFrameDefault () at ../Python/ceval.c:3335 #36 0x0000000000508c69 in PyEval_EvalFrameEx (throwflag=0, f=Frame 0x7f18b8000b38, for file /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/function.py, line 1515, in _filtered_call (self=<ConcreteFunction(_arg_keywords=None, _num_positional_args=None, 
_func_graph=<FuncGraph(_lock=<_thread.RLock at remote 0x7f25642c78d0>, _group_lock=<GroupLock(_ready=<Condition(_lock=<_thread.lock at remote0x7f24c4746288>, acquire=<built-in method acquire of _thread.lock object at remote 0x7f24c4746288>, release=<built-in method release of _thread.lock object at remote 0x7f24c4746288>, _waiters=<collections.deque at remote 0x7f24e44428d0>) at remote 0x7f2384537f60>, _num_groups=2, _group_member_counts=[0, 0]) at remote 0x7f2384537c88>, _nodes_by_id={1: <Operation(_graph=<...>, _inputs_val=(), _id_value=1, _original_op=None, _traceback=<tensorflow_core.python._tf_stack.StackSummary at remote 0x7f23844c6fb8>, _device_code_locations=[<TraceableObject(obj='/job:localhost/replica:0/task:0/device:GPU:0', filename='/usr/local/lib/python3.6/dist-packages/tensorflow_core/p...(truncated)) at ../Python/ceval.c:754 ---Type <return> to continue, or q <return> to quit--- #37 _PyFunction_FastCall (globals=<optimized out>, nargs=139744142953272, args=<optimized out>, co=<optimized out>) at ../Python/ceval.c:4933 #38 fast_function.lto_priv () at ../Python/ceval.c:4968 #39 0x000000000050999d in call_function.lto_priv () at ../Python/ceval.c:4872 #40 0x000000000050b4a9 in _PyEval_EvalFrameDefault () at ../Python/ceval.c:3335 #41 0x0000000000507125 in PyEval_EvalFrameEx (throwflag=0, f=Frame 0x1d37bb48, for file /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/function.py, line 2237, in call (self=<Function(_python_function=<function at remote 0x7f2635ff3a60>, _function_spec=<FunctionSpec(_fullargspec=<FullArgSpec at remote 0x7f24942b4eb8>, _is_method=False, _default_values=None, _args_to_indices={'input_iterator': 0}, arg_names=['input_iterator'], vararg_name=None, _arg_indices_to_default_values={}, _input_signature=None) at remote 0x7f25642e3630>, _name='distributed_function', _autograph=False, _autograph_options=None, _experimental_relax_shapes=False, _function_cache=<FunctionCache(missed={<CacheKey at remote 0x7f244a21be28>}, 
primary={<CacheKey at remote 0x7f244a21bd68>: <ConcreteFunction(_arg_keywords=None, _num_positional_args=None, _func_graph=<FuncGraph(_lock=<_thread.RLock at remote 0x7f25642c78d0>, _group_lock=<GroupLock(_ready=<Condition(_lock=<_thread.lock at remote 0x7f24c4746288>, acquire=<built-in method acquire of _thread.lock object at remote 0x7f24c4...(truncated)) at ../Python/ceval.c:754 #42 _PyEval_EvalCodeWithName.lto_priv.1821 () at ../Python/ceval.c:4166 #43 0x0000000000508794 in _PyFunction_FastCallDict () at ../Python/ceval.c:5084 #44 0x00000000005940d1 in _PyObject_FastCallDict (kwargs={}, nargs=2, args=0x7ffcaa451a50, func=<function at remote 0x7f263bd949d8>) at ../Objects/abstract.c:2310 #45 _PyObject_Call_Prepend (kwargs={}, args=<optimized out>, obj=<optimized out>, func=<function at remote 0x7f263bd949d8>) at ../Objects/abstract.c:2373 #46 method_call.lto_priv () at ../Objects/classobject.c:314 #47 0x0000000000549f41 in PyObject_Call (kwargs={}, args=(<DistributedIterator(_enable_get_next_as_optional=False, _iterators=[<_SingleWorkerDatasetIterator(_dataset=<_AutoShardDataset(_input_dataset=<_OptionsDataset(_input_dataset=<_OptionsDataset(_input_dataset=<PrefetchDataset(_input_dataset=<_RebatchDataset(_input_dataset=<MapDataset(_input_dataset=<BatchDataset(_input_dataset=<RepeatDataset(_input_dataset=<MapDataset(_input_dataset=<TensorDataset(_structure=<TensorSpec at remote 0x7f26295ffe10>, _tensors=[<tensorflow.python.framework.ops.EagerTensor at remote 0x7f263d514438>], _variant_tensor_attr=<tensorflow.python.framework.ops.EagerTensor at remote 0x7f263d5148d0>, _self_setattr_tracking=True, _self_unconditional_checkpoint_dependencies=[<TrackableReference at remote 0x7f26295ffd80>], _self_unconditional_dependency_names={'_variant_tracker': <_VariantTracker(_resource_handle=<...>, _resource_device='CPU', _resource_deleter=<CapturableResourceDeleter(_destroy_resource=None) at remote 0x7f263afb4400>, _create_resource=<function at remote 0x7f263bb23620>, 
_sel...(truncated), func=<method at remote 0x7f25643a5d88>) at ../Objects/abstract.c:2261 #48 slot_tp_call () at ../Objects/typeobject.c:6207 #49 0x000000000059f50e in PyObject_Call () at ../Objects/abstract.c:2261 #50 0x000000000050c854 in do_call_core (kwdict={}, ---Type <return> to continue, or q <return> to quit--- callargs=(<DistributedIterator(_enable_get_next_as_optional=False, _iterators=[<_SingleWorkerDatasetIterator(_dataset=<_AutoShardDataset(_input_dataset=<_OptionsDataset(_input_dataset=<_OptionsDataset(_input_dataset=<PrefetchDataset(_input_dataset=<_RebatchDataset(_input_dataset=<MapDataset(_input_dataset=<BatchDataset(_input_dataset=<RepeatDataset(_input_dataset=<MapDataset(_input_dataset=<TensorDataset(_structure=<TensorSpec at remote 0x7f26295ffe10>, _tensors=[<tensorflow.python.framework.ops.EagerTensor at remote 0x7f263d514438>], _variant_tensor_attr=<tensorflow.python.framework.ops.EagerTensor at remote 0x7f263d5148d0>, _self_setattr_tracking=True, _self_unconditional_checkpoint_dependencies=[<TrackableReference at remote 0x7f26295ffd80>], _self_unconditional_dependency_names={'_variant_tracker': <_VariantTracker(_resource_handle=<...>, _resource_device='CPU', _resource_deleter=<CapturableResourceDeleter(_destroy_resource=None) at remote 0x7f263afb4400>, _create_resource=<function at remote 0x7f263bb23620>, _sel...(truncated), func=<Function(_python_function=<function at remote 0x7f2635ff3a60>, _function_spec=<FunctionSpec(_fullargspec=<FullArgSpec at remote 0x7f24942b4eb8>, _is_method=False, _default_values=None, _args_to_indices={'input_iterator': 0}, arg_names=['input_iterator'], vararg_name=None, _arg_indices_to_default_values={}, _input_signature=None) at remote 0x7f25642e3630>, _name='distributed_function', _autograph=False, _autograph_options=None, _experimental_relax_shapes=False, _function_cache=<FunctionCache(missed={<CacheKey at remote 0x7f244a21be28>}, primary={<CacheKey at remote 0x7f244a21bd68>: 
<ConcreteFunction(_arg_keywords=None, _num_positional_args=None, _func_graph=<FuncGraph(_lock=<_thread.RLock at remote 0x7f25642c78d0>, _group_lock=<GroupLock(_ready=<Condition(_lock=<_thread.lock at remote 0x7f24c4746288>, acquire=<built-inmethod acquire of _thread.lock object at remote 0x7f24c4746288>, release=<built-in method release of _thread.lock object at remote 0x7f24c4746288>, _waiters=<collections.deque at remote 0x7f24e...(truncated)) at ../Python/ceval.c:5120 #51 _PyEval_EvalFrameDefault () at ../Python/ceval.c:3404 #52 0x0000000000507125 in PyEval_EvalFrameEx (throwflag=0, f=Frame 0x68702018, for file /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/def_function.py, line 543, in _call (args=(<DistributedIterator(_enable_get_next_as_optional=False, _iterators=[<_SingleWorkerDatasetIterator(_dataset=<_AutoShardDataset(_input_dataset=<_OptionsDataset(_input_dataset=<_OptionsDataset(_input_dataset=<PrefetchDataset(_input_dataset=<_RebatchDataset(_input_dataset=<MapDataset(_input_dataset=<BatchDataset(_input_dataset=<RepeatDataset(_input_dataset=<MapDataset(_input_dataset=<TensorDataset(_structure=<TensorSpec at remote 0x7f26295ffe10>, _tensors=[<tensorflow.python.framework.ops.EagerTensor at remote 0x7f263d514438>], _variant_tensor_attr=<tensorflow.python.framework.ops.EagerTensor at remote 0x7f263d5148d0>, _self_setattr_tracking=True, _self_unconditional_checkpoint_dependencies=[<TrackableReference at remote 0x7f26295ffd80>], _self_unconditional_dependency_names={'_variant_tracker': <_VariantTracker(_resource_handle=<...>, _resource_device='CPU', _resource_deleter...(truncated)) at ../Python/ceval.c:754 #53 _PyEval_EvalCodeWithName.lto_priv.1821 () at ../Python/ceval.c:4166 #54 0x0000000000508794 in _PyFunction_FastCallDict () at ../Python/ceval.c:5084 #55 0x00000000005940d1 in _PyObject_FastCallDict (kwargs={}, nargs=2, args=0x7ffcaa451e10, func=<function at remote 0x7f263bdae048>) at ../Objects/abstract.c:2310 #56 
_PyObject_Call_Prepend (kwargs={}, args=<optimized out>, obj=<optimized out>, func=<function at remote 0x7f263bdae048>) at ../Objects/abstract.c:2373 #57 method_call.lto_priv () at ../Objects/classobject.c:314 ---Type <return> to continue, or q <return> to quit--- #58 0x000000000059f50e in PyObject_Call () at ../Objects/abstract.c:2261 #59 0x000000000050c854 in do_call_core (kwdict={}, callargs=(<DistributedIterator(_enable_get_next_as_optional=False, _iterators=[<_SingleWorkerDatasetIterator(_dataset=<_AutoShardDataset(_input_dataset=<_OptionsDataset(_input_dataset=<_OptionsDataset(_input_dataset=<PrefetchDataset(_input_dataset=<_RebatchDataset(_input_dataset=<MapDataset(_input_dataset=<BatchDataset(_input_dataset=<RepeatDataset(_input_dataset=<MapDataset(_input_dataset=<TensorDataset(_structure=<TensorSpec at remote 0x7f26295ffe10>, _tensors=[<tensorflow.python.framework.ops.EagerTensor at remote 0x7f263d514438>], _variant_tensor_attr=<tensorflow.python.framework.ops.EagerTensor at remote 0x7f263d5148d0>, _self_setattr_tracking=True, _self_unconditional_checkpoint_dependencies=[<TrackableReference at remote 0x7f26295ffd80>], _self_unconditional_dependency_names={'_variant_tracker': <_VariantTracker(_resource_handle=<...>, _resource_device='CPU', _resource_deleter=<CapturableResourceDeleter(_destroy_resource=None) at remote 0x7f263afb4400>, _create_resource=<function at remote 0x7f263bb23620>, _sel...(truncated), func=<method at remote 0x7f25b05c7f88>) at ../Python/ceval.c:5120 #60 _PyEval_EvalFrameDefault () at ../Python/ceval.c:3404 #61 0x0000000000507125 in PyEval_EvalFrameEx (throwflag=0, f=Frame 0x7f2564359dd8, for file /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/def_function.py, line 480, in call (self=<Function(_lock=<_thread.lock at remote 0x7f2564374df0>, _python_function=<function at remote 0x7f2564495f28>, _function_spec=<FunctionSpec(_fullargspec=<FullArgSpec at remote 0x7f25644326d8>, _is_method=False, _default_values=None, 
_args_to_indices={'input_iterator': 0}, arg_names=['input_iterator'], vararg_name=None, _arg_indices_to_default_values={}, _input_signature=None) at remote 0x7f256435b400>, _autograph=False, _experimental_autograph_options=None, experimental_relax_shapes=False, _experimental_compile=None, _created_variables=[<weakref at remote 0x7f256418ea48>, <weakref at remote 0x7f256418eae8>, <weakref at remote 0x7f256418ebd8>, <weakref at remote 0x7f256418ed18>, <weakref at remote 0x7f256418ed68>, <weakref at remote 0x7f256418eef8>, <weakref at remote 0x7f252832d098>, <weakref at remote 0x7f252832d188>, <weakref at remote 0x7f252832d228>, <weakref at r...(truncated)) at ../Python/ceval.c:754 #62 _PyEval_EvalCodeWithName.lto_priv.1821 () at ../Python/ceval.c:4166 #63 0x0000000000508537 in _PyFunction_FastCallDict () at ../Python/ceval.c:5075 #64 0x00000000005940d1 in _PyObject_FastCallDict (kwargs=0x0, nargs=2, args=0x7ffcaa452190, func=<function at remote 0x7f263bdbef28>) at ../Objects/abstract.c:2310 #65 _PyObject_Call_Prepend (kwargs=0x0, args=<optimized out>, obj=<optimized out>, func=<function at remote 0x7f263bdbef28>) at ../Objects/abstract.c:2373 #66 method_call.lto_priv () at ../Objects/classobject.c:314 #67 0x0000000000549f41 in PyObject_Call (kwargs=0x0, args=(<DistributedIterator(_enable_get_next_as_optional=False, _iterators=[<_SingleWorkerDatasetIterator(_dataset=<_AutoShardDataset(_input_dataset=<_OptionsDataset(_input_dataset=<_OptionsDataset(_input_dataset=<PrefetchDataset(_input_dataset=<_RebatchDataset(_input_dataset=<MapDataset(_input_dataset=<BatchDataset(_input_dataset=<RepeatDataset(_input_dataset=<MapDataset(_input_dataset=<TensorDataset(_structure=<TensorSpec at remote 0x7f26295ffe10>, _tensors=[<tensorflow.python.framework.ops.EagerTensor at remote 0x7f263d514438>], _variant_tensor_attr=<tensorflow.python.framework.ops.Eager---Type <return> to continue, or q <return> to quit--- Tensor at remote 0x7f263d5148d0>, _self_setattr_tracking=True, 
_self_unconditional_checkpoint_dependencies=[<TrackableReference at remote 0x7f26295ffd80>], _self_unconditional_dependency_names={'_variant_tracker': <_VariantTracker(_resource_handle=<...>, _resource_device='CPU', _resource_deleter=<CapturableResourceDeleter(_destroy_resource=None) at remote 0x7f263afb4400>, _create_resource=<function at remote 0x7f263bb23620>, _sel...(truncated), func=<method at remote 0x7f26914e20c8>) at ../Objects/abstract.c:2261 #68 slot_tp_call () at ../Objects/typeobject.c:6207 #69 0x00000000005a95fc in _PyObject_FastCallDict (kwargs=<optimized out>, nargs=1, args=0x7f25642fdc98, func=<Function(_lock=<_thread.lock at remote 0x7f2564374df0>, _python_function=<function at remote 0x7f2564495f28>,_function_spec=<FunctionSpec(_fullargspec=<FullArgSpec at remote 0x7f25644326d8>, _is_method=False, _default_values=None, _args_to_indices={'input_iterator': 0}, arg_names=['input_iterator'], vararg_name=None, _arg_indices_to_default_values={}, _input_signature=None) at remote 0x7f256435b400>, _autograph=False, _experimental_autograph_options=None, experimental_relax_shapes=False, _experimental_compile=None, _created_variables=[<weakref at remote 0x7f256418ea48>, <weakref atremote 0x7f256418eae8>, <weakref at remote 0x7f256418ebd8>, <weakref at remote 0x7f256418ed18>, <weakref at remote 0x7f256418ed68>, <weakref at remote 0x7f256418eef8>, <weakref at remote 0x7f252832d098>, <weakref at remote 0x7f252832d188>,<weakref at remote 0x7f252832d228>, <weakref at remote 0x7f252832d278>, <weakref at remote 0x7f252832d1d8>, <weakref atremote 0x7f252832d318>, <weakref at remote 0x7f252832d4a8>, <weakref at r...(truncated)) at ../Objects/tupleobject.c:131 #70 _PyObject_FastCallKeywords () at ../Objects/abstract.c:2496 #71 0x0000000000509ad3 in call_function.lto_priv () at ../Python/ceval.c:4875 #72 0x000000000050b4a9 in _PyEval_EvalFrameDefault () at ../Python/ceval.c:3335 #73 0x0000000000507125 in PyEval_EvalFrameEx (throwflag=0, f=Frame 0x7f25642fdaf8, for file 
/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_v2_utils.py, line 86, in execution_function (input_fn=<DistributedIterator(_enable_get_next_as_optional=False, _iterators=[<_SingleWorkerDatasetIterator(_dataset=<_AutoShardDataset(_input_dataset=<_OptionsDataset(_input_dataset=<_OptionsDataset(_input_dataset=<PrefetchDataset(_input_dataset=<_RebatchDataset(_input_dataset=<MapDataset(_input_dataset=<BatchDataset(_input_dataset=<RepeatDataset(_input_dataset=<MapDataset(_input_dataset=<TensorDataset(_structure=<TensorSpec at remote 0x7f26295ffe10>, _tensors=[<tensorflow.python.framework.ops.EagerTensor at remote 0x7f263d514438>], _variant_tensor_attr=<tensorflow.python.framework.ops.EagerTensor at remote 0x7f263d5148d0>, _self_setattr_tracking=True, _self_unconditional_checkpoint_dependencies=[<TrackableReference at remote 0x7f26295ffd80>], _self_unconditional_dependency_names={'_variant_tracker': <_VariantTracker(_resource_handle=<...>, resource...(truncated)) at ../Python/ceval.c:754 #74 _PyEval_EvalCodeWithName.lto_priv.1821 () at ../Python/ceval.c:4166 #75 0x0000000000508fa0 in fast_function.lto_priv () at ../Python/ceval.c:4992 #76 0x000000000050999d in call_function.lto_priv () at ../Python/ceval.c:4872 #77 0x000000000050b4a9 in _PyEval_EvalFrameDefault () at ../Python/ceval.c:3335 #78 0x0000000000507125 in PyEval_EvalFrameEx (throwflag=0, f=Frame 0x689353d8, for file /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_v2.---Type <return> to continue, or q <return> to quit--- py, line 123, in run_one_epoch (model=<Model(_self_setattr_tracking=True, _nested_outputs=<Tensor(_op=<Operation(_graph=<FuncGraph(_lock=<_thread.RLock at remote 0x7f262967f690>, _group_lock=<GroupLock(_ready=<Condition(_lock=<_thread.lockat remote 0x7f260c4a7f30>, acquire=<built-in method acquire of _thread.lock object at remote 0x7f260c4a7f30>, release=<built-in method release of _thread.lock object at remote 
0x7f260c4a7f30>, _waiters=<collections.deque at remote 0x7f260c594730>) at remote 0x7f260c5101d0>, _num_groups=2, _group_member_counts=[0, 0]) at remote 0x7f260c510160>, _nodes_by_id={1: <Operation(_graph=<...>, _inputs_val=None, _id_value=1, _original_op=None, _traceback=<tensorflow_core.python._tf_stack.StackSummary at remote 0x7f260c510f48>, _device_code_locations=[<TraceableObject(obj='', filename='/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/func_graph.py', ...(truncated)) at ../Python/ceval.c:754 #79 _PyEval_EvalCodeWithName.lto_priv.1821 () at ../Python/ceval.c:4166 #80 0x0000000000508fa0 in fast_function.lto_priv () at ../Python/ceval.c:4992 #81 0x000000000050999d in call_function.lto_priv () at ../Python/ceval.c:4872 #82 0x000000000050c36e in _PyEval_EvalFrameDefault () at ../Python/ceval.c:3351 #83 0x0000000000507125 in PyEval_EvalFrameEx (throwflag=0, f=Frame 0x68693178, for file /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_v2.py, line 331, in fit (self=<Loop at remote 0x7f260c5102b0>, model=<Model(_self_setattr_tracking=True, _nested_outputs=<Tensor(_op=<Operation(_graph=<FuncGraph(_lock=<_thread.RLock at remote 0x7f262967f690>, _group_lock=<GroupLock(_ready=<Condition(_lock=<_thread.lock at remote 0x7f260c4a7f30>, acquire=<built-in method acquire of _thread.lock object at remote 0x7f260c4a7f30>, release=<built-in method release of _thread.lock object at remote 0x7f260c4a7f30>, _waiters=<collections.deque at remote 0x7f260c594730>) at remote 0x7f260c5101d0>, _num_groups=2, _group_member_counts=[0, 0]) at remote 0x7f260c510160>, _nodes_by_id={1: <Operation(_graph=<...>, _inputs_val=None, _id_value=1, _original_op=None, _traceback=<tensorflow_core.python._tf_stack.StackSummary at remote 0x7f260c510f48>, _device_code_locations=[<TraceableObject(obj='',filename='/usr/local/lib/python3.6/dist-packages/tensorflow_core/pytho...(truncated)) at ../Python/ceval.c:754 #84 
_PyEval_EvalCodeWithName.lto_priv.1821 () at ../Python/ceval.c:4166
#85 0x0000000000508fa0 in fast_function.lto_priv () at ../Python/ceval.c:4992
#86 0x000000000050999d in call_function.lto_priv () at ../Python/ceval.c:4872
#87 0x000000000050c36e in _PyEval_EvalFrameDefault () at ../Python/ceval.c:3351
#88 0x0000000000507125 in PyEval_EvalFrameEx (throwflag=0, f=Frame 0x7f20bc0086b8, for file /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training.py, line 766, in fit (self=<Model(_self_setattr_tracking=True, _nested_outputs=<Tensor(_op=<Operation(_graph=<FuncGraph(_lock=<_thread.RLock at remote 0x7f262967f690>, _group_lock=<GroupLock(_ready=<Condition(_lock=<_thread.lock at remote 0x7f260c4a7f30>, acquire=<built-in method acquire of _thread.lock object at remote 0x7f260c4a7f30>, release=<built-in method release of _thread.lock object at remote 0x7f260c4a7f30>, _waiters=<collections.deque at remote 0x7f260c594730>) at remote 0x7f260c5101d0>, _num_groups=2, _group_member_counts=[0, 0]) at remote 0x7f260c510160>, _nodes_by_id={1: <Operation(_graph=<...>, _inputs_val=None, _id_value=1, _original_op=None, _traceback=<tensorflow_core.python._tf_stack.StackSummary at remote 0x7f260c510f48>, _device_code_locations=[<TraceableObject(obj='', filename='/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/func_graph.py', lineno=390...(truncated)) at ../Python/ceval.c:754
#89 _PyEval_EvalCodeWithName.lto_priv.1821 () at ../Python/ceval.c:4166
#90 0x0000000000508fa0 in fast_function.lto_priv () at ../Python/ceval.c:4992
#91 0x000000000050999d in call_function.lto_priv () at ../Python/ceval.c:4872
#92 0x000000000050c36e in _PyEval_EvalFrameDefault () at ../Python/ceval.c:3351
#93 0x0000000000507125 in PyEval_EvalFrameEx (throwflag=0, f=Frame 0x52a7658, for file /user/vmarkovtsev/images/hang.py, line 31, in main (sample=<tensorflow.python.framework.ops.EagerTensor at remote 0x7f26295f78d0>, ds_train=<MapDataset(_input_dataset=<BatchDataset(_input_dataset=<RepeatDataset(_input_dataset=<MapDataset(_input_dataset=<TensorDataset(_structure=<TensorSpec at remote 0x7f26295ffe10>, _tensors=[<tensorflow.python.framework.ops.EagerTensor at remote 0x7f263d514438>], _variant_tensor_attr=<tensorflow.python.framework.ops.EagerTensor at remote 0x7f263d5148d0>, _self_setattr_tracking=True, _self_unconditional_checkpoint_dependencies=[<TrackableReference at remote 0x7f26295ffd80>], _self_unconditional_dependency_names={'_variant_tracker': <_VariantTracker(_resource_handle=<...>, _resource_device='CPU', _resource_deleter=<CapturableResourceDeleter(_destroy_resource=None) at remote 0x7f263afb4400>, _create_resource=<function at remote 0x7f263bb23620>, _self_setattr_tracking=True, _self_unconditional_checkpoint_dependencies=[], _self_unconditional_dependency_n...(truncated)) at ../Python/ceval.c:754
#94 _PyEval_EvalCodeWithName.lto_priv.1821 () at ../Python/ceval.c:4166
#95 0x0000000000508fa0 in fast_function.lto_priv () at ../Python/ceval.c:4992
#96 0x000000000050999d in call_function.lto_priv () at ../Python/ceval.c:4872
#97 0x000000000050b4a9 in _PyEval_EvalFrameDefault () at ../Python/ceval.c:3335
#98 0x0000000000507125 in PyEval_EvalFrameEx (throwflag=0, f=Frame 0x20509a8, for file /user/vmarkovtsev/images/hang.py, line 35, in <module> ()) at ../Python/ceval.c:754
#99 _PyEval_EvalCodeWithName.lto_priv.1821 () at ../Python/ceval.c:4166
#100 0x000000000050a3b3 in PyEval_EvalCodeEx (closure=0x0, kwdefs=0x0, defcount=0, defs=0x0, kwcount=0, kws=0x0, argcount=0, args=0x0, locals=<optimized out>, globals=<optimized out>, _co=<optimized out>) at ../Python/ceval.c:4187
#101 PyEval_EvalCode (co=<optimized out>, globals=<optimized out>, locals=<optimized out>) at ../Python/ceval.c:731
#102 0x00000000006349e2 in run_mod () at ../Python/pythonrun.c:1025
#103 0x0000000000634a97 in PyRun_FileExFlags () at ../Python/pythonrun.c:978
#104 0x000000000063824f in PyRun_SimpleFileExFlags () at ../Python/pythonrun.c:419
#105 0x0000000000638425 in PyRun_AnyFileExFlags () at ../Python/pythonrun.c:81
#106 0x0000000000638df1 in run_file (p_cf=0x7ffcaa45361c, filename=<optimized out>, fp=<optimized out>) at ../Modules/main.c:340
#107 Py_Main () at ../Modules/main.c:810
#108 0x00000000004b0de0 in main (argc=2, argv=0x7ffcaa453818) at ../Programs/python.c:69
</pre> </details>

<details> <summary><code>bt</code> of each of the 4 running threads</summary> <pre>
#0 0x00007fa23e7989d0 in nanosleep () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007fa1ec03cffd in tensorflow::(anonymous namespace)::PosixEnv::SleepForMicroseconds(long long) () from /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.2
#2 0x00007fa1f5d2dcd5 in tensorflow::EventMgr::PollLoop() () from /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so
#3 0x00007fa1ec0528d1 in Eigen::ThreadPoolTempl<tensorflow::thread::EigenEnvironment>::WorkerLoop(int) () from /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.2
#4 0x00007fa1ec04feb8 in std::_Function_handler<void (), tensorflow::thread::EigenEnvironment::CreateThread(std::function<void ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) () from /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.2
#5 0x00007fa1ec6a58df in std::execute_native_thread_routine (__p=0x6360ed0) at /dt7-src/libstdc++-v3/src/nonshared11/../c++11/thread.cc:83
#6 0x00007fa23e49c6db in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#7 0x00007fa23e7d588f in clone () from /lib/x86_64-linux-gnu/libc.so.6
</pre> </details>

Speculation

As we can see, there are 4 threads (I guess one for each of my GPUs) which are polling something; together they account for 25-30% CPU load. There are more than a hundred other threads, so I don't know which of them I should <code>bt</code> additionally. I tried different batch sizes, which of course influences memory consumption, but it does not change anything about the hang.

I can provide access to the hardware or run arbitrary commands if needed.

closed time in 5 months

vmarkovtsev

issue commenttensorflow/tensorflow

tf.distribute.MirroredStrategy leads to an infinite polling cycle with 4 GPUs

Thanks, please reopen if needed.

vmarkovtsev

comment created time in 5 months

issue commenttensorflow/tensorflow

Send/Recv of collective_ops hangs in a distributed environment

Thanks for the report. I'll look into this a bit more and get back to you.

whhu

comment created time in 5 months

issue commenttensorflow/tensorflow

All_reduce of collective_ops hangs in a distributed environment

Are you trying to do something like a hierarchical all-reduce?

I haven't tried this myself, but one thing that confuses me is the assignment of group keys and instance keys. You want a unique group key for each set of devices participating in a collective, and a unique instance key for every instance of a collective.
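The key-assignment discipline described above can be sketched as plain-Python bookkeeping. This is only an illustration of the rule (one group key per distinct device set, a fresh instance key per collective op); the class and method names below are made up for the example and are not part of the TensorFlow API, although TF's collective ops do take `group_key` and `instance_key` arguments:

```python
# Illustrative bookkeeping only -- not the TensorFlow collective API itself.
import itertools


class CollectiveKeys:
    """Hands out one group key per device set and a unique instance key
    for every collective op launched."""

    def __init__(self):
        self._group_keys = {}                       # frozenset(devices) -> key
        self._next_group_key = itertools.count(1)
        self._next_instance_key = itertools.count(1)

    def group_key(self, devices):
        """Same device set -> same group key; new device set -> new key."""
        key = frozenset(devices)
        if key not in self._group_keys:
            self._group_keys[key] = next(self._next_group_key)
        return self._group_keys[key]

    def instance_key(self):
        """Every collective instance (one all-reduce, one broadcast)
        gets its own key, even on the same group of devices."""
        return next(self._next_instance_key)


keys = CollectiveKeys()
group = keys.group_key(["/job:worker/task:0", "/job:worker/task:1"])
reduce_key = keys.instance_key()  # first all-reduce on this group
bcast_key = keys.instance_key()   # a later broadcast: a different key
```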

whhu

comment created time in 5 months

issue commenttensorflow/tensorflow

Send/Recv of collective_ops hangs in a distributed environment

In this case it looks like you're actually creating two instances of broadcast collectives but assigning both of them the same instance key. A group defines the set of devices participating in a collective op; an instance is one specific collective, such as a single all-reduce or a single broadcast. Here there is one instance in which task 0 sends and task 1 receives, and another in which task 0 receives and task 1 sends. They should have separate instance keys.
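To make the distinction concrete, here is a minimal sketch of the wrong and right key layouts for the two broadcasts (plain Python with made-up field names, not the actual collective op signature):

```python
# Both broadcasts share one group (the same two tasks), but each broadcast
# is a separate collective instance and therefore needs its own instance key.
GROUP_KEY = 1   # the set {task 0, task 1}
GROUP_SIZE = 2

# Wrong: both broadcast instances reuse instance key 1, so the runtime
# cannot tell the two broadcasts apart and the participants deadlock.
wrong = [
    {"instance_key": 1, "sender": 0, "receiver": 1},
    {"instance_key": 1, "sender": 1, "receiver": 0},
]

# Right: one distinct instance key per broadcast instance.
right = [
    {"instance_key": 1, "sender": 0, "receiver": 1},
    {"instance_key": 2, "sender": 1, "receiver": 0},
]


def keys_are_unique(instances):
    """Each collective instance must carry a distinct instance key."""
    seen = [i["instance_key"] for i in instances]
    return len(seen) == len(set(seen))
```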

whhu

comment created time in 5 months
