Hang on out of memory error
You can collect some of this information using our environment capture
You can also obtain the TensorFlow version with:
1. TF 1.0: python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"
2. TF 2.0: python -c "import tensorflow as tf; print(tf.version.GIT_VERSION, tf.version.VERSION)"
Describe the current behavior
TensorFlow runs out of memory, prints the out-of-memory message, and then hangs instead of exiting.
Describe the expected behavior
TensorFlow should exit with a non-zero return code on OOM.
Standalone code to reproduce the issue
    import tensorflow as tf
    from tensorflow.keras import backend as K
    import numpy as np

    def random_image_generator(batch_size, num_classes, input_shape):
        templates = 2 * num_classes * np.random.random((num_classes,) + input_shape)
        random_data = np.random.normal(loc=0, scale=1., size=input_shape)
        while True:
            y = np.random.randint(0, num_classes, size=(batch_size,))
            x = np.zeros((batch_size,) + input_shape, dtype=np.float32)
            for i in range(batch_size):
                x[i] = templates[y[i]] + random_data
            x_array = np.array(x)
            y_array = tf.keras.utils.to_categorical(y, num_classes)
            yield(x_array, y_array)

    def run_model():
        K.set_image_data_format('channels_first')
        image_dim = 5000
        input_shape = (3, image_dim, image_dim)
        num_classes = 15
        batch_size = 1
        model_class = tf.keras.applications.ResNet50
        model = model_class(weights=None, include_top=True, input_shape=input_shape, classes=num_classes)
        model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
        random_generator = random_image_generator(batch_size, num_classes, input_shape)
        model.fit(random_generator, steps_per_epoch=10, epochs=1)

    run_model()
Other info / logs
This program hangs after dumping the out of memory error on 16GB and 32GB GPUs (P100 and V100 tested). The program used to exit on TensorFlow 1.15. The hang happens with both the 2.1.0 and nightly containers on Intel x86 systems.
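To tell a hang apart from a clean non-zero exit when testing across TensorFlow versions, the reproducer can be launched under a timeout. The following is a minimal sketch using only the Python standard library; the helper name `run_with_timeout` and the script name `repro.py` in the usage note are hypothetical and not part of the original report.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd as a child process.

    Returns ("exited", returncode) if the child finishes within timeout_s
    seconds, or ("hung", None) if it had to be killed after the timeout.
    """
    proc = subprocess.Popen(cmd)
    try:
        returncode = proc.wait(timeout=timeout_s)
        return ("exited", returncode)
    except subprocess.TimeoutExpired:
        # The child is still running: treat it as hung and clean it up.
        proc.kill()
        proc.wait()
        return ("hung", None)
```

Assuming the reproducer above is saved as `repro.py`, a call such as `run_with_timeout([sys.executable, "repro.py"], 600)` would report `("hung", None)` for the behavior described here and `("exited", <non-zero code>)` for the expected behavior. The 600-second limit is an arbitrary choice and should comfortably exceed the time the script needs to reach the OOM.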
I originally hit this on built-from-source TensorFlow 2.1.0 on ppc64le. On that system, I attached gdb and dumped the stacks. The code appears to be hanging in the three thread stacks noted in the attachment: threeThreadStacks.txt
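The gdb dump shows the native stacks. As a complement, the Python-level stacks can be captured with the standard library's `faulthandler` module, which may help show whether the generator function is still running at the time of the hang. This is a sketch of one possible approach, not something done in the original report; the helper names `arm_hang_dump` and `disarm_hang_dump` are hypothetical.

```python
import faulthandler
import sys


def arm_hang_dump(timeout_s, exit_after_dump=True, file=sys.stderr):
    """Schedule a dump of every Python thread's stack after timeout_s seconds.

    If exit_after_dump is True, the process exits with a non-zero status
    right after the dump, so a hang turns into a visible failure.
    """
    faulthandler.dump_traceback_later(timeout_s, exit=exit_after_dump, file=file)


def disarm_hang_dump():
    """Cancel the pending dump, e.g. after training finished normally."""
    faulthandler.cancel_dump_traceback_later()
```

In the reproducer, `arm_hang_dump(600)` could be called before `run_model()` and `disarm_hang_dump()` after it. The timeout value is an assumption and needs to be longer than a normal run.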
Answer by aaudiber
I don't think the issue is with ParallelMapIterator. It was moved between 2.0.0 and 2.1.0, but it has always had the logic of waiting for outstanding calls to finish during destruction: https://github.com/tensorflow/tensorflow/blob/v2.0.0/tensorflow/core/kernels/data/parallel_map_iterator.cc#L70-L79
From the stack trace
#6 0x00007fff462e9cd4 in tensorflow::condition_variable::wait
#7 0x00007fff4114079c in tensorflow::data::InstantiatedCapturedFunction::RunWithBorrowedArgs
#8 0x00007fff40d88d3c in tensorflow::data::GeneratorDatasetOp::Dataset::Iterator::GetNextInternal
it looks like we're getting stuck here: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/kernels/data/captured_function.cc#L717-L721. We call lib_->Run to invoke the Python function, and lib_->Run is supposed to call Notify() when the Python function completes (whether or not it succeeds). For some reason it looks like that callback never happens. It isn't clear whether that's because the Python function itself never completes, or because lib_->Run fails to call Notify on some error-handling code path. If I could reproduce, I would add additional logging to see what happens in
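The failure mode suspected above, a caller blocked on a notification that an error path never sends, can be illustrated with a small Python analogy. This is not TensorFlow's C++ code: `run_async`, `run_async_buggy`, and `run_and_wait` are hypothetical names, and `threading.Event` stands in for the notification object that the waiting thread blocks on.

```python
import threading


def run_async(fn, done):
    """Run fn on a worker thread and always call done(error) afterwards."""
    def worker():
        try:
            fn()
        except Exception as exc:
            done(exc)   # the error path still notifies the waiter
        else:
            done(None)
    threading.Thread(target=worker, daemon=True).start()


def run_async_buggy(fn, done):
    """Same as run_async, but the error path forgets to call done."""
    def worker():
        try:
            fn()
        except Exception:
            return      # bug: callback dropped, the waiter is never woken
        done(None)
    threading.Thread(target=worker, daemon=True).start()


def run_and_wait(runner, fn, timeout_s=None):
    """Start fn via runner and block until its callback fires.

    Returns (notified, error). With timeout_s=None this blocks forever
    if the callback is never invoked, which is the hang in miniature.
    """
    finished = threading.Event()
    result = {}

    def done(error):
        result["error"] = error
        finished.set()

    runner(fn, done)
    notified = finished.wait(timeout_s)
    return notified, result.get("error")
```

With `run_async`, a function that raises still wakes the waiter and the error is reported. With `run_async_buggy`, the same failure leaves `run_and_wait` blocked until the timeout expires, or indefinitely when no timeout is given.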