
jordwalke/flex 331

Reason CSS Flexbox implementation

jordwalke/esy-issues 49

Easy ISSUES

kainolophobia/news-speak 3

Social Chat

yunxing/Freewifi_in_CAN_PEK_PVG 3

Generates authCodes to be used in Guangzhou Baiyun International Airport

yichengq/replay 2

The Reason Playground

yunxing/git-find-file.el- 2

Find files in a git project quickly. Sort them by last modified time

houjieth/ad_analyze 1

EECS 589 course project

issue comment tensorflow/tensorflow

[XLA] [TPU] It should not be possible to run out of vmem - please file a bug against XLA.

Thanks, YLaCoon!

The improved report should have been submitted to tf-nightly a couple of days ago. It would be great if you could retry your repro and copy-paste the new report -- once we have it, it should be much easier for us to reproduce on the compiler side.

YLaCoon

comment created time in a month

issue comment tensorflow/tensorflow

[XLA] [TPU] It should not be possible to run out of vmem - please file a bug against XLA.

Sorry for the breakage. I don't think our generated report has enough info for us to reproduce and debug this. We have a PR to add more info to our OOM report system that will hopefully show up in the nightly version later this week. Perhaps then you can rerun your custom optimizer and we can reproduce the issue.

In the meantime, I'm assigning this to @saberkun, who may have a way to reproduce it internally; then our compiler team can debug it.

YLaCoon

comment created time in 2 months

issue comment tensorflow/tpu

Support for tf.where without x and y arguments

No problem. Just for clarification, is the segfault a separate bug?

aidangomez

comment created time in 4 months

issue closed tensorflow/tensorflow

[TF 2.0] Using keras.metrics in TPU training results in error

I am trying to train a BERT model from https://github.com/tensorflow/models/tree/master/official/nlp on a TPU in Google Colab. I changed the metrics list passed to the model in the compile method to:

bert_model.compile(optimizer=optimizer, loss=loss_fn, metrics=get_metrics())

where get_metrics is a function that returns a list of metrics ("accuracy" plus instances of the Recall and Precision classes built into tensorflow.keras.metrics):

from tensorflow.keras.metrics import Recall, Precision

def get_metrics():
    return ["accuracy",
            Recall(),
            Precision()]

Training results in the following error (after one epoch ends, before validation statistics are displayed):

I1018 16:34:07.313311 140541208393600 remote.py:151] Entering into master device scope: /job:worker/replica:0/task:0
2019-10-18 16:34:07.359467: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2019-10-18 16:34:07.465723: E tensorflow/stream_executor/cuda/cuda_driver.cc:318] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2019-10-18 16:34:07.465842: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (7b6f1b4d4089): /proc/driver/nvidia/version does not exist
2019-10-18 16:34:07.466260: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-10-18 16:34:07.472748: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2300000000 Hz
2019-10-18 16:34:07.473076: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x3172f40 executing computations on platform Host. Devices:
2019-10-18 16:34:07.473114: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): Host, Default Version
2019-10-18 16:34:07.475920: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:258] Initialize GrpcChannelCache for job worker -> {0 -> 10.29.203.98:8470}
2019-10-18 16:34:07.475955: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:258] Initialize GrpcChannelCache for job localhost -> {0 -> localhost:30501}
2019-10-18 16:34:07.476742: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:365] Started server with target: grpc://localhost:30501
2019-10-18 16:34:07.497844: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:258] Initialize GrpcChannelCache for job worker -> {0 -> 10.29.203.98:8470}
2019-10-18 16:34:07.497905: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:258] Initialize GrpcChannelCache for job localhost -> {0 -> localhost:30501}
INFO:tensorflow:Initializing the TPU system: 10.29.203.98:8470
I1018 16:34:07.499603 140541208393600 tpu_strategy_util.py:70] Initializing the TPU system: 10.29.203.98:8470
INFO:tensorflow:Clearing out eager caches
I1018 16:34:15.119202 140541208393600 tpu_strategy_util.py:94] Clearing out eager caches
INFO:tensorflow:Finished initializing TPU system.
I1018 16:34:15.121769 140541208393600 tpu_strategy_util.py:114] Finished initializing TPU system.
INFO:tensorflow:Found TPU system:
I1018 16:34:15.128222 140541208393600 tpu_system_metadata.py:148] Found TPU system:
INFO:tensorflow:*** Num TPU Cores: 8
I1018 16:34:15.128440 140541208393600 tpu_system_metadata.py:149] *** Num TPU Cores: 8
INFO:tensorflow:*** Num TPU Workers: 1
I1018 16:34:15.129121 140541208393600 tpu_system_metadata.py:150] *** Num TPU Workers: 1
INFO:tensorflow:*** Num TPU Cores Per Worker: 8
I1018 16:34:15.129209 140541208393600 tpu_system_metadata.py:152] *** Num TPU Cores Per Worker: 8
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:localhost/replica:0/task:0/device:CPU:0, CPU, 0, 0)
I1018 16:34:15.129295 140541208393600 tpu_system_metadata.py:154] *** Available Device: _DeviceAttributes(/job:localhost/replica:0/task:0/device:CPU:0, CPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:localhost/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 0, 0)
I1018 16:34:15.129720 140541208393600 tpu_system_metadata.py:154] *** Available Device: _DeviceAttributes(/job:localhost/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:CPU:0, CPU, 0, 0)
I1018 16:34:15.129811 140541208393600 tpu_system_metadata.py:154] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:CPU:0, CPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:0, TPU, 0, 0)
I1018 16:34:15.129892 140541208393600 tpu_system_metadata.py:154] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:0, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:1, TPU, 0, 0)
I1018 16:34:15.129969 140541208393600 tpu_system_metadata.py:154] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:1, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:2, TPU, 0, 0)
I1018 16:34:15.130045 140541208393600 tpu_system_metadata.py:154] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:2, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:3, TPU, 0, 0)
I1018 16:34:15.130121 140541208393600 tpu_system_metadata.py:154] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:3, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:4, TPU, 0, 0)
I1018 16:34:15.130197 140541208393600 tpu_system_metadata.py:154] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:4, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:5, TPU, 0, 0)
I1018 16:34:15.130281 140541208393600 tpu_system_metadata.py:154] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:5, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:6, TPU, 0, 0)
I1018 16:34:15.130358 140541208393600 tpu_system_metadata.py:154] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:6, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:7, TPU, 0, 0)
I1018 16:34:15.130436 140541208393600 tpu_system_metadata.py:154] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:7, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU_SYSTEM:0, TPU_SYSTEM, 0, 0)
I1018 16:34:15.130511 140541208393600 tpu_system_metadata.py:154] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU_SYSTEM:0, TPU_SYSTEM, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 0, 0)
I1018 16:34:15.130593 140541208393600 tpu_system_metadata.py:154] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 0, 0)
I1018 16:34:15.248266 140541208393600 train.py:212] Training using TF 2.0 Keras compile/fit API with distrubuted strategy.
WARNING:tensorflow:Expected a shuffled dataset but input dataset `x` is not shuffled. Please invoke `shuffle()` on input dataset.
W1018 16:35:33.236943 140541208393600 training_utils.py:1547] Expected a shuffled dataset but input dataset `x` is not shuffled. Please invoke `shuffle()` on input dataset.
Train on 129 steps, validate on 65 steps
Epoch 1/5
2019-10-18 16:38:03.018892: I tensorflow/core/profiler/lib/profiler_session.cc:184] Profiler session started.
2019-10-18 16:38:03.020371: E tensorflow/core/platform/default/device_tracer.cc:70] CUDA error: CUDA_ERROR_NO_DEVICE
  1/129 [..............................] - ETA: 5:12:28 - loss: 1.0083 - accuracy: 0.2031 - recall: 0.1719 - precision: 0.2619WARNING:tensorflow:Method (on_train_batch_end) is slow compared to the batch update (1.610206). Check your callbacks.
W1018 16:38:06.456197 140541208393600 callbacks.py:244] Method (on_train_batch_end) is slow compared to the batch update (1.610206). Check your callbacks.
128/129 [============================>.] - ETA: 1s - loss: 0.5022 - accuracy: 0.7563 - recall: 0.5862 - precision: 0.81392019-10-18 16:38:45.271991: E tensorflow/core/distributed_runtime/eager/remote_tensor_handle_data.cc:50] Unable to destroy remote tensor handles: Unable to find the relevant tensor remote_handle: Op ID: 55877, Output num: 0
Additional GRPC error information:
{"created":"@1571416725.271891392","description":"Error received from peer","file":"external/grpc/src/core/lib/surface/call.cc","file_line":1039,"grpc_message":"Unable to find the relevant tensor remote_handle: Op ID: 55877, Output num: 0","grpc_status":3}
2019-10-18 16:38:45.272429: E tensorflow/core/distributed_runtime/eager/remote_tensor_handle_data.cc:50] Unable to destroy remote tensor handles: Unable to find the relevant tensor remote_handle: Op ID: 55877, Output num: 1
Additional GRPC error information:
{"created":"@1571416725.272350919","description":"Error received from peer","file":"external/grpc/src/core/lib/surface/call.cc","file_line":1039,"grpc_message":"Unable to find the relevant tensor remote_handle: Op ID: 55877, Output num: 1","grpc_status":3}
2019-10-18 16:38:45.272841: E tensorflow/core/distributed_runtime/eager/remote_tensor_handle_data.cc:50] Unable to destroy remote tensor handles: Unable to find the relevant tensor remote_handle: Op ID: 55877, Output num: 2
Additional GRPC error information:
{"created":"@1571416725.272756237","description":"Error received from peer","file":"external/grpc/src/core/lib/surface/call.cc","file_line":1039,"grpc_message":"Unable to find the relevant tensor remote_handle: Op ID: 55877, Output num: 2","grpc_status":3}
2019-10-18 16:38:45.273165: E tensorflow/core/distributed_runtime/eager/remote_tensor_handle_data.cc:50] Unable to destroy remote tensor handles: Unable to find the relevant tensor remote_handle: Op ID: 55877, Output num: 3
Additional GRPC error information:
{"created":"@1571416725.273105048","description":"Error received from peer","file":"external/grpc/src/core/lib/surface/call.cc","file_line":1039,"grpc_message":"Unable to find the relevant tensor remote_handle: Op ID: 55877, Output num: 3","grpc_status":3}
Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/gdrive/My Drive/DeepLearningBERT/nn/train.py", line 340, in <module>
    app.run(main)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "/gdrive/My Drive/DeepLearningBERT/nn/train.py", line 332, in main
    run_bert(strategy, input_meta_data)
  File "/gdrive/My Drive/DeepLearningBERT/nn/train.py", line 287, in run_bert
    use_keras_compile_fit=FLAGS.use_keras_compile_fit)
  File "/gdrive/My Drive/DeepLearningBERT/nn/train.py", line 226, in run_bert_classifier
    custom_callbacks=None)
  File "/gdrive/My Drive/DeepLearningBERT/nn/train.py", line 143, in run_keras_compile_fit
    callbacks=custom_callbacks)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training.py", line 728, in fit
    use_multiprocessing=use_multiprocessing)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_distributed.py", line 685, in fit
    steps_name='steps_per_epoch')
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_arrays.py", line 439, in model_iteration
    steps_name='validation_steps')
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_arrays.py", line 299, in model_iteration
    batch_outs = f(actual_inputs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/distribute/distributed_training_utils.py", line 878, in execution_function
    return [out.numpy() for out in distributed_function(input_fn)]
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/def_function.py", line 457, in __call__
    result = self._call(*args, **kwds)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/def_function.py", line 526, in _call
    return self._concrete_stateful_fn._filtered_call(canon_args, canon_kwds)  # pylint: disable=protected-access
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/function.py", line 1141, in _filtered_call
    self.captured_inputs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/function.py", line 1224, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/function.py", line 511, in call
    ctx=ctx)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/execute.py", line 67, in quick_execute
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.UnimplementedError:  Compilation failure: Asked to propagate a dynamic dimension from hlo %tuple.5198 = (pred[], f32[4,2]{1,0}) tuple(pred[] %convert.5196, f32[4,2]{1,0} %add.5004), metadata={op_type="If" op_name="metrics/precision/assert_greater_equal/Assert/AssertGuard"}@{1}@0 to hlo %conditional.5209 = (pred[]) conditional(pred[] %convert.5196, (pred[], f32[4,2]{1,0}) %tuple.5198, (pred[], f32[4,2]{1,0}) %tuple.5198), true_computation=%metrics_precision_assert_greater_equal_Assert_AssertGuard_true_127733_const_0__.5199, false_computation=%metrics_precision_assert_greater_equal_Assert_AssertGuard_false_127734_const_0__.5204, metadata={op_type="If" op_name="metrics/precision/assert_greater_equal/Assert/AssertGuard"}, which is not implemented.
	TPU compilation failed
	 [[{{node tpu_compile_succeeded_assert/_6193329545322784681/_7}}]]
Additional GRPC error information:
{"created":"@1571416725.270929013","description":"Error received from peer","file":"external/grpc/src/core/lib/surface/call.cc","file_line":1039,"grpc_message":" Compilation failure: Asked to propagate a dynamic dimension from hlo %tuple.5198 = (pred[], f32[4,2]{1,0}) tuple(pred[] %convert.5196, f32[4,2]{1,0} %add.5004), metadata={op_type="If" op_name="metrics/precision/assert_greater_equal/Assert/AssertGuard"}@{1}@0 to hlo %conditional.5209 = (pred[]) conditional(pred[] %convert.5196, (pred[], f32[4,2]{1,0}) %tuple.5198, (pred[], f32[4,2]{1,0}) %tuple.5198), true_computation=%metrics_precision_assert_greater_equal_Assert_AssertGuard_true_127733_const_0__.5199, false_computation=%metrics_precision_assert_greater_equal_Assert_AssertGuard_false_127734_const_0__.5204, metadata={op_type="If" op_name="metrics/precision/assert_greater_equal/Assert/AssertGuard"}, which is not implemented.\n\tTPU compilation failed\n\t [[{{node tpu_compile_succeeded_assert/_6193329545322784681/_7}}]]","grpc_status":12} [Op:__inference_distributed_function_127913]

Function call stack:
distributed_function -> distributed_function

2019-10-18 16:38:53.401848: E tensorflow/core/distributed_runtime/rpc/eager/grpc_eager_client.cc:72] Remote EagerContext with id 6450803200035565614 does not seem to exist.

With only "accuracy" returned it works well finishing all epochs. With custom metrics like:

import tensorflow as tf

eps = 1e-7  # small constant to avoid division by zero

def precision(y_true, y_pred):
    y_pred = tf.math.rint(y_pred)                    # round predictions to 0/1
    TP = tf.math.reduce_sum(y_pred * y_true)         # true positives
    FP = tf.math.reduce_sum(y_pred * (1 - y_true))   # false positives

    _precision = tf.math.divide(TP, (TP + FP + eps))
    return _precision

it works as well, but the values returned are not correct. I suppose this happens because the TPU computes X steps per loop and this somehow (I didn't dig too deeply into it) messes up the reported metric. I tried the built-in metrics to verify the behavior, but that resulted in the error above.
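
One workaround I considered is a stateful metric that accumulates TP/FP counts in variables, so that Keras aggregates variable state across the loop instead of averaging per-batch return values; a minimal sketch of the idea (untested on TPU, purely illustrative):

import tensorflow as tf

class StatefulPrecision(tf.keras.metrics.Metric):
    """Accumulates TP/FP counts across batches instead of averaging per-batch values."""

    def __init__(self, name='stateful_precision', **kwargs):
        super(StatefulPrecision, self).__init__(name=name, **kwargs)
        self.tp = self.add_weight(name='tp', initializer='zeros')
        self.fp = self.add_weight(name='fp', initializer='zeros')

    def update_state(self, y_true, y_pred, sample_weight=None):
        y_true = tf.cast(y_true, tf.float32)
        y_pred = tf.math.rint(tf.cast(y_pred, tf.float32))  # round to 0/1
        self.tp.assign_add(tf.reduce_sum(y_pred * y_true))
        self.fp.assign_add(tf.reduce_sum(y_pred * (1.0 - y_true)))

    def result(self):
        # Precision computed once from the accumulated counts.
        return self.tp / (self.tp + self.fp + 1e-7)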

Snippet of the training call (the function is called run_keras_compile_fit in the GitHub link I provided; it can be found in bert/run_classifier.py with almost no custom code added):

    with strategy.scope():
        training_dataset = train_input_fn()
        evaluation_dataset = eval_input_fn()
        bert_model, sub_model = model_fn()
        optimizer = bert_model.optimizer

        if init_checkpoint:
            checkpoint = tf.train.Checkpoint(model=sub_model)
            checkpoint.restore(init_checkpoint).assert_existing_objects_matched()

        bert_model.compile(optimizer=optimizer, loss=loss_fn, metrics=get_metrics())

        summary_dir = os.path.join(model_dir, 'summaries')
        summary_callback = tf.keras.callbacks.TensorBoard(summary_dir)
        checkpoint_path = os.path.join(model_dir, 'checkpoint')
        checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
            checkpoint_path, save_weights_only=True, save_best_only=True, mode='min')

        if custom_callbacks is not None:
            custom_callbacks += [summary_callback, checkpoint_callback]
        else:
            custom_callbacks = [summary_callback, checkpoint_callback]

        bert_model.fit(
            x=training_dataset,
            validation_data=evaluation_dataset,
            steps_per_epoch=steps_per_epoch,
            epochs=epochs,
            validation_steps=eval_steps,
            callbacks=custom_callbacks)

        return bert_model

In Colab I installed the stable release of TensorFlow 2.0, as the nightly version doesn't currently work well with Colab's TPUs. Are the Keras metrics supposed to work with TPUs, or is this not yet supported?

closed time in 4 months

georgealexandruvlad

issue comment tensorflow/tensorflow

[TF 2.0] Using keras.metrics in TPU training results in error

Missed the notification. This should be fixed in the nightly releases -- do you have access to those? I believe we also have a 1.x nightly release, which should also include the fix. cc @rxsang, who is more familiar with this than me.
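
For reference, one way to pull a nightly build into a Colab runtime and verify the version afterwards; tf-nightly is the standard 2.x nightly package (I'm less sure of the exact 1.x package name, so please check PyPI for that one):

import subprocess
import sys

# Upgrade to the latest 2.x nightly in the current environment (e.g. a Colab
# cell), then restart the runtime so the new version is actually imported.
subprocess.check_call([sys.executable, '-m', 'pip', 'install', '--upgrade', 'tf-nightly'])

# After restarting:
#   import tensorflow as tf
#   print(tf.__version__)   # nightly versions look like '2.1.0-devYYYYMMDD'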

georgealexandruvlad

comment created time in 4 months

issue comment tensorflow/tpu

Support for tf.where without x and y arguments

We only added that support a week ago -- could you try tf-nightly and let us know if it works?
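
For reference, the call in question is the single-argument form of tf.where, which returns the indices of true elements and therefore has a data-dependent output shape; a quick sanity check you could run on tf-nightly (values are illustrative):

import tensorflow as tf

x = tf.constant([0.0, 1.5, 0.0, 2.0])
# Single-argument tf.where returns a [num_true, rank] tensor of indices,
# so the first dimension is only known at run time.
indices = tf.where(x > 0.0)
print(indices.numpy())  # [[1], [3]]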

aidangomez

comment created time in 4 months

issue comment tensorflow/tensorflow

[TF 2.0] Using keras.metrics in TPU training results in error

Hi ahmadsalim@, this issue should already have been fixed a while ago. Which TF version are you using?

georgealexandruvlad

comment created time in 4 months

issue comment tensorflow/tensorflow

[TF 2.0] Using keras.metrics in TPU training results in error

@georgealexandruvlad

Hi, sorry about the breakage. The internal version of this issue got routed to me yesterday, and we should have a fix out today (at least in our nightly release).

The root cause is that our compiler has trouble handling conditionals with dynamic shapes, which are introduced by the "Assert" operation in Metric.

@rxsang also added an option to disable the dynamic shape behavior; IIRC you can enable that by setting strategy.experimental_enable_dynamic_batch_size = False.
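
For reference, a minimal sketch of where that flag would go, assuming the usual Colab TPU setup (the attribute name is taken from my comment above; whether it is exposed may depend on your TF version):

import tensorflow as tf

# TPU address taken from the report's logs; substitute your own worker.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='grpc://10.29.203.98:8470')
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.experimental.TPUStrategy(resolver)

# Ask the strategy to treat batch sizes as static, so the compiler never sees
# the dynamic-shape conditionals that trigger the failure above.
strategy.experimental_enable_dynamic_batch_size = False

with strategy.scope():
    pass  # build and compile the model here, as in the snippet in the report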

georgealexandruvlad

comment created time in 4 months

issue comment tensorflow/tensorflow

Combo TPU/TFRecords for model.evaluate is not working

This should already be fixed in our internal branch. @rxsang, is there a way to disable dynamic shapes for this case?

anhmeow

comment created time in 5 months
