Kaixi Hou (houtoms) · @NVIDIA · Santa Clara · https://houtoms.github.io/ · Deep learning, GPUs, High performance computing

houtoms/gpu_unified_cache 2

GPU-UniCache: Automatic Code Generation of Spatial Blocking for Stencils on GPUs

houtoms/benchmarking_cmp_op 0

Benchmarking comparison operations of different data types, i.e., int, float, and double

houtoms/config_files 0

Commonly used config files in Linux

houtoms/models 0

Models and examples built with TensorFlow

houtoms/tensorflow 0

An Open Source Machine Learning Framework for Everyone

issue comment tensorflow/tensorflow

Bidirectional LSTM fail on TF2.0

@rodrigoruiz Thanks. By the way, was this log for the failed training on your machine?

PierrePivert

comment created time in 3 hours

issue comment tensorflow/tensorflow

Bidirectional LSTM fail on TF2.0

@rodrigoruiz, I am not familiar with Colab, but I just played with it and I can get the log with, for example: https://colab.research.google.com/drive/1ERNMTEAboXG4EE2nx2L-8BTXgyOq_pHm

PierrePivert

comment created time in 16 hours

issue comment tensorflow/tensorflow

Bidirectional LSTM fail on TF2.0

Basically, the first env variable tells cuDNN to generate the call logs, and the second specifies where the logs should go.

In your case, maybe you don't have access to /tmp, so please try a local path that you do have access to.
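
A minimal sketch of doing the same from Python, assuming only that the chosen log path is writable (cuDNN substitutes the process id for %i):

import os

# Must be set before TF initializes cuDNN (i.e., before the first GPU op).
os.environ["CUDNN_LOGINFO_DBG"] = "1"
os.environ["CUDNN_LOGDEST_DBG"] = os.path.expanduser("~/cudnn_api_log.%i")

import tensorflow as tf  # import TF only after the env vars are in place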

PierrePivert

comment created time in a day

issue comment tensorflow/tensorflow

Bidirectional LSTM fail on TF2.0

@rodrigoruiz Was any log file generated in /tmp?

PierrePivert

comment created time in a day

issue comment tensorflow/tensorflow

Bidirectional LSTM fail on TF2.0

@rodrigoruiz Then, could you please try this:

export CUDNN_LOGINFO_DBG=1
export CUDNN_LOGDEST_DBG="/tmp/cudnn_api_log.%i"

suggested here https://github.com/tensorflow/tensorflow/issues/34094#issuecomment-591088834

PierrePivert

comment created time in a day

issue comment tensorflow/tensorflow

Bidirectional LSTM fail on TF2.0

@rodrigoruiz Did you by any chance try your code on Ubuntu?

PierrePivert

comment created time in 3 days

issue comment tensorflow/tensorflow

Bidirectional LSTM fail on TF2.0

@xuxingya, let me find an Ubuntu machine with an RTX 2080 Ti. Is the failed case using the same code as in https://github.com/tensorflow/tensorflow/issues/33924#issuecomment-586770425? What changes should I make to repro the issue?

PierrePivert

comment created time in 3 days

issue comment tensorflow/tensorflow

Bidirectional LSTM fail on TF2.0

@rodrigoruiz Thx for providing the repro script. I just ran it, but I didn't encounter any errors on my local machine (TF1/TF2 + GV100) or on Colab with TF1/TF2 + GPU. Btw, Colab with TF1 + GPU was extremely slow and I think it didn't use the GPU. Logs from Colab + TF2 + GPU:

Train on 3000 samples, validate on 3000 samples
3000/3000 [==============================] - 101s 34ms/sample - loss: 0.6947 - accuracy: 0.5007 - precision_3: 0.4962 - recall_3: 0.4382 - val_loss: 0.6927 - val_accuracy: 0.5380 - val_precision_3: 0.5244 - val_recall_3: 0.7379

Logs from Colab + TF1 + GPU:

The default version of TensorFlow in Colab will soon switch to TensorFlow 2.x.
We recommend you upgrade now or ensure your notebook will continue to use TensorFlow 1.x via the %tensorflow_version 1.x magic: more info.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/resource_variable_ops.py:1630: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/initializers.py:119: calling RandomUniform.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/init_ops.py:97: calling GlorotUniform.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/init_ops.py:97: calling Orthogonal.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/init_ops.py:97: calling Zeros.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/nn_impl.py:183: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
Train on 3000 samples, validate on 3000 samples
3000/3000 [==============================] - 2289s 763ms/sample - loss: 0.6942 - acc: 0.5060 - precision: 0.5048 - recall: 0.6596 - val_loss: 0.6943 - val_acc: 0.4997 - val_precision: 0.0000e+00 - val_recall: 0.0000e+00
PierrePivert

comment created time in 9 days

started wuye9036/CppTemplateTutorial

started time in 12 days

issue comment tensorflow/tensorflow

Bidirectional LSTM fail on TF2.0

It looks like you have the correct cuDNN version. @rodrigoruiz Can you send me a repro?

PierrePivert

comment created time in 14 days

issue comment tensorflow/tensorflow

Bidirectional LSTM fail on TF2.0

> I'm not sure how to check the cuDNN version, but my CUDA version is 10.1.
>
> nvcc: NVIDIA (R) Cuda compiler driver
> Copyright (c) 2005-2019 NVIDIA Corporation
> Built on Sun_Jul_28_19:12:52_Pacific_Daylight_Time_2019
> Cuda compilation tools, release 10.1, V10.1.243
>
> By checking the actual dll, I see Product version: 6.14.11.10010, which is the latest from https://developer.nvidia.com/rdp/cudnn-download.

Can you find your cudnn.h file? The version should be found there.
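
A quick way to check from Python, assuming a TF release new enough to expose build info (tf.sysconfig.get_build_info is not in every 2.x release); otherwise grep CUDNN_MAJOR in cudnn.h:

import tensorflow as tf

# Reports the cuDNN/CUDA versions this TF binary was built against
# (the build-time versions, not necessarily the DLL loaded at runtime).
info = tf.sysconfig.get_build_info()
print("cuDNN:", info.get("cudnn_version"), "CUDA:", info.get("cuda_version"))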

PierrePivert

comment created time in 15 days

issue comment tensorflow/tensorflow

slim.separable_conv2d is too slow

Yes, we follow https://docs.nvidia.com/deeplearning/sdk/cudnn-release-notes/rel_763.html#rel_763 to enable the fast depthwise cuDNN paths.

BKZero

comment created time in 16 days

pull request comment tensorflow/tensorflow

Support CUDNN depthwise convolution

@byronyi Yes, I think this is intended. As https://docs.nvidia.com/deeplearning/sdk/cudnn-release-notes/rel_763.html#rel_763 mentions, only the wgrad is supported when NHWC is used. Future cuDNN releases will include more fast NHWC kernels, and the conditions will then be relaxed.

houtoms

comment created time in 16 days

issue comment tensorflow/tensorflow

Bidirectional LSTM fail on TF2.0

Can you try a newer cuDNN? It seems cuDNN 7.6.5 fixed a related issue:

> Fixed a lack-of-synchronization issue when cudnnRNNBackwardData() and cudnnRNNBackwardDataEx() calls a kernel that is not synchronized back to the application's stream. This issue only appears when users are using bidirectional RNN using algo of CUDNN_RNN_ALGO_STANDARD. This issue affects cuDNN versions 5 through 7.6.4.

https://docs.nvidia.com/deeplearning/sdk/cudnn-release-notes/rel_765.html#rel_765

PierrePivert

comment created time in 16 days

issue closed tensorflow/tensorflow

Conv2DTranspose shape gets None when exporting SavedModel

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): No
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 16.04
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: NA
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version (use command below): TF2.1
  • Python version: 3.6
  • Bazel version (if compiling from source): NA
  • GCC/Compiler version (if compiling from source): NA
  • CUDA/cuDNN version: 10.2/7.6
  • GPU model and memory: V100, 32Gb

Describe the current behavior
The shape information after the Conv2DTranspose layer is incomplete: the spatial dimensions are missing when exporting the saved model.

Describe the expected behavior
The shape information should include the spatial dimensions when exporting the saved model.

Code to reproduce the issue

import tensorflow as tf

def _crop_and_concat(inputs, residual_input):
  factor = inputs.get_shape().dims[1].value / residual_input.get_shape().dims[1].value
  return tf.concat([inputs, tf.image.central_crop(residual_input, factor)], axis=-1)

class UNet(tf.keras.Model):
  def __init__(self, name):
    super(UNet, self).__init__(name)
    self.conv1 = tf.keras.layers.Conv2D(filters=8,
                                        kernel_size=(3, 3),
                                        activation=tf.nn.relu)
    self.conv2 = tf.keras.layers.Conv2D(filters=8,
                                        kernel_size=(3, 3),
                                        activation=tf.nn.relu)
    self.maxpool = tf.keras.layers.MaxPool2D(pool_size=(2, 2),
                                             strides=2)
    self.deconv = tf.keras.layers.Conv2DTranspose(filters=16,
                                                  kernel_size=(2, 2),
                                                  strides=(2, 2),
                                                  padding='same',
                                                  activation=tf.nn.relu)
    self.conv3 = tf.keras.layers.Conv2D(filters=8,
                                        kernel_size=(3, 3),
                                        activation=tf.nn.relu)

  @tf.function
  def call(self, x):
    print(">>> Input Shape", x.shape)
    out = self.conv1(x)
    print(">>> conv1 Shape", out.shape)
    skip = self.conv2(out)
    print(">>> conv2 Shape", skip.shape)
    out = self.maxpool(skip)
    print(">>> maxpool Shape", out.shape)
    out = self.deconv(out)
    # the deconv shape will be (None, None, None, 16) when exporting saved model
    print(">>> deconv Shape", out.shape)
    out = self.conv3(out)
    out = _crop_and_concat(out, skip)

    return out


model = UNet("dummy")

res = model.predict(tf.ones((1, 400, 400, 1)))

print("Finish prediction")

tf.keras.models.save_model(model, "/results/SavedModel",
    save_format="tf", overwrite=True, include_optimizer=False)

Other info / logs

>>> Input Shape (None, 400, 400, 1)
>>> conv1 Shape (None, 398, 398, 8)
>>> conv2 Shape (None, 396, 396, 8)
>>> maxpool Shape (None, 198, 198, 8)
>>> deconv Shape (None, 396, 396, 16)
>>> Input Shape (None, 400, 400, 1)
>>> conv1 Shape (None, 398, 398, 8)
>>> conv2 Shape (None, 396, 396, 8)
>>> maxpool Shape (None, 198, 198, 8)
>>> deconv Shape (None, 396, 396, 16)
Finish prediction
>>> Input Shape (None, 400, 400, 1)
>>> conv1 Shape (None, 398, 398, 8)
>>> conv2 Shape (None, 396, 396, 8)
>>> maxpool Shape (None, 198, 198, 8)
>>> deconv Shape (None, None, None, 16)
... <Then the TF crashes and complains about the None value since we need it to compute the fraction in _crop_and_concat()> ...

The above log shows that during prediction all shapes are correctly inferred. But when we are exporting the saved model, the shape becomes incomplete: the spatial info is lost only after the deconv (Conv2DTranspose), while all the other layers still look fine with correct shape info. So we have two questions: (1) Why do we need to recalculate the shapes when exporting the saved model? (2) Why is the spatial info lost only after the deconv? This one looks like a bug.
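
A possible workaround sketch (an assumption, not something from the thread): re-assert the statically known shape right after the deconv inside call(), since strides=(2, 2) with padding='same' exactly doubles the 198x198 input to 396x396.

    # Hypothetical patch inside call(): pin the deconv output back to its
    # statically known shape so downstream layers keep full shape info.
    out = self.deconv(out)
    out = tf.ensure_shape(out, [None, 396, 396, 16])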

FYI @nluehr

closed time in 23 days

houtoms

issue comment tensorflow/tensorflow

Conv2DTranspose shape gets None when exporting SavedModel

Thx. Closing.

houtoms

comment created time in 23 days

issue comment tensorflow/tensorflow

Masking LSTM: OP_REQUIRES failed at cudnn_rnn_ops.cc:1498 : Unknown: CUDNN_STATUS_BAD_PARAM

Sure. Let me check with the cuDNN team; I will update soon.

mimxrt

comment created time in a month

pull request comment tensorflow/tensorflow

Add support to cuDNN CTC loss

Thx for the update.

houtoms

comment created time in a month

issue comment tensorflow/tensorflow

Conv2DTranspose shape gets None when exporting SavedModel

@jvishnuvardhan Thx for the update. After reading the doc at https://www.tensorflow.org/guide/concrete_function,

> Note: tf.saved_model retraces all concrete_functions when saving them. This is to ensure that the exported concrete functions capture changes in the environment on export (e.g. distribution strategy scope).

I think my first question in the description is answered. But the issue of the missing spatial info after the deconv is still open.

houtoms

comment created time in a month

pull request comment tensorflow/tensorflow

Add support to cuDNN CTC loss

Anything else I can do? Thx.

houtoms

comment created time in a month

issue opened tensorflow/tensorflow

Conv2DTranspose shape gets None when exporting SavedModel

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): No
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 16.04
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: NA
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version (use command below): TF2.1
  • Python version: 3.6
  • Bazel version (if compiling from source): NA
  • GCC/Compiler version (if compiling from source): NA
  • CUDA/cuDNN version: 10.2/7.6
  • GPU model and memory: V100, 32Gb

Describe the current behavior
The shape information after the Conv2DTranspose layer is incomplete: the spatial dimensions are missing when exporting the saved model.

Describe the expected behavior
The shape information should include the spatial dimensions when exporting the saved model.

Code to reproduce the issue

import tensorflow as tf

def _crop_and_concat(inputs, residual_input):
  factor = inputs.get_shape().dims[1].value / residual_input.get_shape().dims[1].value
  return tf.concat([inputs, tf.image.central_crop(residual_input, factor)], axis=-1)

class UNet(tf.keras.Model):
  def __init__(self, name):
    super(UNet, self).__init__(name)
    self.conv1 = tf.keras.layers.Conv2D(filters=8,
                                        kernel_size=(3, 3),
                                        activation=tf.nn.relu)
    self.conv2 = tf.keras.layers.Conv2D(filters=8,
                                        kernel_size=(3, 3),
                                        activation=tf.nn.relu)
    self.maxpool = tf.keras.layers.MaxPool2D(pool_size=(2, 2),
                                             strides=2)
    self.deconv = tf.keras.layers.Conv2DTranspose(filters=16,
                                                  kernel_size=(2, 2),
                                                  strides=(2, 2),
                                                  padding='same',
                                                  activation=tf.nn.relu)
    self.conv3 = tf.keras.layers.Conv2D(filters=8,
                                        kernel_size=(3, 3),
                                        activation=tf.nn.relu)

  @tf.function
  def call(self, x):
    print(">>> Input Shape", x.shape)
    out = self.conv1(x)
    print(">>> conv1 Shape", out.shape)
    skip = self.conv2(out)
    print(">>> conv2 Shape", skip.shape)
    out = self.maxpool(skip)
    print(">>> maxpool Shape", out.shape)
    out = self.deconv(out)
    # the deconv shape will be (None, None, None, 16) when exporting saved model
    print(">>> deconv Shape", out.shape)
    out = self.conv3(out)
    out = _crop_and_concat(out, skip)

    return out


model = UNet("dummy")

res = model.predict(tf.ones((1, 400, 400, 1)))

print("Finish prediction")

tf.keras.models.save_model(model, "/results/SavedModel",
    save_format="tf", overwrite=True, include_optimizer=False)

Other info / logs

>>> Input Shape (None, 400, 400, 1)
>>> conv1 Shape (None, 398, 398, 8)
>>> conv2 Shape (None, 396, 396, 8)
>>> maxpool Shape (None, 198, 198, 8)
>>> deconv Shape (None, 396, 396, 16)
>>> Input Shape (None, 400, 400, 1)
>>> conv1 Shape (None, 398, 398, 8)
>>> conv2 Shape (None, 396, 396, 8)
>>> maxpool Shape (None, 198, 198, 8)
>>> deconv Shape (None, 396, 396, 16)
Finish prediction
>>> Input Shape (None, 400, 400, 1)
>>> conv1 Shape (None, 398, 398, 8)
>>> conv2 Shape (None, 396, 396, 8)
>>> maxpool Shape (None, 198, 198, 8)
>>> deconv Shape (None, None, None, 16)
... <Then the TF crashes and complains about the None value since we need it to compute the fraction in _crop_and_concat()> ...

The above log shows that during prediction all shapes are correctly inferred. But when we are exporting the saved model, the shape becomes incomplete: the spatial info is lost only after the deconv (Conv2DTranspose), while all the other layers still look fine with correct shape info. So we have two questions: (1) Why do we need to recalculate the shapes when exporting the saved model? (2) Why is the spatial info lost only after the deconv? This one looks like a bug.

FYI @nluehr

created time in a month

Pull request review comment tensorflow/tensorflow

Add support to cuDNN CTC loss

 tf_module {
     name: "cross"

Done. PTAL.

houtoms

comment created time in 2 months

push event houtoms/tensorflow

Kaixi Hou

commit sha 195729df0bdd087f611bc9f4b18cc769e96b9b4e

update goldens

view details

push time in 2 months

Pull request review comment tensorflow/tensorflow

Add support to cuDNN CTC loss

 from tensorflow.python.util import nest
 from tensorflow.python.util.tf_export import tf_export
+import os

Sure. Removed this unused import.

houtoms

comment created time in 2 months

Pull request review comment tensorflow/tensorflow

Add support to cuDNN CTC loss

+op {
+  graph_op_name: "CTCLossV2"

Sure. Done.

houtoms

comment created time in 2 months

push event houtoms/tensorflow

Kaixi Hou

commit sha bb87219fdf1b52ed72d84f1f33ad653164bee065

Set CTCLossV2 to visibility:HIDDEN

view details

push time in 2 months

push event houtoms/tensorflow

Kaixi Hou

commit sha f9e38a46fc7600e5f188d64b821e6bb00fdde1b5

remove unused import

view details

push time in 2 months

pull request comment tensorflow/tensorflow

Add support to cuDNN CTC loss

@sanjoy @alextp, I have replaced the previous environment variable with the implementation selector (thx @qlzh727 for helping me out with some test cases). Now we don't need the env var to control whether cuDNN is used; the runtime can automatically determine whether a GPU is available.

I added another Python function, ctc_loss_v3, to contain this new implementation; it is only available in TF2.

Please help me find the reviewers to review this part. Thx.
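
For readers following along, a rough sketch of the selector pattern under discussion, assuming TF's internal defun_with_attributes API (the same mechanism the cuDNN LSTM/GRU layers use around TF 2.1; all function names here are illustrative):

import tensorflow as tf
from tensorflow.python.eager import function

# Hypothetical stand-ins for the two preprocessing + loss paths.
def _cpu_ctc_loss(logits):
  return tf.reduce_sum(logits)  # placeholder body

def _gpu_ctc_loss(logits):
  return tf.reduce_sum(logits)  # placeholder body

def _make_selectable(func, device):
  # "api_implements" groups interchangeable implementations; the Grappler
  # implementation selector then picks the one whose "api_preferred_device"
  # matches the hardware available at runtime.
  return function.defun_with_attributes(
      func,
      attributes={"api_implements": "ctc_loss_demo",
                  "api_preferred_device": device},
      autograph=False)

cpu_impl = _make_selectable(_cpu_ctc_loss, "CPU")
gpu_impl = _make_selectable(_gpu_ctc_loss, "GPU")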

houtoms

comment created time in 2 months

push event houtoms/tensorflow

Kaixi Hou

commit sha 7ee06aa135a58d331676cf80dd97e61472fbf129

Add impl selector to remove the env var about the CUDNN CTC Loss

view details

push time in 2 months

PR closed tensorflow/tensorflow

Enable direct dilated convolution via CUDNN
Labels: cla: yes, comp:ops, size:M, stalled, stat:awaiting response

For the dilated convolution, TF will perform space-to-batch transforms and then call the non-atrous convolution kernels.

This PR removes the transforms and directly calls the atrous convolution kernels from CUDNN.

fyi. @nluehr

+140 -11

14 comments

1 changed file

houtoms

pr closed time in 2 months

pull request comment tensorflow/tensorflow

Enable direct dilated convolution via CUDNN

Since the expected performance gains didn't materialize, let's close this PR.

houtoms

comment created time in 2 months

create branch houtoms/tensorflow

branch : pr_cudnn_ctc_loss_test_impl_sel2

created branch time in 2 months

Pull request review comment tensorflow/models

Enable Persist BatchNorm in CTL ResNet50

 def run(flags_obj):
         'mixed_bfloat16')
     tf.compat.v2.keras.mixed_precision.experimental.set_policy(policy)
+  common.set_cudnn_batchnorm_mode()

Sure. Done. PTAL.

houtoms

comment created time in 2 months

push event houtoms/models

Kaixi Hou

commit sha 0d1fb4cfd0d0f69022eba356f81395a6dce3623a

Add more comments

view details

push time in 2 months

PR opened tensorflow/models

Enable Persist BatchNorm in CTL ResNet50

As in the Keras compile/fit mode (https://github.com/tensorflow/models/blob/master/official/vision/image_classification/resnet_imagenet_main.py#L62), this PR enables the persistent CUDNN batch norm by default for better performance in CTL ResNet50.

fyi @nluehr

+2 -0

0 comments

1 changed file

pr created time in 2 months

create branch houtoms/models

branch : ctl_add_persist_BN

created branch time in 2 months

pull request comment tensorflow/tensorflow

Add support to cuDNN CTC loss

@sanjoy Thanks for pointing that out. Removed that empty class. PTAL.

houtoms

comment created time in 2 months

push event houtoms/tensorflow

Kaixi Hou

commit sha ecdaf8e5a598faacd5d6f1ed7d97366d5a054437

remove empty CtcLossDescriptor in dnn.h

view details

push time in 2 months

Pull request review comment tensorflow/tensorflow

Add support to cuDNN CTC loss

 def ctc_loss_v2(labels,
       raise ValueError(
           "blank_index must be given when using SparseTensor labels.")
+    _ctc_use_cudnn = os.environ.get("TF_CUDNN_CTC_LOSS", "0")

Hi @alextp, I made an implementation selector version internally, but it didn't work. My guess is the mechanism might not work well with the loss function. I have already contacted @qlzh727 (the owner of the implementation selector).

houtoms

comment created time in 2 months

pull request comment tensorflow/tensorflow

Add support to cuDNN CTC loss

The conflict is resolved. PTAL. Thx.

houtoms

comment created time in 3 months

push event houtoms/tensorflow

Mihai Maruseac

commit sha a36c12b7c70e750ef4bb8828855c3675e9519dc9

Add hash table resource implementation into TFLite PiperOrigin-RevId: 281997222 Change-Id: I8680e454fd4de99d4a2d51ef651df97d52253f0b

view details

Davide Libenzi

commit sha 44d09fca5a853b2c543ed6c9c7f3ebb755b2de43

Reduce Shape object creations (and conversions from proto) in XlaBuilder. PiperOrigin-RevId: 281997626 Change-Id: I4fa6ce89ca42dd4a2ec451f10c2ebd9224b33867

view details

Lucy Fox

commit sha 366c6e04f57230fc554d5c7d2691ac79d305ba00

Small formatting fix in Tutorial Ch2. PiperOrigin-RevId: 281998069 Change-Id: I1cf342f204299b9fae4a73a059507e4e15cce00a

view details

Brian Zhao

commit sha f635dd19e4892f88f8b37cba8c5c604b1dd446f7

Automated g4 rollback of changelist 281835274. PiperOrigin-RevId: 281998143 Change-Id: I27c047173e3fb6dc480e03037777acf86b0b1a64

view details

Yanan Cao

commit sha 2f889d7b84128a57452138c48b9df8b9465e4b33

Support resource types in CastCompatible checks. PiperOrigin-RevId: 281999124 Change-Id: Ib3a9749114e8e5c5463c25e9f6618e4d811d1449

view details

Sean Silva

commit sha f44a805fc5b6f7d3408e9912e8d3e704df6e7dde

tf_saved_model: Disallow duplicate bound inputs. This is a useful invariant because (together with a local check that resource Value's are not passed twice in the same argument list to e.g. a called function) guarantees that resource variables don't alias in a module with tf_saved_model semantics. PiperOrigin-RevId: 282003375 Change-Id: I7ba0dbda9a6ee3c734b4503fc7f68b09b505a758

view details

Mihai Maruseac

commit sha 0332dbab7a3555c8e0f19c960afca2f7b3c6ff60

Recursively create and delete directories under POSIX modular filesystem. We also provide tests to make sure all API requirements are satisfied. Just a small sized part of work for modular filesystem plugins. For more details, consult the RFC at https://github.com/tensorflow/community/blob/master/rfcs/20190506-filesystem-plugin-modular-tensorflow.md PiperOrigin-RevId: 282004206 Change-Id: I5256fe6fabd6ac85844437833c51b27f7cf92d81

view details

Jacob Burnim

commit sha 34c7bed9f6f88b3599cbc69df8c06fd210374edb

Rollback: Avoid FindFunctionDef lookup PiperOrigin-RevId: 282004381 Change-Id: If0caa639bd059b5ced8c77b3d49e5e92aa565efe

view details

Yanan Cao

commit sha 4021a5a86e9ce03ce21df0bf6388a1ee179c665f

Run constant folding after shape inference during tf->xla legalization to maximize chance of lowering ops that have compile-time constant operand requirement. PiperOrigin-RevId: 282005507 Change-Id: I811780560268a69c8f065783cad1b091b0b2e92c

view details

Brian Zhao

commit sha 05cdd2e0ed370ed778443133b8cf06a67fc4d851

Update LLVM version, since MLIR's change https://github.com/tensorflow/tensorflow/commit/f1d30f5e8d30096951f8e2066ce74813c5519dfe breaks the build. PiperOrigin-RevId: 282007263 Change-Id: Ibc5c20139443bc52c08e81658f19445fc3979519

view details

Thomas O'Malley

commit sha f4fb3007edcf8206fe75965f0d6cc18b3d343893

Fix lazy load of training v1 in SavedModel load. PiperOrigin-RevId: 282008023 Change-Id: I66d8e0d2987c0eaef48273d2ac345c309cd80329

view details

A. Unique TensorFlower

commit sha 4b642cefe8001aa2ad8706130eece641ef1528be

Add more canonicalizations for SubViewOp. Depending on which of the offsets, sizes, or strides are constant, the subview op can be canonicalized in different ways. Add such canonicalizations, which generalize the existing approach of canonicalizing subview op only if all of offsets, sizes and shapes are constants. PiperOrigin-RevId: 282010703 Change-Id: I9d46e37d9484d34c5e2605e4351c196addb856cc

view details

Amit Patankar

commit sha 6cae11a063393fd93a2421ac3236c123de38d84e

Updated the RBE image hashes to upgrade the estimator version. PiperOrigin-RevId: 282011479 Change-Id: I2d7b2312a14be29c03b8a3b7477da20ef675d042

view details

Denis Khalikov

commit sha a2009504968511cab3445fb92bed6a2d10f10e5b

[spirv] Add a canonicalizer for `spirv::LogicalNotOp`. Add a canonicalizer for `spirv::LogicalNotOp`. Converts: * spv.LogicalNot(spv.IEqual(...)) -> spv.INotEqual(...) * spv.LogicalNot(spv.INotEqual(...)) -> spv.IEqual(...) * spv.LogicalNot(spv.LogicalEqual(...)) -> spv.LogicalNotEqual(...) * spv.LogicalNot(spv.LogicalNotEqual(...)) -> spv.LogicalEqual(...) Also moved the test for spv.IMul to arithemtic tests. Closes #256 COPYBARA_INTEGRATE_REVIEW=https://github.com/tensorflow/mlir/pull/256 from denis0x0D:sandbox/canon_logical_not 76ab5787b2c777f948c8978db061d99e76453d44 PiperOrigin-RevId: 282012356 Change-Id: I60413fae31379a55a90093b23b810309810f3725

view details

A. Unique TensorFlower

commit sha 9200d7d738e5cdaf4629a00a3191b0df171e1d47

Allow tensor-like objects in _GetNdArray Otherwise, tensor-like objects that are not instances of tf.Tensor (e.g. tf.Variable) can't be used in array assertions. PiperOrigin-RevId: 282017077 Change-Id: I6c6c250b238644a7872884ff3e0a2322443d7bb8

view details

Kaixi Hou

commit sha 0545dafacd115aa4ce8a7b8ebc9ced0f4de23bc0

Specify the params in IsCudnnSupportedFilterSize

view details

Mihai Maruseac

commit sha 545d788f774cd7eb45be903ab0fbd6ee998731fa

Copy and rename files on POSIX modular filesystem. We also provide tests to make sure all API requirements are satisfied. As this needs to support android builds and those don't have sendfile, this requires additional changes and refactorings. There is a bazel build that causes us to have only one single BUILD target on the plugin side, but we can work around it for now. I'm also moving one helper from main posix_filesystem.cc file to posix_filesystem_helper.{h,cc} as it is better to have the main plugin file only contain the entry points. Just a small sized part of work for modular filesystem plugins. For more details, consult the RFC at https://github.com/tensorflow/community/blob/master/rfcs/20190506-filesystem-plugin-modular-tensorflow.md PiperOrigin-RevId: 282028560 Change-Id: I4431d91751a156279f21089e98e6ac0b6cbbacc0

view details

Henry Tan

commit sha 0861a5c0a9de1e270310119b1f6e7f5595c8637d

Add compiler/xla/python/tpu_driver/... as a presubmit build, to proactively catch any build issue. PiperOrigin-RevId: 282031293 Change-Id: I59d2f85a7cf53dba9dd5d15a12b6a2b7e95d6f71

view details

Mihai Maruseac

commit sha 7499cc4974ebf2aee567a76aaa21ae1f0f6e73ac

Last changes for modular POSIX filesystem: GetMatchingPaths and FlushCaches. We also provide tests to make sure all API requirements are satisfied. Just a small sized part of work for modular filesystem plugins. For more details, consult the RFC at https://github.com/tensorflow/community/blob/master/rfcs/20190506-filesystem-plugin-modular-tensorflow.md PiperOrigin-RevId: 282032540 Change-Id: I7c3615d044fc1ebf3c943150c5be9a4feff840c8

view details

Guangda Lai

commit sha e840aa5e286a5b57aeb92560a6374bf31f22fe99

Remove device memory check, since it's incorrect when the pointer is pointing to pinned host memory. Also, memcpy would fail if the pointer is invalid, so we don't need an additional check. Added a test for pinned host memory. PiperOrigin-RevId: 282036798 Change-Id: I6c0aab79a0e1ec1df9e2010e461d2ad8af8a1703

view details

push time in 3 months

issue comment tensorflow/tensorflow

Masking LSTM: OP_REQUIRES failed at cudnn_rnn_ops.cc:1498 : Unknown: CUDNN_STATUS_BAD_PARAM

@jalilasadi, I reproduced the error using mimxrt's Python script posted above on Oct 10. The script works fine in both eager and graph mode if the Masking layer is removed, and breaks in both modes if the Masking layer is added.

mimxrt

comment created time in 3 months

issue comment tensorflow/tensorflow

Masking LSTM: OP_REQUIRES failed at cudnn_rnn_ops.cc:1498 : Unknown: CUDNN_STATUS_BAD_PARAM

Thanks for the feedback.

Yes, I think the lack of support for fully masked sequences in cuDNN is the major issue. For now cuDNN requires that "Each element in seqLengthArray must be greater than 0 but less than or equal to maxSeqLength" (https://docs.nvidia.com/deeplearning/sdk/cudnn-api/index.html#cudnnSetRNNDataDescriptor).

mimxrt

comment created time in 3 months

pull request comment tensorflow/tensorflow

Add support to cuDNN CTC loss

@sanjoy Yes, I went through the code again, and the empty CtcLossDescriptor class in dnn.h seems unnecessary. Removed it. PTAL.

houtoms

comment created time in 3 months

push event houtoms/tensorflow

Kaixi Hou

commit sha 0681377aabe0fdbcfdaca280e95659eab39dbf45

Remove the empty CtcLossDescriptor

view details

push time in 3 months

pull request comment tensorflow/tensorflow

Add support to cuDNN CTC loss

@sanjoy Done. PTAL.

houtoms

comment created time in 3 months

push event houtoms/tensorflow

Kaixi Hou

commit sha cb7e008708f207d540f5f52d341d894e9eb75c26

Avoid to register the ctc loss kernel when cudnn is older than 7.6.3

view details

push time in 3 months

pull request comment tensorflow/tensorflow

Add support to cuDNN CTC loss

> If you have a use case for always registering the kernel then I'm happy as well.

Yeah, we could do as you suggested, but I think it would bring more problems. What about the REGISTER_OP("CTCLossV2") in tensorflow/core/ops/ctc_ops.cc? And what happens in Python when we call gen_ctc_ops.ctc_loss_v2? Do we also need to protect them? It sounds like every time we have a new op, we would need to protect all the related code, from the front-end Python API and the kernel definition all the way to the stream executor. (I think that is why for the RNN we only have the macro-protected code in the stream executor.)

Also, I have never seen any example that protects the register-op code using CUDNN_VERSION in tensorflow/core/kernels (please correct me if I am wrong).

houtoms

comment created time in 3 months

pull request comment tensorflow/tensorflow

Add support to cuDNN CTC loss

Do you think we need to add cudnn version checking macros to protect this?

On Thu, Dec 5, 2019 at 12:25 PM Sanjoy Das notifications@github.com wrote:

> @sanjoy https://github.com/sanjoy Thx for the review. More changes are made based on your comments. PTAL.
>
> Should we register the kernel even if cuDNN is older than 7.6.3 (and the kernel is guaranteed to fail)?
>
> If users explicitly set the env var TF_CUDNN_CTC_LOSS and their cuDNN is < 7.6.3, it will error out from the stream executor during runtime (as we did for the RNN: if users set the variable sequence length params and their cuDNN is < 7.2.1, we will error out. https://github.com/tensorflow/tensorflow/blob/master/tensorflow/stream_executor/cuda/cuda_dnn.cc#L1729 ).
>
> Let me rephrase: what is the use case for even registering a kernel that will always fail during runtime? Since that's what will happen if cuDNN is older than 7.6.3, right?


houtoms

comment created time in 3 months

pull request comment tensorflow/tensorflow

Add support to cuDNN CTC loss

@sanjoy Thx for the review. More changes are made based on your comments. PTAL.

> Should we register the kernel even if cuDNN is older than 7.6.3 (and the kernel is guaranteed to fail)?

If users explicitly set the env var TF_CUDNN_CTC_LOSS and their cuDNN is < 7.6.3, it will error out from the stream executor during runtime (as we did for the RNN: if users set the variable sequence length params and their cuDNN is < 7.2.1, we will error out: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/stream_executor/cuda/cuda_dnn.cc#L1729).

houtoms

comment created time in 3 months

push event houtoms/tensorflow

Kaixi Hou

commit sha 4a89f04615602478e0f69618d53c329bdbd48725

Use DnnScratchAllocator

view details

Kaixi Hou

commit sha cbf169cb4dd1d7980253a4917559e2a8e54fd941

Variables init and decl in one line; check attr in constructor; check bounds before converting int64 to int; and other minor changes

view details

Kaixi Hou

commit sha 9966ed2814c02e7936a06f6a45e4f1dae628b994

Remove empty lines

view details

push time in 3 months

issue comment tensorflow/models

ResNet 50 scaling problem in CTL mode of TF2.0 container

@nnigania My observation is that compile/fit uses pinned memory (dark green in my comment above) with overlapped compute, while the CTL uses pageable memory (light green in my original issue description) with serialized data-copy blocks.
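
One hedged mitigation sketch for the CTL side (whether this restores the compile/fit-style overlap here is an assumption): stage batches onto the GPU with tf.data so the host-to-device copies run ahead of compute.

import tensorflow as tf

# Toy dataset standing in for the real input pipeline.
dataset = tf.data.Dataset.from_tensor_slices(tf.zeros([64, 224, 224, 3]))
dataset = dataset.batch(8)
# Prefetch batches directly into GPU memory so copies overlap with compute.
dataset = dataset.apply(tf.data.experimental.prefetch_to_device("/gpu:0"))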

houtoms

comment created time in 3 months

pull request comment tensorflow/tensorflow

Add support to cuDNN CTC loss

@alextp

> Why do we need an environment variable?

The CPU/GPU preprocessing for the CTC loss is different, so I proposed the environment variable to let users decide which one to use.

> Why can't we treat this the same way as we treat cudnn lstm (using the implementation selector API)?

Yes, I am also thinking of using the implementation selector you pointed to. We can define something like:

func1 = preprocessing1 + CPU_CTC_loss
func2 = preprocessing2 + GPU_CTC_loss
call func1 and register func2

@qlzh727 Please correct me if I am wrong.
houtoms

comment created time in 3 months

pull request comment tensorflow/tensorflow

Add support to cuDNN CTC loss

@timshen91 Thx for your comments. More changes are made accordingly. PTAL.

houtoms

comment created time in 3 months

push event houtoms/tensorflow

Kaixi Hou

commit sha 0b9feecc74c4f6b7ad72de83de090e2805540160

Move the CtcLossDescriptor constructor/destructor back to the header Surface the scratch memory allocation to the ThenCtcLoss() Use the absl::Span as a pointer

view details

push time in 3 months

Pull request review comment tensorflow/tensorflow

Support CUDNN depthwise convolution

 class DepthwiseConv2dNativeOp : public BinaryOp<T> {
     // For in_depth == 1 and grouped convolutions.
     use_cudnn_ = CanUseCudnn() && std::is_same<Device, GPUDevice>::value;
     cudnn_use_autotune_ = CudnnUseAutotune();
-    use_cudnn_grouped_conv_ = false;
     dtype_ = DataTypeToEnum<T>::value;
+    // Use CuDNN grouped conv only when input/output is NCHW and float16(half).
+    // See cudnn release note 7.6.3. (https://docs.nvidia.com/deeplearning/sdk/c
+    // udnn-release-notes/rel_763.html#rel_763)
+#ifdef CUDNN_VERSION >= 7603

You are right. Fixed.

houtoms

comment created time in 3 months

push event houtoms/tensorflow

Kaixi Hou

commit sha ef1df28aab09a7eb9d35b92f8addefdbcdea0a73

Change ifdef to if

view details

push time in 3 months

pull request comment tensorflow/tensorflow

Support CUDNN depthwise convolution

The CPU tests fail in compilation, and I think I need to put CUDNN_VERSION back behind macros, since the CPU build doesn't include cudnn.h and cannot recognize CUDNN_VERSION. This commit should fix those tests. @sanjoy

houtoms

comment created time in 3 months

push event houtoms/tensorflow

Kaixi Hou

commit sha 0ef3a5f9f2cc9b7dfff86b0163638efc09bd779c

Put CUDNN_VERSION to macros

view details

push time in 3 months

Pull request review comment tensorflow/tensorflow

Support CUDNN depthwise convolution

 class DepthwiseConv2dNativeBackpropInputOp : public OpKernel {
     // If in_depth==1, this operation is just a standard convolution.
     // Depthwise convolution is a special case of cuDNN's grouped convolution.
     bool use_cudnn = use_cudnn_ && (in_depth == 1 || (use_cudnn_grouped_conv_ &&
-        IsCudnnSupportedFilterSize(filter_rows, filter_cols)));
+        IsCudnnSupportedFilterSize(filter_rows, filter_cols, in_depth,
+                                   out_depth)));

Sure. Done. PTAL.

houtoms

comment created time in 3 months

push event houtoms/tensorflow

Kaixi Hou

commit sha 0545dafacd115aa4ce8a7b8ebc9ced0f4de23bc0

Specify the params in IsCudnnSupportedFilterSize

view details

push time in 3 months

pull request comment tensorflow/tensorflow

Support CUDNN depthwise convolution

@sanjoy I just got some feedback from cuDNN: the cuDNN depthwise convolution actually assumes the multiplier is one. So we need to make sure in_depth == out_depth to trigger the fast cuDNN depthwise conv path. I've made further changes, PTAL.

houtoms

comment created time in 3 months

push event houtoms/tensorflow

Kaixi Hou

commit sha e0635fb242a856fbe08662f9b0b3a3037b2801a7

Limit the multiplier to be 1 to trigger fast cuDNN depthwise conv path

view details

push time in 3 months

Pull request review comment tensorflow/tensorflow

Support CUDNN depthwise convolution

 FP16ConvMode CudnnConvComputeMode();
 bool DebugCudnnRnn();
 bool DebugCudnnRnnUseTensorOps();
 int64 DebugCudnnRnnAlgo();
+
+// Return true if the CuDNN depthwise convolution can be used. See cudnn release

Done. PTAL.

houtoms

comment created time in 3 months

push event houtoms/tensorflow

Kaixi Hou

commit sha 8c033eb3cb4cec1439127c777fe2e3eb4d453ca2

Return -> Returns

view details

push time in 3 months

push event houtoms/tensorflow

Kaixi Hou

commit sha 4fd5783f4f0a34140b776fb9653ca02d65c68cf9

update comments

view details

push time in 3 months

Pull request review comment tensorflow/tensorflow

Support CUDNN depthwise convolution

 FP16ConvMode CudnnConvComputeMode();
 bool DebugCudnnRnn();
 bool DebugCudnnRnnUseTensorOps();
 int64 DebugCudnnRnnAlgo();
+
+// Return true if the filter is 1x1, 3x3, 5x5 or 7x7. This function is used

Done. PTAL.

houtoms

comment created time in 3 months

Pull request review comment tensorflow/tensorflow

Support CUDNN depthwise convolution

 class DepthwiseConv2dNativeOp : public BinaryOp<T> {
     // TODO(csigg): Have autotune decide if native is faster than cuDNN.
     // If in_depth==1, this operation is just a standard convolution.
     // Depthwise convolution is a special case of cuDNN's grouped convolution.
-    bool use_cudnn = use_cudnn_ && (in_depth == 1 || use_cudnn_grouped_conv_);
+    // Use CuDNN grouped conv when filter is 1x1, 3x3, 5x5, 7x7.
+    // See cudnn release note 7.6.3. (https://docs.nvidia.com/deeplearning/sdk/c
+    // udnn-release-notes/rel_763.html#rel_763)
+    bool use_cudnn = use_cudnn_ && (in_depth == 1 || (use_cudnn_grouped_conv_ &&
+        filter_rows == filter_cols && (filter_rows == 1 || filter_rows == 3 ||
+        filter_rows == 5 || filter_rows == 7)));

Thanks for pointing out. Done. PTAL.

houtoms

comment created time in 3 months

push event houtoms/tensorflow

Kaixi Hou

commit sha 80813e6ed87db3386157937d6fe66d6bf8dcaab6

Update comments for the function IsCudnnSupportedFilterSize

view details

push time in 3 months

pull request comment tensorflow/tensorflow

Support CUDNN depthwise convolution

@sanjoy PTAL.

houtoms

comment created time in 3 months

push event houtoms/tensorflow

Kaixi Hou

commit sha c290d248ee6d4971a888581903625fcf4a761034

Put the filter checking functions IsCudnnSupportedFilterSize

view details

push time in 3 months

issue comment tensorflow/models

ResNet 50 scaling problem in CTL mode of TF2.0 container

Just profiled the compile/fit mode: the data copy is from pinned memory, and I can see it overlap with the compute kernels. But in the CTL mode, the data copy is from pageable memory; do you have any idea why? Thx.

houtoms

comment created time in 3 months

Pull request review comment tensorflow/tensorflow

Support CUDNN depthwise convolution

 class DepthwiseConv2dNativeOp : public BinaryOp<T> {
     // TODO(csigg): Have autotune decide if native is faster than cuDNN.
     // If in_depth==1, this operation is just a standard convolution.
     // Depthwise convolution is a special case of cuDNN's grouped convolution.
-    bool use_cudnn = use_cudnn_ && (in_depth == 1 || use_cudnn_grouped_conv_);
+    // Use CuDNN grouped conv when filter is 1x1, 3x3, 5x5, 7x7.
+    // See cudnn release note 7.6.3. (https://docs.nvidia.com/deeplearning/sdk/c
+    // udnn-release-notes/rel_763.html#rel_763)
+    bool use_cudnn = use_cudnn_ && (in_depth == 1 || (use_cudnn_grouped_conv_ &&
+        filter_rows == filter_cols && (filter_rows == 1 || filter_rows == 3 ||
+        filter_rows == 5 || filter_rows == 7)));

Do you think we need to put this function into tensorflow/core/util/use_cudnn.h so that both depthwise_conv_op.cc and depthwise_conv_grad_op.cc can share it? Or should we just duplicate it in both files?

houtoms

comment created time in 3 months

issue comment tensorflow/tensorflow

Masking LSTM: OP_REQUIRES failed at cudnn_rnn_ops.cc:1498 : Unknown: CUDNN_STATUS_BAD_PARAM

Got some feedback from cuDNN: it sounds like it is not complicated to do. I have already filed a request with them. For now, users still have to make sure they don't have such fully masked sequences in a batch.

mimxrt

comment created time in 3 months

issue comment tensorflow/models

ResNet 50 scaling problem in CTL mode of TF2.0 container

I recently ran the compile/fit mode and can get ~7500 imgs/sec in TF2 containers. But still, with the nightly build (TF2.1), the perf drops to ~6500 imgs/sec. Anyway, both are much better than the CTL mode.

houtoms

comment created time in 3 months

push event houtoms/tensorflow

Kaixi Hou

commit sha 66eed745e48c7c4b78f08bb857bb1b34f02b727e

Fixed a logic issue

view details

push time in 3 months

pull request comment tensorflow/tensorflow

Support CUDNN depthwise convolution

@sanjoy Yes, I think we need to put those filter-dim restrictions into the conditions; it makes things clearer. The code is updated. PTAL.

Also, for the autotune part: TF_CUDNN_USE_AUTOTUNE controls the cuDNN autotuning and is set to true by default. You can see it turn on the autotune here: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/kernels/depthwise_conv_op.cc#L298.

houtoms

comment created time in 3 months

push event houtoms/tensorflow

Kaixi Hou

commit sha 1bc1c1da4023a98e1a77379464672b2c448d8c6f

Put restriction over the filter dims for depthwise convolutions

view details

push time in 3 months

issue comment tensorflow/models

ResNet 50 scaling problem in CTL mode of TF2.0 container

@zongweiz Thx. I just tried these two options, but they only improve the perf by ~200 imgs/sec for 8 GPUs.

@saberkun I think I am using the current master; I can see that the lines you referred to have already been deleted. Or do you mean I need to add them back?

For the container, I have also tried the nightly build of TF2.1, and the scaling becomes better: from 2519.26 imgs/sec (4 GPUs) to 4539.29 imgs/sec (8 GPUs). But I think we expect the perf to be around 6000 to 7000 for 8 GPUs.

houtoms

comment created time in 3 months

issue opened tensorflow/models

ResNet 50 scaling problem in CTL mode of TF2.0 container

System information

  • What is the top-level directory of the model you are using: official/vision/image_classification
  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): No
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 16.04
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version (use command below): TF2.0
  • Bazel version (if compiling from source):
  • CUDA/cuDNN version: CUDA 10.0 cuDNN 7.5
  • GPU model and memory: DGX1V100 8 GPUs 32GB
  • Exact command to reproduce:

Pull the container of tensorflow/tensorflow:2.0.0-gpu-py3

python -u resnet_ctl_imagenet_main.py --model_dir=Y5tmUO.dir \
--data_dir=/data/imagenet/train-val-tfrecord-480 \
--batch_size=2048 --num_gpus=8 --dtype=fp16 \
--train_epochs=1 --train_steps=400 --use_synthetic_data=false \
--enable_tensorboard=false --single_l2_loss_op=true --use_tf_function=true 

Describe the problem

The performance of ResNet 50 in CTL mode scales poorly to 8 GPUs. 4 GPUs get ~2100 imgs/sec, while 8 GPUs only achieve ~2900 imgs/sec for fp16 inputs, close to the 4-GPU number.

After some digging, I found the data loading could be the problem: it gets serialized among the GPUs. One thread may be handling data loading for multiple GPUs, and for some reason it issues the next load for another GPU only after the current load finishes. A snapshot of the profile is below; the green parts are data loading, the blue ones are kernel computations. In comparison, the 4-GPU case has no such problem.

Source code / logs

8 GPUs: [profiler snapshot]

4 GPUs: [profiler snapshot]

created time in 3 months

issue comment tensorflow/tensorflow

Masking LSTM: OP_REQUIRES failed at cudnn_rnn_ops.cc:1498 : Unknown: CUDNN_STATUS_BAD_PARAM

Yes, I agree with you. I am contacting the cuDNN team about this issue and will update when I get any feedback.

mimxrt

comment created time in 3 months

issue comment tensorflow/tensorflow

Masking LSTM: OP_REQUIRES failed at cudnn_rnn_ops.cc:1498 : Unknown: CUDNN_STATUS_BAD_PARAM

Yes, your understanding is correct. And cuDNN doesn't support zero-length sequences in a batch for now.

I am not sure whether (or how) it is possible to change the batch size during training using tf.data (@qlzh727, some tf.data experts?). Batch size 1 could work, but it might significantly affect the performance.

Or you could change the way you split your sequence into the batch, like using seq_len/batch_size as the current 'num_tsteps' rather than the fixed 144. But you still need to make sure the minimum seq_len >= batch_size.

mimxrt

comment created time in 3 months

Pull request review comment tensorflow/tensorflow

Support CUDNN depthwise convolution

 class DepthwiseConv2dNativeBackpropFilterOp : public OpKernel {
     } else {
       LOG(ERROR) << "Only half, float, and double are supported.";
     }
+    // Use CuDNN grouped conv (filter gradients) when input/output is
+    // float16(half). See cudnn release note 7.6.3. (https://docs.nvidia.com/dee
+    // plearning/sdk/cudnn-release-notes/rel_763.html#rel_763)
+    use_cudnn_grouped_conv_ = CUDNN_VERSION >= 7603 && dtype_ == DT_HALF;

I think you are referring to the grouped conv, but this PR only focuses on the depthwise conv. For the depthwise conv, we check the input layout, precision, and stride; see below.

> Improved depth-wise convolution for forward, dgrad, and wgrad under the following conditions:
> - Algorithm is algo1
> - Tensor format for filter is NCHW (wgrad supports NHWC also)
> - Input and outputs are in FP16 and computation is in FP32
> - Filter size: 1x1, 3x3, 5x5, 7x7 (dgrad only supports stride 1)
> - Math type is CUDNN_DEFAULT_MATH
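
As an illustration (not code from the PR), a layer configuration matching the quoted conditions: fp16 in/out, NCHW layout, 3x3 filter, stride 1, multiplier 1. Whether cuDNN is actually selected still depends on the TF build and autotuning.

import tensorflow as tf

x = tf.random.normal([8, 32, 56, 56], dtype=tf.float16)  # NCHW layout
layer = tf.keras.layers.DepthwiseConv2D(
    kernel_size=3,                 # one of the supported sizes: 1, 3, 5, 7
    strides=1,                     # the dgrad fast path needs stride 1
    padding="same",
    depth_multiplier=1,            # in_depth == out_depth
    data_format="channels_first",  # NCHW
    dtype=tf.float16)              # fp16 in/out, fp32 accumulation
y = layer(x)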
houtoms

comment created time in 3 months

issue comment tensorflow/tensorflow

Masking LSTM: OP_REQUIRES failed at cudnn_rnn_ops.cc:1498 : Unknown: CUDNN_STATUS_BAD_PARAM

Yes, I can repro the error. It turns out cudnnSetRNNDataDescriptor gives the BAD_PARAM error. From the log, I can see you want to process the input data as 256 (batch size) x 144 (time steps) x 130 (unit size). But when using masks, I observe that some entire sequences are masked out. For example, there could be only 25 meaningful sequences (having at least one time step) out of the 256 sequences in one batch, and in this case it seems cuDNN requires the batch size to be 25. @mimxrt Can you first confirm the scenario I mentioned is what you want to do?
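
A minimal sketch of that scenario (hypothetical batch of 4 instead of 256, same 144x130 layout; the second sequence is fully masked):

import numpy as np
import tensorflow as tf

x = np.random.rand(4, 144, 130).astype(np.float32)
x[1] = 0.0  # one sequence entirely masked out -> sequence length 0

model = tf.keras.Sequential([
    tf.keras.layers.Masking(mask_value=0.0, input_shape=(144, 130)),
    tf.keras.layers.LSTM(32),  # takes the cuDNN path on GPU in TF2
    tf.keras.layers.Dense(1),
])
model.compile(loss="mse", optimizer="adam")
# On a GPU build this is expected to fail with CUDNN_STATUS_BAD_PARAM,
# since cuDNN requires every sequence length to be greater than zero.
model.fit(x, np.zeros((4, 1)), epochs=1, verbose=0)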

mimxrt

comment created time in 3 months

push event houtoms/tensorflow

Kaixi Hou

commit sha deecd42ac78da6df22922ccf2908938fa3fe2372

Modified the macros to compile with old cudnn version

view details

push time in 3 months

pull request comment tensorflow/tensorflow

Add support to cuDNN CTC loss

@timshen91 Thanks for your clarification. I followed the convolution example and further simplified the code. PTAL. Thx.

houtoms

comment created time in 3 months

push event houtoms/tensorflow

Kaixi Hou

commit sha 33d4b5a31927fef2efc3de961b35b722f763d985

Formatting

view details

push time in 3 months

push event houtoms/tensorflow

Kaixi Hou

commit sha 4ef99df1d67b5ed6d579cc8973b2af3187fe4591

Simplified the CtcLossDescriptor

view details

Kaixi Hou

commit sha 0ae92149a3ca7172a505c3d5dd798f58b0d673e0

Added ElementType and DeviceMemoryBase for CTC Loss

view details

push time in 3 months

Pull request review comment tensorflow/tensorflow

Add support to cuDNN CTC loss

 class RnnDescriptor {
   virtual ParamsRegions ParamsBiasRegions() const { return ParamsRegions(); }
 };
+// Specifies the CTC Loss computation.
+//
+// The user is responsible for releasing this descriptor when it is no longer
+// in use. The destructor releases the underlying descriptors.
+class CtcLossDescriptor {

I cannot follow the point here. I am following the style of the RnnDescriptor example above. My understanding is that we have a device-agnostic base class XDescriptor in dnn.h, and then a specific CudnnXDescriptor inheriting from it in cuda_dnn.h/cc. Many other descriptors, like RnnDescriptor or RnnSequenceTensorDescriptor, do this.

houtoms

comment created time in 3 months

Pull request review comment tensorflow/tensorflow

Add support to cuDNN CTC loss

 class DnnSupport {
     return false;
   }
+
+  // Enqueue a CTC Loss operation onto the stream.
+  //
+  // Arguments:
+  //  stream: pointer to the stream where this operation should be enqueued to.
+  //  probs_desc: specifies the shape and the data layout of the input tensor.
+  //  probs_data: the device memory region that contains the input tensor.
+  //  labels_data: the device memory region that contains the labels_value
+  //    tensor.
+  //  labels_lengths_data: the device memory region that contains the
+  //    labels_lengths tensor
+  //  input_lengths_data: the device memory region that contains the seq_lengths
+  //    tensor
+  //  costs_data: the device memory region that contains the costs tensor.
+  //  grads_desc: specifies the shape and the data layout of the grads tensor.
+  //  grads_data: the device memory region that contains the grads tensor.
+  //  ctc_loss_desc: a CTCLoss descriptor created by createCTCLossDescriptor.
+  //  workspace_allocator: a memory allocator that creates the temporary
+  //    workspace memory used by this operation. The caller is responsible for
+  //    keeping the memory alive long enough for this operation, and recylces
+  //    afterwards.
+  virtual bool DoCtcLoss(Stream* stream,

For the scratch memory allocation, I follow the approach used in DoBatchNormalizationForward/Backward and DoRNNForward/Backward. Shouldn't it be consistent with them? Do I need to generalize the element type in this PR, considering the current cuDNN only supports float?

houtoms

comment created time in 3 months

pull request comment tensorflow/tensorflow

Support CUDNN depthwise convolution

> Any updates on this? Fast CUDNN depthwise convolutions would be a great improvement for TF users.

Thx for the comment. We are working on it. For the auto mixed precision support, we might still need to wait for more universal support from cuDNN, and then we can enable it in AMP.

houtoms

comment created time in 3 months

pull request comment tensorflow/tensorflow

Support CUDNN depthwise convolution

@sanjoy Tx for the review. More changes are made. PTAL.

houtoms

comment created time in 3 months

push event houtoms/tensorflow

Kaixi Hou

commit sha 4b0e5c1e8ee90362c0bb2083b61eed99eb651122

Simplified the branches

view details

push time in 3 months

pull request comment tensorflow/tensorflow

Add support to cuDNN CTC loss

> Can you address this comment from before?
>
> This should replace the existing ctc loss for GPUs, right?

Yes, if the environment variable TF_CUDNN_CTC_LOSS is defined. The reason I cannot make it the default backend is that the CPU and GPU require different data transposes before calling the CTC compute.
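
For anyone trying this PR as originally written, a minimal sketch of opting in (the variable name is taken from the diff above; it is read when the loss op is built, so set it first):

import os
import tensorflow as tf

# ctc_loss_v2 checks TF_CUDNN_CTC_LOSS at graph-construction time, so set
# the variable before constructing the loss op.
os.environ["TF_CUDNN_CTC_LOSS"] = "1"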

houtoms

comment created time in 3 months

pull request comment tensorflow/tensorflow

Add support to cuDNN CTC loss

@rmlarsen Thx for the comments. I have made several changes. PTAL.

houtoms

comment created time in 4 months

push event houtoms/tensorflow

Kaixi Hou

commit sha 46aa1ca2206cb792a6c7c42a70597272881e71a1

Put the reusable class CudnnAllocatorInTemp to a separate file

view details

push time in 4 months

Pull request review comment tensorflow/tensorflow

Add support to cuDNN CTC loss

 REGISTER_OP("CTCLoss")       return Status::OK();     }); +REGISTER_OP("CTCLossV2")+    .Input("inputs: float")

No, cuDNN only supports the float CTCLoss.

houtoms

comment created time in 4 months

Pull request review comment tensorflow/tensorflow

Add support to cuDNN CTC loss

 struct PersistentRnnPlanDeleter {
     CHECK_CUDNN_OK(cudnnDestroyPersistentRNNPlan(plan));
   }
 };
+#if CUDNN_VERSION >= 7601

But I think all the cudnn calls are inside this cuda_dnn.cc. I haven't seen any example that calls cudnn directly from the kernel.

houtoms

comment created time in 4 months

PR opened tensorflow/tensorflow

Support CUDNN depthwise convolution

This PR adds the CUDNN depthwise convolution as the default implementations of DepthwiseConv2dNative, DepthwiseConv2dNativeBackpropInput, DepthwiseConv2dNativeBackpropFilter.

cuDNN 7.6.3 improves the performance of the depthwise convolution in some cases. Details at: https://docs.nvidia.com/deeplearning/sdk/cudnn-release-notes/rel_763.html#rel_763

@nluehr

+94 -3

0 comments

3 changed files

pr created time in 4 months

create branch houtoms/tensorflow

branch : support-cudnn-depconv_2

created branch time in 4 months
