
google/dimsum 131

Portable C++ SIMD library

google/nvidia_libs_test 21

Tests and benchmarks for cudnn (and in the future, other nvidia libraries)

timshen91/Evil 6

Yet another Scheme interpreter

timshen91/biast 3

a totally redesigned blog engine.

timshen91/my_malloc 1

not efficient

push event llvm/llvm-project

Tim Shen

commit sha b762bbd4c86806095a11dbe4d594059bd3fd5bc5

[MLIR] change NVVM.mma.sync to the most useful variant.

Summary: the .row.col variant turns out to be the popular one, contrary to what I expected (.row.row). Since .row.col is so prevalent (judging from cuDNN's behavior), I'm going to remove the .row.row support here, which makes the patch a little easier.

Reviewers: ftynse

Subscribers: jholewinski, bixia, sanjoy.google, mehdi_amini, rriddle, jpienaar, burmako, shauheen, antiagainst, nicolasvasilache, arpith-jacob, mgester, lucyrfox, liufengdb, Joonsoo, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D74655

view details

push time in 9 days

push event llvm/llvm-project

Tim Shen

commit sha f581e655ec3f34dcd704ffc9586bfb615a459942

[MLIR] Add std.assume_alignment op.

Reviewers: ftynse, nicolasvasilache, andydavis1

Subscribers: bixia, sanjoy.google, mehdi_amini, rriddle, jpienaar, burmako, shauheen, antiagainst, arpith-jacob, mgester, lucyrfox, aartbik, liufengdb, Joonsoo, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D74378

view details

push time in 9 days

push event llvm/llvm-project

Tim Shen

commit sha 3ccaac3cdd8f1e5b19e2da04be2ebbfc1fb9aa32

[mlir] Add MemRefTypeBuilder and refactor some MemRefType::get().

The refactored MemRefType::get() calls all intend to clone from another memref type, with some modifications. In fact, some calls dropped the memory space during the cloning. Migrate them to the cloning API so that fields that are not explicitly listed do not get dropped. It's close to NFC but not quite, as it helps with propagating memory spaces in some places.

Differential Revision: https://reviews.llvm.org/D73296
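For illustration only, here is the pattern the commit describes, sketched as a generic builder rather than MLIR's actual API (all type and method names below are made up for the example): cloning through a builder keeps fields, such as the memory space, that a from-scratch get() call can silently drop.

#include <cstdint>
#include <utility>
#include <vector>

// Generic stand-in for a memref-like type; not MLIR code.
struct MemRefInfo {
  std::vector<int64_t> shape;
  int element_type = 0;
  unsigned memory_space = 0;  // the field that is easy to forget and drop
};

// Clone-with-modifications builder: start from an existing value and override
// only the fields that change.
class MemRefInfoBuilder {
 public:
  explicit MemRefInfoBuilder(MemRefInfo other) : info_(std::move(other)) {}
  MemRefInfoBuilder& setShape(std::vector<int64_t> shape) {
    info_.shape = std::move(shape);
    return *this;
  }
  operator MemRefInfo() const { return info_; }

 private:
  MemRefInfo info_;
};

// Only the shape changes; the element type and memory space carry over instead
// of being reset by a from-scratch construction.
MemRefInfo WithNewShape(const MemRefInfo& src, std::vector<int64_t> shape) {
  return MemRefInfoBuilder(src).setShape(std::move(shape));
}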

view details

push time in a month

push event llvm/llvm-project

Tim Shen

commit sha 7b771ed448487705237868f705da17b40c6bfe82

[APInt] Fix tests that had wrong assumption about sdivs with negative quotient.

Reviewers: sanjoy

Subscribers: bixia, dexonsmith, sanjoy.google, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D70156
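To make the fixed assumption concrete, here is a minimal check of the rounding behavior (my own example, not taken from the patch): APInt's sdiv truncates toward zero, so a negative quotient rounds toward zero rather than toward negative infinity.

#include "llvm/ADT/APInt.h"
#include <cassert>

// Example only: -7 / 2 yields -3 under truncation toward zero, not the -4
// that a floor-division assumption would expect.
void SdivNegativeQuotientExample() {
  llvm::APInt a(/*numBits=*/32, /*val=*/-7, /*isSigned=*/true);
  llvm::APInt b(/*numBits=*/32, /*val=*/2, /*isSigned=*/true);
  assert(a.sdiv(b).getSExtValue() == -3);
}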

view details

push time in a month

pull request comment tensorflow/tensorflow

Replace --xla_gpu_disable_autotune option with --xla_gpu_autotune_level

Can someone review this PR please?

Done. Sorry for the delay!

bas-aarts

comment created time in a month

Pull request review comment tensorflow/tensorflow

Replace --xla_gpu_disable_autotune option with --xla_gpu_autotune_level

 GpuConvAlgorithmPicker::PickBestAlgorithmNoCacheCuda(
   const Shape& result_shape = instr->shape().tuple_shapes(0);
   int64 rng_state = 0;
-  const auto initialize_buffer = [&stream, &rng_state](
+  const HloModuleConfig& hlo_module_config = instr->GetModule()->config();
+  int32 conv_level = hlo_module_config.debug_options().xla_gpu_autotune_level();

Please add const to these variables.

bas-aarts

comment created time in 2 months

Pull request review comment tensorflow/tensorflow

Replace --xla_gpu_disable_autotune option with --xla_gpu_autotune_level

 GpuConvAlgorithmPicker::PickBestAlgorithmNoCacheCuda(
   const Shape& result_shape = instr->shape().tuple_shapes(0);
   int64 rng_state = 0;
-  const auto initialize_buffer = [&stream, &rng_state](
+  const HloModuleConfig& hlo_module_config = instr->GetModule()->config();
+  int32 conv_level = hlo_module_config.debug_options().xla_gpu_autotune_level();

autotune_level
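Taken together, the two review comments above amount to something like the following (a sketch of the requested change, not the PR's final code):

const HloModuleConfig& hlo_module_config = instr->GetModule()->config();
const int32 autotune_level =
    hlo_module_config.debug_options().xla_gpu_autotune_level();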

bas-aarts

comment created time in 2 months

Pull request review comment tensorflow/tensorflow

Add multi-algorithm deterministic cuDNN convolutions

 limitations under the License.
 #include "tensorflow/stream_executor/gpu/gpu_helpers.h"
+namespace stream_executor {
+namespace cuda {
+
+// A helper function to decide whether to enable deterministic cuDNN
+// functionality.
+bool RequireCuDNNDeterminism();

If you're willing to take it, and it seems that you are, then I would much prefer to have a bug fix in place immediately, by querying the env var in multiple places (as specified by the current commits).

Yes, this is what I meant.

Please will you clarify what you mean by comments describing the migration path? Do you mean adding comments explaining the intention to migrate to tf.config.experimental.deterministic_ops and associated plumbing? If so, I would gladly add that.

Yes, plus some warning wording like "this is a temporary solution".

Also, are you happy with having the code defined by RequireCudnnDeterminism() replicated in three different places in the codebase, or would you prefer, as I would, for it to be defined in one place, such as tensorflow/core/common_runtime/gpu:gpu_determinism? Refactoring that would be an easy and quick change to make.

I prefer to duplicate them in several places, each with the comment we talked about above, plus "this code is duplicated, and should be kept in sync with [other files]". I don't expect frequent changes to these duplicates, so keeping them in sync shouldn't be too much work.
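A minimal sketch of what one duplicated copy might look like, assuming the helper keeps the name used in this discussion and reads the env vars through tensorflow::ReadBoolFromEnvVar (the exact wiring here is my assumption, not the PR's actual code):

#include "tensorflow/core/util/env_var.h"

// TEMPORARY: this is a stopgap until determinism is plumbed through
// tf.config.experimental; see the PR discussion.
// DUPLICATED: this helper is intentionally copied into the other files that
// need it and should be kept in sync with them.
bool RequireCudnnDeterminism() {
  bool deterministic_ops = false;
  bool cudnn_deterministic = false;
  // Error handling omitted for brevity.
  tensorflow::ReadBoolFromEnvVar("TF_DETERMINISTIC_OPS",
                                 /*default_val=*/false, &deterministic_ops)
      .IgnoreError();
  tensorflow::ReadBoolFromEnvVar("TF_CUDNN_DETERMINISTIC",
                                 /*default_val=*/false, &cudnn_deterministic)
      .IgnoreError();
  return deterministic_ops || cudnn_deterministic;
}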

duncanriach

comment created time in 2 months

Pull request review comment tensorflow/tensorflow

Add multi-algorithm deterministic cuDNN convolutions

 limitations under the License.
 #include "tensorflow/stream_executor/gpu/gpu_helpers.h"
+namespace stream_executor {
+namespace cuda {
+
+// A helper function to decide whether to enable deterministic cuDNN
+// functionality.
+bool RequireCuDNNDeterminism();

@duncanriach , thanks for looking at these solutions!

I don't have a sense of priority for this PR, so I'm happy to defer the call to you. Depending on the priority, it's fine by me either to hold on this PR (as you seem to suggest) or to query the env var in multiple places (your original commits).

duncanriach

comment created time in 2 months

issue closed tensorflow/tensorflow

tf-gpu==1.13.1 : 35% less batch size before OOM vs tf-gpu==1.11.0

System information

  • Windows 7
  • TensorFlow installed from (source or binary): pip
  • TensorFlow version (use command below): 1.11.0 , 1.13.1
  • Python version: 3.6.5
  • CUDA/cuDNN version: 9/7.1.4 , 10/7.4.1
  • GPU model and memory: GTX 1060 6GB

Describe the current behavior

I have a standard AE network with a pixel shuffler layer.

On tf 1.11.0 with CUDA 9, the maximum batch size for my GTX 1060 6GB is 132, but after upgrading to tf 1.13.1 with CUDA 10, TF cannot handle the same batch size: it produces an OOM error, and the maximum is now 90 for my card.

Describe the expected behavior

Performance is not expected to degrade when upgrading TensorFlow.

Code to reproduce the issue

import numpy as np
import tensorflow as tf
keras = tf.keras
KL = keras.layers
K = keras.backend

bgr_shape = (128, 128, 3)
#batch_size = 132 #max -tf.1.11.0-cuda 9
batch_size = 86 #max -tf.1.13.1-cuda 10
 
class PixelShuffler(keras.layers.Layer):
    def __init__(self, size=(2, 2), data_format=None, **kwargs):
        super(PixelShuffler, self).__init__(**kwargs)
        self.size = size

    def call(self, inputs):

        input_shape = K.int_shape(inputs)
        if len(input_shape) != 4:
            raise ValueError('Inputs should have rank ' +
                             str(4) +
                             '; Received input shape:', str(input_shape))


        batch_size, h, w, c = input_shape
        if batch_size is None:
            batch_size = -1
        rh, rw = self.size
        oh, ow = h * rh, w * rw
        oc = c // (rh * rw)

        out = K.reshape(inputs, (batch_size, h, w, rh, rw, oc))
        out = K.permute_dimensions(out, (0, 1, 3, 2, 4, 5))
        out = K.reshape(out, (batch_size, oh, ow, oc))
        return out

    def compute_output_shape(self, input_shape):

        if len(input_shape) != 4:
            raise ValueError('Inputs should have rank ' +
                             str(4) +
                             '; Received input shape:', str(input_shape))


        height = input_shape[1] * self.size[0] if input_shape[1] is not None else None
        width = input_shape[2] * self.size[1] if input_shape[2] is not None else None
        channels = input_shape[3] // self.size[0] // self.size[1]

        if channels * self.size[0] * self.size[1] != input_shape[3]:
            raise ValueError('channels of input and size are incompatible')

        return (input_shape[0],
                height,
                width,
                channels)

    def get_config(self):
        config = {'size': self.size}
        base_config = super(PixelShuffler, self).get_config()

        return dict(list(base_config.items()) + list(config.items()))
        
def upscale (dim):
    def func(x):
        return PixelShuffler()((KL.Conv2D(dim * 4, kernel_size=3, strides=1, padding='same')(x)))
    return func 
            
inp = KL.Input(bgr_shape)
x = inp
x = KL.Conv2D(128, 5, strides=2, padding='same')(x)
x = KL.Conv2D(256, 5, strides=2, padding='same')(x)
x = KL.Conv2D(512, 5, strides=2, padding='same')(x)
x = KL.Conv2D(1024, 5, strides=2, padding='same')(x)
x = KL.Dense(1024)(KL.Flatten()(x))
x = KL.Dense(8 * 8 * 1024)(x)
x = KL.Reshape((8, 8, 1024))(x)
x = upscale(512)(x)
x = upscale(256)(x)
x = upscale(128)(x)
x = upscale(64)(x)
x = KL.Conv2D(3, 5, strides=1, padding='same')(x)

model = keras.models.Model ([inp], [x])
model.compile(optimizer=keras.optimizers.Adam(lr=5e-5, beta_1=0.5, beta_2=0.999), loss='mae')

training_data = np.zeros ( (batch_size,128,128,3) )
loss = model.train_on_batch( [training_data], [training_data] )
print ("FINE")

Other info / logs

1] 1 Chunks of size 12032 totalling 11.8KiB
2019-02-28 19:45:23.516100: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 4 Chunks of size 19200 totalling 75.0KiB
2019-02-28 19:45:23.517100: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 4 Chunks of size 38400 totalling 150.0KiB
2019-02-28 19:45:23.517100: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 4 Chunks of size 262144 totalling 1.00MiB
2019-02-28 19:45:23.517100: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 1 Chunks of size 368640 totalling 360.0KiB
2019-02-28 19:45:23.517100: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 4 Chunks of size 1179648 totalling 4.50MiB
2019-02-28 19:45:23.517100: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 5 Chunks of size 3276800 totalling 15.63MiB
2019-02-28 19:45:23.517100: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 4 Chunks of size 4718592 totalling 18.00MiB
2019-02-28 19:45:23.520100: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 3 Chunks of size 13107200 totalling 37.50MiB
2019-02-28 19:45:23.520100: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 1 Chunks of size 17028352 totalling 16.24MiB
2019-02-28 19:45:23.521100: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 1 Chunks of size 17694720 totalling 16.88MiB
2019-02-28 19:45:23.521100: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 1 Chunks of size 17694976 totalling 16.88MiB
2019-02-28 19:45:23.521100: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 3 Chunks of size 18874368 totalling 54.00MiB
2019-02-28 19:45:23.521100: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 1 Chunks of size 23592960 totalling 22.50MiB
2019-02-28 19:45:23.521100: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 5 Chunks of size 52428800 totalling 250.00MiB
2019-02-28 19:45:23.529100: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 5 Chunks of size 75497472 totalling 360.00MiB
2019-02-28 19:45:23.529100: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 1 Chunks of size 94371840 totalling 90.00MiB
2019-02-28 19:45:23.530100: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 1 Chunks of size 100362240 totalling 95.71MiB
2019-02-28 19:45:23.530100: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 2 Chunks of size 188743680 totalling 360.00MiB
2019-02-28 19:45:23.530100: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 1 Chunks of size 194688000 totalling 185.67MiB
2019-02-28 19:45:23.530100: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 12 Chunks of size 268435456 totalling 3.00GiB
2019-02-28 19:45:23.530100: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 1 Chunks of size 552317184 totalling 526.73MiB
2019-02-28 19:45:23.530100: I tensorflow/core/common_runtime/bfc_allocator.cc:645] Sum Total of in-use chunks: 5.02GiB
2019-02-28 19:45:23.530100: I tensorflow/core/common_runtime/bfc_allocator.cc:647] Stats:
Limit:                  5838622720
InUse:                  5393793792
MaxInUse:               5708028928
NumAllocs:                     434
MaxAllocSize:           1363673088

2019-02-28 19:45:23.531100: W tensorflow/core/common_runtime/bfc_allocator.cc:271] *****************************************************__**********_*********************************x
2019-02-28 19:45:23.531100: W tensorflow/core/framework/op_kernel.cc:1401] OP_REQUIRES failed at conv_grad_input_ops.cc:1054 : Resource exhausted: OOM when allocating tensor with shape[90,128,64,64] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Traceback (most recent call last):
  File "D:\DeepFaceLab\_internal\bin\DeepFaceLab\test.py", line 87, in <module>
    loss = model.train_on_batch( [training_data], [training_data] )
  File "D:\DeepFaceLab\_internal\bin\lib\site-packages\tensorflow\python\keras\engine\training.py", line 1188, in train_on_batch
    outputs = self.train_function(ins)  # pylint: disable=not-callable
  File "D:\DeepFaceLab\_internal\bin\lib\site-packages\tensorflow\python\keras\backend.py", line 3076, in __call__
    run_metadata=self.run_metadata)
  File "D:\DeepFaceLab\_internal\bin\lib\site-packages\tensorflow\python\client\session.py", line 1439, in __call__
    run_metadata_ptr)
  File "D:\DeepFaceLab\_internal\bin\lib\site-packages\tensorflow\python\framework\errors_impl.py", line 528, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[90,128,64,64] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
         [[{{node training/Adam/gradients/conv2d_1/Conv2D_grad/Conv2DBackpropInput}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

closed time in 2 months

iperov

issue comment tensorflow/tensorflow

tf-gpu==1.13.1 : 35% less batch size before OOM vs tf-gpu==1.11.0

With tf2.0.0 and batch=132, I cannot reproduce the OOM with the garbage collector on. With GC off, I can still see the OOM.

I'll close the bug, as the GC appears to deallocate dead memory.

2020-01-03 14:18:31.905093: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-01-03 14:18:31.913239: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: 
name: TITAN V major: 7 minor: 0 memoryClockRate(GHz): 1.455
pciBusID: 0000:d8:00.0
2020-01-03 14:18:31.913417: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-01-03 14:18:31.914810: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-01-03 14:18:31.915919: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2020-01-03 14:18:31.916171: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2020-01-03 14:18:31.917729: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2020-01-03 14:18:31.918760: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2020-01-03 14:18:31.922199: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-01-03 14:18:31.922733: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2020-01-03 14:18:31.922922: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2020-01-03 14:18:31.964210: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3000000000 Hz
2020-01-03 14:18:31.976496: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x52556b0 executing computations on platform Host. Devices:
2020-01-03 14:18:31.976542: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): Host, Default Version
2020-01-03 14:18:32.117365: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x52b8130 executing computations on platform CUDA. Devices:
2020-01-03 14:18:32.117430: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): TITAN V, Compute Capability 7.0
2020-01-03 14:18:32.118844: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: 
name: TITAN V major: 7 minor: 0 memoryClockRate(GHz): 1.455
pciBusID: 0000:d8:00.0
2020-01-03 14:18:32.118913: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-01-03 14:18:32.118941: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-01-03 14:18:32.118963: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2020-01-03 14:18:32.118985: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2020-01-03 14:18:32.119006: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2020-01-03 14:18:32.119027: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2020-01-03 14:18:32.119049: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-01-03 14:18:32.120211: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2020-01-03 14:18:32.120271: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-01-03 14:18:32.122202: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-01-03 14:18:32.122230: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165]      0 
2020-01-03 14:18:32.122244: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0:   N 
2020-01-03 14:18:32.124553: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:39] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
2020-01-03 14:18:32.124591: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 5537 MB memory) -> physical GPU (device: 0, name: TITAN V, pci bus id: 0000:d8:00.0, compute capability: 7.0)
2020-01-03 14:18:32.911533: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-01-03 14:18:33.885741: W tensorflow/stream_executor/cuda/redzone_allocator.cc:312] Not found: ./bin/ptxas not found
Relying on driver to perform ptx compilation. This message will be only logged once.
2020-01-03 14:18:34.006646: W tensorflow/core/common_runtime/bfc_allocator.cc:239] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.63GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-01-03 14:18:34.006681: W tensorflow/core/common_runtime/bfc_allocator.cc:239] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.63GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-01-03 14:18:34.385341: W tensorflow/core/common_runtime/bfc_allocator.cc:305] Garbage collection: deallocate free memory regions (i.e., allocations) so that we can re-allocate a larger region to avoid OOM due to memory fragmentation. If you see this message frequently, you are running near the threshold of the available device memory and re-allocation may incur great performance overhead. You may try smaller batch sizes to observe the performance impact. Set TF_ENABLE_GPU_GARBAGE_COLLECTION=false if you'd like to disable this feature.
2020-01-03 14:18:34.461297: I tensorflow/stream_executor/cuda/cuda_driver.cc:830] failed to allocate 3.43G (3678928896 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2020-01-03 14:18:34.557533: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-01-03 14:18:35.221099: W tensorflow/core/common_runtime/bfc_allocator.cc:239] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.96GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-01-03 14:18:35.221135: W tensorflow/core/common_runtime/bfc_allocator.cc:239] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.96GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-01-03 14:18:35.641610: W tensorflow/core/common_runtime/bfc_allocator.cc:239] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.34GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-01-03 14:18:35.641645: W tensorflow/core/common_runtime/bfc_allocator.cc:239] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.34GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-01-03 14:18:35.708188: W tensorflow/core/common_runtime/bfc_allocator.cc:239] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.92GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-01-03 14:18:35.708215: W tensorflow/core/common_runtime/bfc_allocator.cc:239] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.92GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-01-03 14:18:35.834415: W tensorflow/core/common_runtime/bfc_allocator.cc:239] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.38GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-01-03 14:18:35.834441: W tensorflow/core/common_runtime/bfc_allocator.cc:239] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.38GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
FINE
iperov

comment created time in 2 months

Pull request review comment tensorflow/tensorflow

Add multi-algorithm deterministic cuDNN convolutions

 limitations under the License.
 #include "tensorflow/stream_executor/gpu/gpu_helpers.h"
+namespace stream_executor {
+namespace cuda {
+
+// A helper function to decide whether to enable deterministic cuDNN
+// functionality.
+bool RequireCuDNNDeterminism();

Thanks so much for that detailed description, @timshen91. I can see the path to implementing that pretty clearly. I do have a concern with this approach though. A global initializer is required because there is no triggering event to set the internal state. I imagine that a C++ global initializer will run when TensorFlow is imported into Python. Do you know if that is true?

In general, global initializers run when the .so file is dlopen()ed, but I'm not sure when exactly.

If it's true that the C++ global initializer will run when TensorFlow is imported into Python, then it would be necessary to set TF_CUDNN_DETERMINISTIC or TF_DETERMINISTIC_OPS before importing TensorFlow. This would be quite a significant change in the API for current users, which would need to be conveyed for versions of TensorFlow beyond a certain point (e.g v2.2) and would also require user code changes (existing code would break).

The global initializer part was more of a straw proposal, given that I don't know enough about tensorflow-determinism. How about using std::once in RequireDeterminism(), so that the initialization is deferred until the first check?

Also, our kernel tests set these environment variables in python-main before starting the tests. I don't know if it's possible, or desirable, to set the environment variables prior to the import of TensorFlow. I don't know how to achieve that in the test files and/or the bazel build config (using currently available functionality).

The way it works currently in the master branch is that the first time the state of one of these environment variables is needed, it gets referenced and cached.

This sounds like the std::once behavior I described above.
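A minimal sketch of the std::once idea, assuming a plain getenv check (the names and parsing below are illustrative, not the PR's actual code): the environment variable is read and cached on the first call, so it only needs to be set before the first op that consults it, not before the shared library is loaded.

#include <cstdlib>
#include <cstring>
#include <mutex>

bool RequireDeterminism() {
  static std::once_flag once;
  static bool require_determinism = false;
  std::call_once(once, [] {
    // Read and cache the environment variable on first use.
    const char* value = std::getenv("TF_CUDNN_DETERMINISTIC");
    require_determinism =
        value != nullptr &&
        (std::strcmp(value, "1") == 0 || std::strcmp(value, "true") == 0);
  });
  return require_determinism;
}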

The recommended, and most common, usage pattern is to set these environment variables using os.environ after importing TensorFlow and before building or training a graph. I've been trying to replicate this functionality even after the need to reference those environment variables has bled out of stream executor.

duncanriach

comment created time in 2 months

Pull request review comment tensorflow/tensorflow

Add multi-algorithm deterministic cuDNN convolutions

 limitations under the License.
 #include "tensorflow/stream_executor/gpu/gpu_helpers.h"
+namespace stream_executor {
+namespace cuda {
+
+// A helper function to decide whether to enable deterministic cuDNN
+// functionality.
+bool RequireCuDNNDeterminism();

Hi @duncanriach ,

Thanks for the detailed survey and context of tensorflow-determinism.

I have been communicating the status and recipes for TensorFlow determinism here, and I don't want to make that information unnecessarily complicated for the user.

Agreed. I think we can achieve both "let users control determinism with a single env var" and "use multiple internal global states". I'll describe a revised proposal below.

I'm not sure what you mean by "a flag in TF." Perhaps you're referring to adding something to tf.config? Or are you just referring to what is represented by RequireCuDNNDeterminism() (the logical OR of TF_CUDNN_DETERMINISTIC with TF_DETERMINISTIC_OPS)?

For the context, I was referring to env vars like TF_CUDNN_DETERMINISTIC as "a flag in TF".

Regarding the "Ideal Solution": it sounds plausible, with the following complement for a more long-term solution, at a hopefully reasonable implementation cost (a rough sketch of the StreamExecutor piece follows the references):

  • Move RequireCuDNNDeterminism() to tensorflow/common_runtime/gpu:gpu_determinism (already described).
  • For XLA, create a new DebugOption flag "xla_deterministic_autotuning" [1] and append [2] it from tensorflow/common_runtime/gpu:gpu_determinism, preferably in a global initializer.
  • For StreamExecutor, create a virtual void DnnSupport::SetRequireDeterminism(), and call it from the global initializer in gpu_determinism. SetRequireDeterminism() sets a bool member in CudnnSupport that turns on and off determinism. DnnSupport is a singleton and can be accessed from StreamExecutor::AsDnn().

[1] https://github.com/tensorflow/tensorflow/blob/cf8929973986d1eeac6acc10cb36b17075ca075e/tensorflow/compiler/xla/xla.proto#L25 [2] https://github.com/tensorflow/tensorflow/blob/96de19babd874ea6be7674520e0d8351ca89a749/tensorflow/compiler/xla/debug_options_flags.h#L29
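A rough sketch of the StreamExecutor piece of this proposal (DnnSupport, CudnnSupport, and StreamExecutor::AsDnn() are existing names; everything else is an assumption about how the wiring could look, not an actual patch):

class DnnSupport {
 public:
  // Default is a no-op so backends without a notion of determinism can ignore
  // the request.
  virtual void SetRequireDeterminism(bool require) {}
  // ...
};

class CudnnSupport : public DnnSupport {
 public:
  void SetRequireDeterminism(bool require) override {
    require_determinism_ = require;
  }

 private:
  // Consulted when choosing cuDNN algorithms.
  bool require_determinism_ = false;
};

// In tensorflow/common_runtime/gpu:gpu_determinism, a global initializer (or a
// first-use hook) would forward the user's choice, e.g.:
//   stream_exec->AsDnn()->SetRequireDeterminism(RequireCudnnDeterminism());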

duncanriach

comment created time in 2 months

Pull request review comment tensorflow/tensorflow

Add multi-algorithm deterministic cuDNN convolutions

 limitations under the License.
 #include "tensorflow/stream_executor/gpu/gpu_helpers.h"
+namespace stream_executor {
+namespace cuda {
+
+// A helper function to decide whether to enable deterministic cuDNN
+// functionality.
+bool RequireCuDNNDeterminism();

It seems weird for TF and XLA to ask for flags from StreamExecutor: StreamExecutor is just a library, and it should not impose constraints on its users about how to use it.

Ideally, the source of truth for "users require determinism" should be a flag in TF, which then gets passed down to XLA. Both TF and XLA pass down the same boolean to StreamExecutor. But this ideal solution creates an unnecessary amount of work.

How about this:

  • Create a TF flag TF_DETERMINISTIC_AUTOTUNING in tensorflow/core/kernel/gpu_utils.h, which is only used by TF kernels.
  • Create a similar XLA flag to DebugOptions in tensorflow/compiler/xla/xla.proto, which is used by XLA.
  • Keep the current StreamExecutor flags as they are, privately in cuda_dnn.cc.

When the end user wants determinism, they have to compose all necessary flags. This way, at least the layering is less coupled.

duncanriach

comment created time in 3 months

push event google/nvidia_libs_test

Nobody

commit sha 501311177f2a296d8a65c25f98d16d6b307b6d24

Project import generated by Copybara. FolderOrigin-RevId: /google/src/cloud/timshen/nvidia/google3/.

view details

push time in 3 months

push event google/nvidia_libs_test

Nobody

commit sha 427e020ce00b84947b08f67f263f5ca43b44eb1b

Project import generated by Copybara. FolderOrigin-RevId: /google/src/cloud/timshen/clean2/google3/.

view details

push time in 3 months

Pull request review comment tensorflow/tensorflow

Add support to cuDNN CTC loss

 string ConvolutionDescriptor::ToShortString() const {
   return desc;
 }
+// -- CtcLossDescriptor
+//
+CtcLossDescriptor::CtcLossDescriptor() {}

These can be defined in the header (or left omitted).

houtoms

comment created time in 3 months

Pull request review comment tensorflow/tensorflow

Add support to cuDNN CTC loss

 port::Status CudnnSupport::DoRnnBackwardImpl(
   return port::Status::OK();
 }
+port::Status CudnnSupport::DoCtcLossImpl(
+    Stream* stream, const CudnnRnnStateTensorDescriptor& probs_desc,
+    const DeviceMemoryBase probs_data,
+    const absl::Span<const int32>& labels_data,
+    const absl::Span<const int32>& labels_lengths_data,
+    const absl::Span<const int32>& input_lengths_data,
+    DeviceMemoryBase costs_data,
+    const CudnnRnnStateTensorDescriptor& grads_desc,
+    DeviceMemoryBase grads_data,
+    const CudnnCtcLossDescriptor& ctc_loss_desc,
+    ScratchAllocator* workspace_allocator) {
+  auto cudnn = cudnn_->GetHandle(parent_, stream);
+
+  SE_ASSIGN_OR_RETURN(DeviceMemory<uint8> workspace,
+                      CreateCtcLossWorkspace(stream, cudnn, ctc_loss_desc,

Can you surface this to ThenCtcLoss, so that DnnSupport's DoCtcLoss takes a workspace memory instead of a workspace allocator?

houtoms

comment created time in 3 months

Pull request review comment tensorflow/tensorflow

Add support to cuDNN CTC loss

 port::StatusOr<DeviceMemory<uint8>> CreateBatchNormBackwardWorkspace(
   }
   return workspace_allocator->AllocateBytes(workspace_size_in_bytes);
 }
+
+port::StatusOr<DeviceMemory<uint8>> CreateCtcLossWorkspace(
+    Stream* stream, const CudnnHandle& cudnn,
+    const CudnnCtcLossDescriptor& ctc_loss_desc,
+    const CudnnRnnStateTensorDescriptor& probs_desc,
+    const CudnnRnnStateTensorDescriptor& grads_desc,
+    const absl::Span<const int32>& labels_data,

All const absl::Span<const int32>& should be absl::Span<const int>. Span acts like a pointer, so there is no need to pass by reference.

houtoms

comment created time in 3 months

Pull request review comment tensorflow/tensorflow

Add support to cuDNN CTC loss

 class DnnSupport {
     return false;
   }
+  // Enqueue a CTC Loss operation onto the stream.
+  //
+  // Arguments:
+  //  stream: pointer to the stream where this operation should be enqueued to.
+  //  probs_desc: specifies the shape and the data layout of the input tensor.
+  //  probs_data: the device memory region that contains the input tensor.
+  //  labels_data: the device memory region that contains the labels_value
+  //    tensor.
+  //  labels_lengths_data: the device memory region that contains the
+  //    labels_lengths tensor
+  //  input_lengths_data: the device memory region that contains the seq_lengths
+  //    tensor
+  //  costs_data: the device memory region that contains the costs tensor.
+  //  grads_desc: specifies the shape and the data layout of the grads tensor.
+  //  grads_data: the device memory region that contains the grads tensor.
+  //  ctc_loss_desc: a CTCLoss descriptor created by createCTCLossDescriptor.
+  //  workspace_allocator: a memory allocator that creates the temporary
+  //    workspace memory used by this operation. The caller is responsible for
+  //    keeping the memory alive long enough for this operation, and recylces
+  //    afterwards.
+  virtual bool DoCtcLoss(Stream* stream,

Please refer to the comment for class DnnSupport, and PrepareForConvolution.

The file itself is already inconsistent. It was way too verbose and I simplified Convolution-related APIs. In the spirit of being as thin of a wrapper as possible (aka simplifying DnnSupport itself), we want to be close to cuDNN / MIOpen without unnecessary cleverness. Please also refer to the comments for DnnSupport class.

Do I need to generalize the element type in this PR, considering that the current cuDNN only supports float?

I'd argue it's easier to implement and maintain by adding the element type. The key point is that the element type is just passed down to cuDNN without being much inspected by StreamExecutor.

For example, StreamExecutor doesn't have to know which types are supported vs which are not (because that may change with a different version of cuDNN). If a type is not supported, cuDNN will generate an error, and StreamExecutor just forwards it transparently.

houtoms

comment created time in 3 months

Pull request review comment tensorflow/tensorflow

Add support to cuDNN CTC loss

 class RnnDescriptor {
   virtual ParamsRegions ParamsBiasRegions() const { return ParamsRegions(); }
 };
+// Specifies the CTC Loss computation.
+//
+// The user is responsible for releasing this descriptor when it is no longer
+// in use. The destructor releases the underlying descriptors.
+class CtcLossDescriptor {

This file is not consistent with itself in how it allows implementations to customize the data structures. For example, ConvolutionDescriptor isn't virtual. I think we can start with a simple class and make it virtual later.

houtoms

comment created time in 3 months

Pull request review comment tensorflow/tensorflow

Add support to cuDNN CTC loss

 class RnnDescriptor {
   virtual ParamsRegions ParamsBiasRegions() const { return ParamsRegions(); }
 };
+// Specifies the CTC Loss computation.
+//
+// The user is responsible for releasing this descriptor when it is no longer
+// in use. The destructor releases the underlying descriptors.
+class CtcLossDescriptor {

This file is not consistent in how it allows implementations to customize the data structures. For example, ConvolutionDescriptor isn't virtual. I think we can start with a simple class and make it virtual later.

houtoms

comment created time in 3 months

Pull request review comment tensorflow/tensorflow

Add support to cuDNN CTC loss

 class RnnDescriptor {
   virtual ParamsRegions ParamsBiasRegions() const { return ParamsRegions(); }
 };
+// Specifies the CTC Loss computation.
+//
+// The user is responsible for releasing this descriptor when it is no longer
+// in use. The destructor releases the underlying descriptors.
+class CtcLossDescriptor {

Instead of having a base class, can we simply have:

class CtcLossDescriptor {
 public:
  explicit CtcLossDescriptor(DataType);
  DataType GetComputeType() const;

 private:
  DataType comp_type_;
};
houtoms

comment created time in 4 months

Pull request review comment tensorflow/tensorflow

Add support to cuDNN CTC loss

 class DnnSupport {
     return false;
   }
+  // Enqueue a CTC Loss operation onto the stream.
+  //
+  // Arguments:
+  //  stream: pointer to the stream where this operation should be enqueued to.
+  //  probs_desc: specifies the shape and the data layout of the input tensor.
+  //  probs_data: the device memory region that contains the input tensor.
+  //  labels_data: the device memory region that contains the labels_value
+  //    tensor.
+  //  labels_lengths_data: the device memory region that contains the
+  //    labels_lengths tensor
+  //  input_lengths_data: the device memory region that contains the seq_lengths
+  //    tensor
+  //  costs_data: the device memory region that contains the costs tensor.
+  //  grads_desc: specifies the shape and the data layout of the grads tensor.
+  //  grads_data: the device memory region that contains the grads tensor.
+  //  ctc_loss_desc: a CTCLoss descriptor created by createCTCLossDescriptor.
+  //  workspace_allocator: a memory allocator that creates the temporary
+  //    workspace memory used by this operation. The caller is responsible for
+  //    keeping the memory alive long enough for this operation, and recylces
+  //    afterwards.
+  virtual bool DoCtcLoss(Stream* stream,

Can we have something like:

virtual bool DoCtcLoss(
  Stream* stream,
  dnn::DataType element_type,
  const dnn::RnnStateTensorDescriptor &probs_desc,
  DeviceMemoryBase probs_data,
  absl::Span<const int32> labels_data,
  absl::Span<const int32> labels_lengths_data,
  absl::Span<const int32> input_lengths_data,
  DeviceMemoryBase costs_data,
  const dnn::RnnStateTensorDescriptor &grads_desc,
  DeviceMemoryBase grads_data,
  const dnn::CtcLossDescriptor &ctc_loss_desc,
  DeviceMemoryBase scratch_memory);

? A few points:

  • Pass DeviceMemory and Span by value. They are already pointer-like types, so there is no need to pass them by reference.
  • Generalize on element type, so that one function can support multiple element types, instead of just float. See PrepareForConvolution, the virtual DoConvolve, and the class comment of DnnSupport.
  • Separate scratch memory allocation from this function. Users should allocate the scratch memory separately. This is also consistent with cuDNN's API.
houtoms

comment created time in 4 months

Pull request review comment tensorflow/tensorflow

Add support to cuDNN CTC loss

 struct PersistentRnnPlanDeleter {
     CHECK_CUDNN_OK(cudnnDestroyPersistentRNNPlan(plan));
   }
 };
+#if CUDNN_VERSION >= 7601

StreamExecutor has its own CUDNN handle, and it'd be better for kernels to use the same handle (hence the one in StreamExecutor).

houtoms

comment created time in 4 months

issue comment tensorflow/tensorflow

How to get cublas handle to run cublas function?

@7oud For your own debugging and understanding, it's probably fine to add a getter for blas_.

For a real solution, I suggest that you do the following (a rough sketch follows the list):

  • Find a way to transport arbitrary data from your application to the execution of your custom op
  • Create your permanent BLAS handle before executing the graph, and expose it to the custom op executor
  • In the custom op executor, use the permanent BLAS handle you created.

Unfortunately I don't know TF enough to tell you where those places are in code.
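To make the suggestion concrete, here is a hedged sketch of a permanent cuBLAS handle used from a custom op's compute function. It uses only the cuBLAS C API; how you obtain the op's CUDA stream depends on your integration, so that part is a placeholder.

#include <cublas_v2.h>
#include <cuda_runtime.h>

// A process-wide cuBLAS handle created once and reused across op invocations.
class PersistentCublasHandle {
 public:
  static cublasHandle_t Get() {
    static PersistentCublasHandle instance;  // created on first use
    return instance.handle_;
  }

 private:
  PersistentCublasHandle() { cublasCreate(&handle_); }
  ~PersistentCublasHandle() { cublasDestroy(handle_); }
  cublasHandle_t handle_ = nullptr;
};

// Called from the custom op's compute function; `stream` is whatever CUDA
// stream your op runs on (obtaining it from the op context is up to you).
void RunGemm(cudaStream_t stream, int n, const float* a, const float* b,
             float* c) {
  cublasHandle_t handle = PersistentCublasHandle::Get();
  cublasSetStream(handle, stream);  // enqueue on the op's stream
  const float alpha = 1.0f, beta = 0.0f;
  cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n, &alpha, a, n, b, n,
              &beta, c, n);
}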

7oud

comment created time in 4 months

issue comment tensorflow/tensorflow

How to get cublas handle to run cublas function?

As of today, the cuBLAS handle is for TensorFlow internal use only. It's not exposed through the public C++ API; see tensorflow/stream_executor/cuda/cuda_blas.h, where there are no getters for the handle blas_.

I don't know much about TF custom ops, but I'd expect the executor to have somewhere for users to keep a handle.

7oud

comment created time in 5 months

started google/android-ui-toolkit-demos

started time in 5 months
