Duncan Riach (duncanriach), NVIDIA, Santa Clara, CA, USA

NVIDIA/tensorflow-determinism 106

Tracking, debugging, and patching non-determinism in TensorFlow

duncanriach/pytorch 0

Tensors and Dynamic neural networks in Python with strong GPU acceleration

duncanriach/tensorflow 0

An Open Source Machine Learning Framework for Everyone

duncanriach/tensorflow-models 0

Models and examples built with TensorFlow

delete branch NVIDIA/tensorflow-determinism

delete branch: doc

delete time in 36 minutes

push event NVIDIA/tensorflow-determinism

Duncan Riach

commit sha 4b58f2ec621456121d4ab31272c051b6c18a9b18

Add info about fused softmax/cross-entropy non-determinism

view details

push time in 36 minutes

push event NVIDIA/tensorflow-determinism

Duncan Riach

commit sha 4b58f2ec621456121d4ab31272c051b6c18a9b18

Add info about fused softmax/cross-entropy non-determinism

view details

push time in 41 minutes

create branch NVIDIA/tensorflow-determinism

branch: doc

created branch time in an hour

pull request comment horovod/horovod

Add grouped allreduce feature

FYI, @romerojosh and I had a one-on-one and discussed this current PR. I'm now fully, and finally, up-to-speed on it. :-)

romerojosh

comment created time in 3 days

delete branch NVIDIA/tensorflow-determinism

delete branch: doc

delete time in 3 days

push event NVIDIA/tensorflow-determinism

Duncan Riach

commit sha 4998692eaf5f4ae4df13444995f306d399efc637

Improve reference to Horovod PR 1130

view details

push time in 3 days

push event NVIDIA/tensorflow-determinism

Duncan Riach

commit sha 4998692eaf5f4ae4df13444995f306d399efc637

Improve reference to Horovod PR 1130

view details

push time in 3 days

push event NVIDIA/tensorflow-determinism

Duncan Riach

commit sha 450b7a9da2b982e16b398c2b7e14726b23cf2d22

Improve reference to Horovod PR 1130

view details

push time in 3 days

create branch NVIDIA/tensorflow-determinism

branch: doc

created branch time in 3 days

issue opened tensorflow/tensorflow

GPU-deterministic back-prop for fused softmax/cross-entropy ops

System information

  • TensorFlow version (you are using): 2.2.0-rc2
  • Are you willing to contribute it: Yes (please assign it to me)

Current Behavior

The back-prop of tf.nn.softmax_cross_entropy_with_logits and tf.nn.sparse_softmax_cross_entropy_with_logits is non-deterministic on GPUs.

Will this change the current API?

Assuming the deterministic back-prop kernels are slower than the current non-deterministic ones, the deterministic operation will be selectable using the preferred mechanism at the time. At the time of writing, that mechanism is to set the environment variable TF_DETERMINISTIC_OPS to "1" or "true".
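For example, a minimal sketch of that mechanism (setting the variable before TensorFlow is imported is the safest approach):

import os

# Request deterministic op implementations; set before importing TensorFlow
# so that the setting is in place when kernels are selected.
os.environ["TF_DETERMINISTIC_OPS"] = "1"

import tensorflow as tf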

Who will benefit with this feature?

Determinism, for both training and inference, is becoming increasingly important as deep learning systems are moved into production, not only because of the regulatory requirements of some markets but also, more broadly, because of the massive potential performance advantages of training with determinism: determinism accelerates and facilitates debugging, experimentation (including hyper-parameter tuning and active learning), and regression testing.

With TF_DETERMINISTIC_OPS now enabling deterministic functionality of cuDNN convolution, bias addition, max-pooling, and CTC loss, many deep learning models will train deterministically on GPUs. Softmax and cross-entropy are both foundational functions in deep learning, and are combined into fused ops for performance and numerical stability. Enabling these fused ops to function deterministically will make TensorFlow more suitable for a variety of production systems.

Unit Tests

What follows are production-ready TensorFlow unit tests that currently fail but will pass when this feature is implemented correctly. The tests can be seen running (and failing) on TensorFlow version 2.2.0-rc2 in this Colab.

import tensorflow as tf
import numpy as np

class DeterministicTest(tf.test.TestCase):

  def _randomInts(self, shape, high, dtype):
    return tf.constant(
        np.random.randint(low=0, high=high, size=shape).astype(dtype))

  def _randomFloats(self, shape, dtype, normalized_rows=False):
    a = (2 * np.random.random_sample(shape) - 1).astype(dtype)

    if normalized_rows:

      def normalize(row):
        return row / row.sum()

      a = np.apply_along_axis(normalize, 1, a)

    # print("normalize: %r" % normalized_rows)
    # print(a)

    return tf.constant(a)
    
  def _testDeterministicGradients(self, exclusive_labels):
    with self.session(force_gpu=True):
      batch_size = 1024
      classes_count = 1000
      logits_shape = (batch_size, classes_count)
      logits_dtype = np.float32
      logits = self._randomFloats(logits_shape, logits_dtype)
      if exclusive_labels:
        labels_shape = (batch_size,)
        labels_dtype = np.int32
        labels = self._randomInts(labels_shape, classes_count, labels_dtype)
      else:
        labels_shape = logits_shape
        labels_dtype = logits_dtype
        labels = self._randomFloats(labels_shape, labels_dtype,
                                    normalized_rows=True)
      output_shape = (batch_size,)
      output_dtype = logits_dtype

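      # Multiplying the op output by random upstream gradients (the
      # "gradient injector" below) exercises the op's gradient function
      # with a non-trivial incoming gradient signal.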
      def gradients(local_seed):
        np.random.seed(local_seed)
        upstream_gradients = self._randomFloats(output_shape, output_dtype)
        with tf.GradientTape(persistent=True) as tape:
          tape.watch(logits)
          if exclusive_labels:
            tested_op = tf.nn.sparse_softmax_cross_entropy_with_logits
          else:
            tested_op = tf.nn.softmax_cross_entropy_with_logits
          op_output = tested_op(labels=labels, logits=logits)
          gradient_injector_output = op_output * upstream_gradients
        return tape.gradient(gradient_injector_output, logits)

      repeat_count = 5
      for seed in range(repeat_count):
        result_a = gradients(seed)
        result_b = gradients(seed)
        self.assertAllEqual(result_a, result_b)

  def testExclusiveLabelsDeterministicGradients(self):
    self._testDeterministicGradients(exclusive_labels=True)

  def testDistributionLabelsDeterministicGradients(self):
    self._testDeterministicGradients(exclusive_labels=False)

if __name__ == '__main__':
  tf.test.main()

created time in 4 days

issue commentNVIDIA/tensorflow-determinism

Getting OpenNMT-tf to train reproducibly

Hi @atebbifakhr,

After further investigation, there seem to be at least three sources of non-determinism in this system.

  1. Confirmed that back-prop of tf.nn.sparse_softmax_cross_entropy_with_logits does inject non-determinism.
  2. Discovered that tf.keras.optimizers.Optimizer::apply_gradients seems to inject non-determinism when it reduces gradients across the batch.
  3. Discovered that the “inputters”, which do some kind of mapping through a (trainable?) word embedding, are injecting non-determinism by making the samples applied to the model non-reproducible after the first step.

There is a lot more work to do on this issue, but I wanted to give you an interim update.

I've also added your name to the credits section of this repo in recognition of your effort in enabling me to reproduce the problems you've been seeing.

atebbifakhr

comment created time in 4 days

push event NVIDIA/tensorflow-determinism

Duncan Riach

commit sha ca08fd2793e9d779f568bdc9e34b4a12aa4b10d1

Small clarification

view details

push time in 4 days

delete branch NVIDIA/tensorflow-determinism

delete branch: doc

delete time in 4 days

push event NVIDIA/tensorflow-determinism

Duncan Riach

commit sha 105a96ba0ef2bc8b3d9c6ac1ea4fce579433fde4

Improve references to GitHub issues

view details

push time in 4 days

push event NVIDIA/tensorflow-determinism

Duncan Riach

commit sha 105a96ba0ef2bc8b3d9c6ac1ea4fce579433fde4

Improve references to GitHub issues

view details

push time in 4 days

push event NVIDIA/tensorflow-determinism

Duncan Riach

commit sha 3a701d2467a35980918ff11604e5079b61e0803a

Improve references to GitHub issues

view details

push time in 4 days

create branch NVIDIA/tensorflow-determinism

branch: doc

created branch time in 4 days

issue comment tensorpack/tensorpack

How to run Tensorpack training with deterministic behavior

Hi @jwook1004, several sources of GPU-related non-determinism have now been addressed in TensorFlow. You may find that setting the environment variable TF_DETERMINISTIC_OPS to "true" or "1" resolves the issues you were seeing.

Work is ongoing, but for more information, please see https://github.com/NVIDIA/tensorflow-determinism

jwook1004

comment created time in 4 days

issue comment tensorflow/tensorflow

CUDA implementation of BiasAddGrad op is non-deterministic

@eamartin, this issue was resolved in Oct 2019 by PR 31465. In TensorFlow version 2.1 and onwards, you can enable the solution by setting the environment variable TF_DETERMINISTIC_OPS to "1" or "true".

For more information, see https://github.com/NVIDIA/tensorflow-determinism.

eamartin

comment created time in 4 days

issue comment tensorflow/tensorflow

Feature Request: Support for configuring deterministic options of cudNN Conv routines

I have resolved this issue, but I don't have permission to close it. @yoavz, could you please close it?

yoavz

comment created time in 4 days

push event NVIDIA/tensorflow-determinism

Duncan Riach

commit sha 412918fc4115e764be9b0fa8ce7e50e1e887006c

Note that TF PR 38089 is closed

view details

push time in 5 days

PR closed tensorflow/tensorflow

Add reminder to test deterministic cuDNN CTC loss (labels: cla: yes, size: XS)

@sanjoy added deterministic cuDNN CTC loss, enabled via TF_DETERMINISTIC_OPS, with this commit. This pull request places a reminder in cudnn_deterministic_base.py for me to add a test for it.

+2 -1

1 comment

1 changed file

duncanriach

pr closed time in 5 days

pull request comment tensorflow/tensorflow

Add reminder to test deterministic cuDNN CTC loss

Moved this to issue 38151, as suggested by @sanjoy. Closing.

duncanriach

comment created time in 5 days

issue comment tensorflow/tensorflow

Add reminder to test deterministic cuDNN CTC loss

Would someone with the authority to do so please assign this issue to me?

duncanriach

comment created time in 5 days

issue opened tensorflow/tensorflow

Add reminder to test deterministic cuDNN CTC loss

@sanjoy added deterministic cuDNN CTC loss, enabled via TF_DETERMINISTIC_OPS, with this commit. This issue is a reminder to add a test for it in tensorflow/python/kernel_tests/cudnn_deterministic_base.py.

created time in 5 days

push event NVIDIA/tensorflow-determinism

Duncan Riach

commit sha faaecdf7cb6346af33ff0c8bbcb09ea2f2070f50

Add reference to TF PR 38089

view details

push time in 6 days

PR opened tensorflow/tensorflow

Add reminder to test deterministic cuDNN CTC loss

@sanjoy added deterministic cuDNN CTC loss, enabled via TF_DETERMINISTIC_OPS, with this commit. This pull request places a reminder in cudnn_deterministic_base.py for me to add a test for it.

+2 -1

0 comments

1 changed file

pr created time in 6 days

create branch duncanriach/tensorflow

branch: deterministic-cudnn-ctc-loss

created branch time in 6 days

push event duncanriach/tensorflow

A. Unique TensorFlower

commit sha dce540f5c7f60ebd21e616f70a16a9e8ec0fa7bf

Go: Update generated wrapper functions for TensorFlow ops. PiperOrigin-RevId: 301019734 Change-Id: I3fd0e9f587bd4ebae09c657ca6e3cce5917064ea

view details

Anjali Sridhar

commit sha d08e6aeb4944364a28284fa9f8ced211ac936699

Fix docs for tf.distribute.DistributedValues. PiperOrigin-RevId: 301037707 Change-Id: I09630bf534a6e99fe1a197cd0fb5bbfe87fec3fc

view details

Davide Libenzi

commit sha 49dc8bb1d6c0ea8fc10dc4763c63dd5b42cea360

Add utility API to implement strided dynamic slice. PiperOrigin-RevId: 301039665 Change-Id: Ib7bf50a7c27bb531561db482e33bab7b54b1769d

view details

A. Unique TensorFlower

commit sha 7b4913a4cd8b7101920fbacab28a34d61ddf0c8a

Go: Update generated wrapper functions for TensorFlow ops. PiperOrigin-RevId: 301042590 Change-Id: I668348f22461b6d993bc153ee1dc9e0f397ac68d

view details

A. Unique TensorFlower

commit sha b660c88c055be40959030fc754e90a1fdc6f1389

Go: Update generated wrapper functions for TensorFlow ops. PiperOrigin-RevId: 301051233 Change-Id: I4fb455505d0f503c01a29a47c0c6185f87b55ad6

view details

Thomas O'Malley

commit sha eea7bbc772ad2a12b24a83449258fa582215715b

Allow unused keys when a dict is passed to a single-input Functional API model. Ensure that the key mapping to the name of the Input is used during Model execution. PiperOrigin-RevId: 301053395 Change-Id: I7f5bfffc3e034b064b3cd4129e07f000df11cb6b

view details

A. Unique TensorFlower

commit sha 4161fe327c5073850e96682842e482c548839087

Go: Update generated wrapper functions for TensorFlow ops. PiperOrigin-RevId: 301069384 Change-Id: Ice98669bb20ea6c9952795b8f5df2d7a85a2ef6b

view details

Juho Ha

commit sha 1e22d3d42879dd3ed61ddaa8911d9889c9d3cc91

Hide class definition of ATraceProfiler PiperOrigin-RevId: 301071805 Change-Id: I26d6911da93ae31f62385e0a54bd872370a3e834

view details

A. Unique TensorFlower

commit sha abad79508b4fe77fa4bc01fc79a64b45efa49829

Go: Update generated wrapper functions for TensorFlow ops. PiperOrigin-RevId: 301099483 Change-Id: I836e7503f42588214edf8568d148365d196f3e43

view details

A. Unique TensorFlower

commit sha 5f0594c4d554257495ac606fee38b84424ac57b0

compat: Update forward compatibility horizon to 2020-03-16 PiperOrigin-RevId: 301112127 Change-Id: I4bcb86232a7d536d918a05b2d77c7eef901ce936

view details

A. Unique TensorFlower

commit sha 91e2e5745fed63f3885c5537810fad79dfde271c

Add MetadataDisplayer.with_model_buffer to allow instantiations with buffers PiperOrigin-RevId: 301114446 Change-Id: Ia4957233999ed1168280937755a470354f3aa16a

view details

A. Unique TensorFlower

commit sha 2ec7a67766ab00dc35d1b16e5184df5b8cdbdf96

Go: Update generated wrapper functions for TensorFlow ops. PiperOrigin-RevId: 301125439 Change-Id: Ia7dcce421d36cda669b426664bd8d6ddaacf2938

view details

A. Unique TensorFlower

commit sha faa85ab76d9188a4bf6f231f696157fa29ca8d3d

Move forward compatibility date by 3 months. PiperOrigin-RevId: 301130212 Change-Id: I2eef76e7b4fc22723978b353dc3896500107627e

view details

Fabio Di Domenico

commit sha 5ae1f6d934008c7a5c6f094202ae738574e4487e

Expose ability to enable NNApi in C api

view details

A. Unique TensorFlower

commit sha cd2466610c5dda695f95ab85fd6211c9e30d835b

Go: Update generated wrapper functions for TensorFlow ops. PiperOrigin-RevId: 301139083 Change-Id: Icbd18c2b334a0d095ecbe21c742717cbd0c07b76

view details

A. Unique TensorFlower

commit sha 636b812bb3d95ce26bba5fb6d5f918a8a7bc2c88

Go: Update generated wrapper functions for TensorFlow ops. PiperOrigin-RevId: 301155870 Change-Id: I027d272f792fa52c47ceafc3d419a57a3412cfdf

view details

Anjali Sridhar

commit sha 32f5b7dd76eb11608534e5b31f0ab5b3b635fb3d

Fix doc string for `experimental_distribute_values_from_function`. PiperOrigin-RevId: 301157595 Change-Id: I89479434938e0828e9a0db775f4b5448f30dbab2

view details

T.J. Alumbaugh

commit sha 6072451f8b17d05336407aeeead6c5455a7dfc6a

Initial support for Einsum op in TF Lite PiperOrigin-RevId: 301159606 Change-Id: Ie8d9ef631ea64061423f95c06050b6329c844847

view details

TensorFlower Gardener

commit sha 3e1cc6b6aa56aa16f9241fddad52c9422e8cbe65

Merge pull request #37589 from Nick-Morgan:master PiperOrigin-RevId: 301162565 Change-Id: I4c9039cae0ed241cd4ddcb26e39d14c14d1b95ae

view details

TensorFlower Gardener

commit sha b740eba8a72cc7af6591e36361f86897efdf1790

Merge pull request #37592 from ashutosh1919:image_rgb_to_yub PiperOrigin-RevId: 301162969 Change-Id: Ibacb378c248406661e6692a22e73f983607df11b

view details

push time in 6 days

delete branch duncanriach/tensorflow

delete branch: deterministic-xla-reductions-follow-up

delete time in 6 days

issue closed NVIDIA/tensorflow-determinism

Will this package work on tf.keras model?

Regarding "stock tensorflow 2.1", I installed tensorflow by using pip install tensorflow-gpu==2.1.0, shall I need strictly follow the pip installation command in your README file?

Also, if using tf.keras, shall I expect to get deterministic experiment results?

closed time in 6 days

leocnj

issue comment NVIDIA/tensorflow-determinism

Will this package work on tf.keras model?

No, pip install tensorflow-gpu==2.1.0 and pip install tensorflow==2.1.0 do the same thing. With TensorFlow 2.1 and onwards, tensorflow-gpu and tensorflow are essentially the same package. I chose to show the newer way of doing things in my README file.

Yes, using tf.keras, you should get all the deterministic goodness that is in TensorFlow. Remember that there are some TensorFlow ops, which may be exposed through Keras, that will inject non-determinism, so keep an eye out for that.

Use the following function to summarize your model weights at the end of training. If the summary is the same on separate training runs, then your model is training with perfect reproducibility.

def summarize_weights(model):
  w = model.get_weights()
  return sum(map(lambda x: x.sum(), w))

Note: I intend to add this function to the tensorflow-determinism package at some point.
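A hypothetical usage sketch (train_model is a stand-in for your own training routine, not a real API):

# Train twice with identical seeds and compare the weight summaries.
model_a = train_model()  # placeholder for your own training code
model_b = train_model()
print(summarize_weights(model_a) == summarize_weights(model_b))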

leocnj

comment created time in 6 days

push event NVIDIA/tensorflow-determinism

Duncan Riach

commit sha 40d56b4f68af9af1df05e640f47653bbfa9262a1

Improve stock TF commit description

view details

push time in 10 days

delete branch NVIDIA/tensorflow-determinism

delete branch: doc

delete time in 10 days

push event NVIDIA/tensorflow-determinism

Duncan Riach

commit sha efb143b868ea9ff97540df1a8e179d56324219ef

Track addition of deterministic cuDNN CTC loss

view details

push time in 10 days

create branch NVIDIA/tensorflow-determinism

branch: doc

created branch time in 10 days

push event NVIDIA/tensorflow-determinism

Duncan Riach

commit sha 2785a7cdafec8058d4a5720cbaa50332cad3bf96

Fix typo in scripts README

view details

push time in 13 days

issue comment NVIDIA/tensorflow-determinism

Getting seq2seq to operate reproducibly

Hi @atebbifakhr, I looked into this more deeply. Removing tf.nn.sparse_softmax_cross_entropy_with_logits from the loss function only makes the gradients reproducible for the first step. They still go non-deterministic on the second step. The trainable variables actually go non-deterministic on the first step (somehow) regardless of whether tf.nn.sparse_softmax_cross_entropy_with_logits is in the loss function.

The fact that the gradients are deterministic for the first step but the trainable variables are not suggests that non-determinism is being introduced in the gradient update step. I hope to continue investigating soon.
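In case it helps anyone following along, here is a minimal sketch of the kind of probe I'm describing (the function name is illustrative, not part of any package):

import numpy as np

def summarize_trainable_variables(model):
  # Reduce all trainable variables to a single number; if this number matches
  # across runs with identical seeds, the variable updates are reproducible
  # up to that point in training.
  return sum(float(np.sum(v.numpy())) for v in model.trainable_variables)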

atebbifakhr

comment created time in 14 days

push event NVIDIA/tensorflow-determinism

Duncan Riach

commit sha 937a8befea930a3544abe23ae0e22084ce9e6f0e

(Really) note that PyTorch PR 33795 was merged prior to creation of PyTorch v1.5 branch

view details

push time in 17 days

push event NVIDIA/tensorflow-determinism

Duncan Riach

commit sha 5e49b680c9ee6aaef980cdc3f10ab1947432dd54

Note that PyTorch PR 33795 was merged prior to creation of PyTorch v1.5 branch

view details

push time in 17 days

push event NVIDIA/tensorflow-determinism

Duncan Riach

commit sha 1542279e529af8c38c703c78692b80b3791261e1

Note that TF PR 37377 has been merged prior to creation of TF v2.3 branch

view details

push time in 17 days

push event NVIDIA/tensorflow-determinism

Duncan Riach

commit sha 36cc00a78d315fcd74562a8c2ca8f5e2e5607a13

Confirm TensorFlow issue 33660 is resolved in TF v2.1.0

view details

push time in 17 days

issue closed tensorflow/tensorflow

tf.test.compute_gradient (v2) expects wrong empty gradient shape

System Information

  • Have I written custom code: YES (see below)
  • OS Platform and Distribution: Ubuntu 18.04.3 LTS (inside tensorflow/tensorflow:2.0.0-gpu)
  • Mobile device: N/A
  • TensorFlow installed from: binary (inside tensorflow/tensorflow:2.0.0-gpu)
  • TensorFlow version: 2.0.0
  • Python version: 2.7.15+
  • Bazel version (if compiling from source): N/A
  • GCC/Compiler version (if compiling from source): N/A
  • CUDA/cuDNN version: 10.0
  • GPU model and memory: TITANV 12GB

Current Behavior

When the gradient tensors are totally empty (i.e. all their dimensions are zero), and more than one input is considered at the same time, the shape of an expected gradient tensor does not match the actual shape produced by the op's gradient function. This case is not currently tested in tensorflow/python/ops/gradient_checker_v2_test.py (see testEmptyMatMul).

UPDATE: I also tried the configuration in testEmptyMatMul (see additional code below) and found that this also tickles the apparent bug. Surprisingly, //tensorflow/python:gradient_checker_v2_test passes when running on tag v2.0.0; I don't currently understand why.

I have tested

  • with two different ops, each with two inputs (both fail).
  • cases where not all of the gradient tensors are empty (as in testEmptyMatMul).
  • tf.compat.v1.test.compute_gradient with more than one empty input does not exhibit this behavior, so it was probably not present in tensorflow/python/ops/gradient_checker.py but was introduced in tensorflow/python/ops/gradient_checker_v2.py. See code at the end, in the "TensorFlow Version 1 API" section, that demonstrates this.

I have not tested

  • using ops with more than two inputs

Expected Behavior

These exceptions should not be produced when using built-in ops. Either the ops' gradient functions are producing wrongly shaped tensors (which would be a bug), or, more likely, the code in gradient_checker_v2.py is expecting the wrong shape.

Code to reproduce the issue

import numpy as np
import tensorflow as tf

def empty(rank):
  shape = (0,) * rank
  return np.array([]).reshape(shape)

# comment out the first call to run the second (each raises an exception)
tf.test.compute_gradient(tf.nn.bias_add, [empty(3), empty(1)])
tf.test.compute_gradient(tf.linalg.matmul, [empty(2), empty(3)])

The following is essentially the same as the code in testEmptyMatMul in tensorflow/python/ops/gradient_checker_v2_test.py:

import numpy as np
import tensorflow as tf

def random_tensor(shape):
  return tf.constant(np.random.random_sample(shape))

def f(x, y):
  return tf.linalg.matmul(x, y)

x = random_tensor((0, 3))
y = random_tensor((3, 4))

jacobians = tf.test.compute_gradient(f, [x, y])

Other info / logs

Output from the above code:

Traceback (most recent call last):
  File "repro_eager_issue.py", line 24, in <module>
    tf.test.compute_gradient(tf.nn.bias_add, [empty(3), empty(1)])
  File "/usr/local/lib/python2.7/dist-packages/tensorflow_core/python/ops/gradient_checker_v2.py", line 332, in compute_gradient
    return _compute_gradient_list(f, x, delta)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow_core/python/ops/gradient_checker_v2.py", line 293, in _compute_gradient_list
    xs, i, delta) for i in range(len(xs))])
  File "/usr/local/lib/python2.7/dist-packages/tensorflow_core/python/ops/gradient_checker_v2.py", line 278, in _compute_gradient
    xs, param)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow_core/python/ops/gradient_checker_v2.py", line 187, in _compute_theoretical_jacobian
    (x.shape, grad.shape))
ValueError: Empty gradient has wrong shape: expected (0,), got (0, 0, 0)

Traceback (most recent call last):
  File "repro_eager_issue.py", line 25, in <module>
    tf.test.compute_gradient(tf.linalg.matmul, [empty(2), empty(3)])
  File "/usr/local/lib/python2.7/dist-packages/tensorflow_core/python/ops/gradient_checker_v2.py", line 332, in compute_gradient
    return _compute_gradient_list(f, x, delta)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow_core/python/ops/gradient_checker_v2.py", line 293, in _compute_gradient_list
    xs, i, delta) for i in range(len(xs))])
  File "/usr/local/lib/python2.7/dist-packages/tensorflow_core/python/ops/gradient_checker_v2.py", line 278, in _compute_gradient
    xs, param)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow_core/python/ops/gradient_checker_v2.py", line 187, in _compute_theoretical_jacobian
    (x.shape, grad.shape))
ValueError: Empty gradient has wrong shape: expected (0, 0, 0), got (0, 0)

Traceback (most recent call last):
  File "tf_issue_33660.py", line 77, in <module>
    existing_test_repro()
  File "tf_issue_33660.py", line 71, in existing_test_repro
    jacobians = tf.test.compute_gradient(f, [x, y])
  File "/usr/local/lib/python2.7/dist-packages/tensorflow_core/python/ops/gradient_checker_v2.py", line 332, in compute_gradient
    return _compute_gradient_list(f, x, delta)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow_core/python/ops/gradient_checker_v2.py", line 293, in _compute_gradient_list
    xs, i, delta) for i in range(len(xs))])
  File "/usr/local/lib/python2.7/dist-packages/tensorflow_core/python/ops/gradient_checker_v2.py", line 278, in _compute_gradient
    xs, param)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow_core/python/ops/gradient_checker_v2.py", line 187, in _compute_theoretical_jacobian
    (x.shape, grad.shape))
ValueError: Empty gradient has wrong shape: expected (3, 4), got (0, 3)

Work-Around

compute_gradient can be called multiple times, once for each input, to get the analytical and numerical Jacobians for each input separately, as follows:

import numpy as np
import tensorflow as tf

def empty(rank):
  shape = (0,) * rank
  return np.array([]).reshape(shape)

input_val = empty(3)
bias_val = empty(1)

def bias_add_1(input_val):
  return tf.nn.bias_add(input_val, bias_val)

def bias_add_2(bias_val):
  return tf.nn.bias_add(input_val, bias_val)

input_jacobians = tf.test.compute_gradient(bias_add_1, [input_val])
bias_jacobians = tf.test.compute_gradient(bias_add_2, [bias_val])

TensorFlow Version 1 API

The following code does not throw an exception, demonstrating that this problem does not exist with tf.compat.v1.test.compute_gradient.

import numpy as np
import tensorflow as tf

def empty_tensor(shape):
  return tf.constant([], shape=shape)

tf.compat.v1.disable_eager_execution()
input_shape = output_shape = (0, 0, 0)
bias_shape = (0,)
input_tensor = empty_tensor(input_shape)
bias_tensor = empty_tensor(bias_shape)
output_tensor = tf.nn.bias_add(input_tensor, bias_tensor)
with tf.compat.v1.Session() as sess:
  jacobians = tf.compat.v1.test.compute_gradient(
      [input_tensor, bias_tensor], [input_shape, bias_shape], output_tensor,
      output_shape)

closed time in 17 days

duncanriach

issue comment tensorflow/tensorflow

tf.test.compute_gradient (v2) expects wrong empty gradient shape

Thank you, @gadagashwini. I can confirm that this issue is resolved in TensorFlow version 2.1.0. Closing.

duncanriach

comment created time in 17 days

issue comment tensorflow/tensorflow

CUDA broken on docker image tensorflow/tensorflow:devel-gpu

I was about to open a new issue, and then GitHub figured out exactly what I was looking for and brought me to this issue. I seem to be dealing with a slightly different flavor of the same problem, with the same root cause. Here is my issue description:

Title: TensorFlow will not run on GPU in devel-gpu container (because of CUDA stubs)

Reproduction Instructions

$ docker run -it -w /tensorflow_src tensorflow/tensorflow:devel-gpu
# bazel test --config=cuda -c opt --test_output=all --verbose_failures -- //tensorflow/python/kernel_tests:bias_op_deterministic_test

Other Information

The source of the problem is that LD_LIBRARY_PATH includes /usr/local/cuda/lib64/stubs:

# echo $LD_LIBRARY_PATH
/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/lib64:/usr/local/cuda/lib64/stubs:/usr/include/x64_64-linux-gnu:/usr/local/nvidia/lib:/usr/local/nvidia/lib64

Removing this resolves the problem:

# LD_LIBRARY_PATH="${LD_LIBRARY_PATH//\/usr\/local\/cuda\/lib64\/stubs:}"
# echo $LD_LIBRARY_PATH
/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/lib64:/usr/include/x64_64-linux-gnu:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
# bazel test --config=cuda -c opt --test_output=all --verbose_failures -- //tensorflow/python/kernel_tests:bias_op_deterministic_test

I understand that /usr/local/cuda/lib64/stubs is used to enable the TensorFlow wheel to be built in the absence of a real CUDA driver, and I assume that the devel-gpu container is used for this purpose. However, this makes it harder and more confusing to use the devel-gpu container for development.

Apparently, there are preferable ways to allow the build to successfully complete in the absence of a real CUDA driver, and these ways do not require the presence of /usr/local/cuda/lib64/stubs in LD_LIBRARY_PATH. Four different approaches are discussed here.

My intentions in filing this issue are:

  1. to provide thorough documentation to help others struggling with it,
  2. to organize a discussion about how to resolve this (apparently recurring) issue, and
  3. to drive resolution.
lezh

comment created time in a month

Pull request review comment tensorflow/tensorflow

[XLA] follow-up on GPU-deterministic reductions

 Status GpuCompiler::PrepareHloModuleForIrEmitting(HloModule* hlo_module) {
 // TODO(cheshire): Duplication with gpu_conv_algorithm picker, figure out a
 // right way to share this.
 static bool RequireDeterminism() {
-  bool deterministic_ops = false;
-  TF_CHECK_OK(tensorflow::ReadBoolFromEnvVar("TF_DETERMINISTIC_OPS",
-                                             /*default_val=*/false,
-                                             &deterministic_ops));
-  return deterministic_ops;
+  static bool require_determinism = [] {

They have functionality that is different enough that it might not be worth trying to find the common factor, especially since they're intended to be temporary. If they were combined and placed in another source file, does one that already exists stand out to you as a good candidate, or would a new one make more sense?
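For context, the pattern under discussion reads the environment variable once and caches the result rather than re-reading it on every call. A rough Python analogue (illustrative only; the actual change is in C++):

import functools
import os

@functools.lru_cache(maxsize=None)
def require_determinism():
  # Read TF_DETERMINISTIC_OPS once; later calls return the cached result.
  return os.getenv("TF_DETERMINISTIC_OPS", "0").lower() in ("1", "true")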

duncanriach

comment created time in a month

push event NVIDIA/tensorflow-determinism

Duncan Riach

commit sha a5820b50bb9470d7c64ee819095d38d450b75c3a

Add reference to stock TF PR 37377

view details

push time in a month

PR opened tensorflow/tensorflow

[XLA] follow-up on deterministic XLA reductions

Deterministic tf.nn.bias_add (enabled with TF_DETERMINISTIC_OPS) and its testing were introduced via PR 31465. Operation on XLA:GPU was found to be non-deterministic at that time.

In response to a conversation on PR 34887 (Add info about TF_DETERMINISTIC_OPS to version 2.1 release notes), @cheshire committed a change that implemented deterministic reduction functionality when using XLA:GPU. He then committed another change that caused this functionality to be enabled by TF_DETERMINISTIC_OPS. Both of these changes are in the r2.2 branch.

This pull request is a follow-up to that conversation. It does two things:

  1. Enable the deterministic testing for tf.nn.bias_add to run on XLA:GPU.
  2. Modify the way that tensorflow/compiler/xla/service/gpu/gpu_compiler.cc listens to TF_DETERMINISTIC_OPS so that it's the same as all other uses (it caches the value).

It would be ideal to cherry-pick these changes into the 2.2 branch so that they can accompany the rest of this feature implementation.

+8 -6

0 comments

2 changed files

pr created time in a month

create branch duncanriach/tensorflow

branch: deterministic-xla-reductions-follow-up

created branch time in a month

push event duncanriach/tensorflow

Dayananda-V

commit sha 640edaeef6c3a7ae8b84381be74665b3e5b26981

TfLite one_hot int8 and unit8 feature support 1- int8 and uint8 data type support change 2- supported data type test coverage added

view details

Guy David

commit sha f132386f03d42f54b8e5e0ac6cb011f5a1d2d4a2

Verify trivial operator has no activation function

view details

MoussaMM

commit sha 1711b76f95b49dcf597fe5b2ec5f4ff79ddbc7a7

check if boxes have logical coordinates (different than a line or a point)

view details

Khor Chean Wei

commit sha 4c9a8d78628fe2d087d4f8ef10a2a87ba1e78f20

Change the constructor in class Interpreter

view details

Gaurav Singh

commit sha 15a09654f47514606b14a702ba3875d98575210c

[core] check index channel before accessing center_frequencies_

view details

Tom Carchrae

commit sha a8c6704074385dc2e4dd3f8282a0df86606a4dc5

git is required by several tutorial examples

view details

Artem Mavrin

commit sha 198cba7bca0caf901145c6735f1ca119f8263fbb

Removed NDIMS template arg from BinaryElementWiseOp::Operate

view details

Artem Mavrin

commit sha 108d1e9202701fd5a9f24e0a5ed6e849e3ec5c9b

Fixed undeclared identifier 'alpha'

view details

Tom Carchrae

commit sha 3b29c0b1eaf83466ca2a31daf2a807fa3dd4a3d6

update generated files

view details

Mitchell Vitez

commit sha cdf921fe6ec11ff288af43b8c2faa74ad02d482f

Fix typos in the base_layer input casting warning

view details

Tamas Bela Feher

commit sha dfc227d6e2a06655cef8661fb883150aafe4f1e2

Add TRT profile generation mode

view details

Tamas Bela Feher

commit sha 7f7fcf750897d9dcf4189d4031f3824388a40ba9

Run profile generation mode if needed

view details

MoussaMM

commit sha 50647fa8bdff73e4d79ea47e98ee36b41ee23721

Correcting code style problems

view details

Frederic Bastien

commit sha c7ce71f168bb2be59ec7a22117cecb8466872960

[XLA] Fix a latent bug. Currently I think the code is fine, but the function doesn't do what its name say.

view details

jojimonv

commit sha 1fda61cf360d25156dae971eb9fe63544ed0efc1

Changes for DNNL1.x fused batch norm

view details

Frederic Bastien

commit sha c989e34dbc67979d765539b004744a22ad258277

ReductionTiling per SM.

view details

Frederic Bastien

commit sha c9878bb9124240060bde0aa2c4fa4f88de5138f3

Reduction Tiling dependant of dtype.

view details

Xiaoquan Kong

commit sha 4643aa057de812175473a4be53fa9d2cd173d3b7

docfix: fix docstring error in KerasHistory

view details

lyonguyen8697

commit sha b192692e47c44be57ed5bbc40dfcc2d05dcd2501

Add sorted builtin support for autograph

view details

lyonguyen8697

commit sha 3e5a96f7ed3ca77d4cb813a8d1fcf35eed18de5c

Add test case for sorted builtin support in autograph

view details

push time in a month

pull request comment pytorch/pytorch

Enhance reproducibility documentation

Thanks @ngimel. I've updated and rebased this. Hopefully it's now ready to be merged.

duncanriach

comment created time in a month

push event duncanriach/pytorch

Jerry Zhang

commit sha a13ee1898207dbbd4e89ddd300707f087a84d12c

[quant][graphmode] refactor nodeQuantizable (#33171) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33171 For better code reuse Test Plan: . Imported from OSS Differential Revision: D20087845 fbshipit-source-id: f88cffb410bd54a1b3f937786104f46bcd1190d3

view details

Jerry Zhang

commit sha 86673791335569fe851c49202b715767f8dc302f

[quant][graphmode][refactor] Factor out insertDequantCall (#33172) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33172 For code reuse Test Plan: . Imported from OSS Differential Revision: D20087842 fbshipit-source-id: 797868d31b96c4ff8640121ea4bee1396deb6b57

view details

Haixin Liu

commit sha 038ee01393734575f26eb225ea03461170800232

Disable printing of the histogram when dump (#33749) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33749 Disable printing of the histogram when dump to make the log cleaner. Test Plan: CI Reviewed By: amylittleyang Differential Revision: D20087735 fbshipit-source-id: 5421cd9d25c340d92f29ce63fed2a58aefef567d

view details

joerg-de

commit sha 5bac7febad036f8a9e124036f6c2509a540dd588

removed padding and dilation from LPPool2d Doc (#33714) Summary: removed padding and dilation from LPPool2d Doc as the function dose not support padding and dilation Pull Request resolved: https://github.com/pytorch/pytorch/pull/33714 Differential Revision: D20097021 Pulled By: ngimel fbshipit-source-id: fc1c2d918b32f4b45c7e6e6bd93f018e867a628f

view details

Stas Bekman

commit sha 9a5ea713804a18c927baeed4f6f93dce753166ae

pad_packed_sequence: doc improvement (#33768) Summary: pad_packed_sequence: 1. clarify that batch's order is restored to the original one 2. add example This is a follow up to https://github.com/pytorch/pytorch/issues/33746 Pull Request resolved: https://github.com/pytorch/pytorch/pull/33768 Differential Revision: D20102792 Pulled By: ngimel fbshipit-source-id: 5ef511e5e3833edcb85cc01af0e92568b6d7a3cf

view details

Emilio Castillo

commit sha a836c4ca78b72ecc8e0664e1b684af64ce83be42

Skip manual backward for `cdist` with case `p=2` (#31167) Summary: Fixes an issue with `cdist` backward calculation for large inputs for the euclidean case. The grid size when launching the kernel exceeded the 2^16 limit for the second dimension, resulting in `RuntimeError: CUDA error: invalid configuration argument` Code to reproduce: ``` h, w, d = 800, 1216, 12 n = 133 A = torch.randn(n, d).cuda() B = torch.randn(h, w, d).cuda() A.requires_grad = True B.requires_grad = True B = B.reshape(-1, d).contiguous() dist = torch.cdist(A, B) loss = dist.sum() loss.backward() ``` Thanks to tkerola for the bug report, reproduction and suggesting a solution. Pull Request resolved: https://github.com/pytorch/pytorch/pull/31167 Differential Revision: D20035605 Pulled By: ngimel fbshipit-source-id: ae28ba4b549ee07a8bd937bb1de2438dc24eaa17

view details

Michael Carilli

commit sha fc6a1536889cec61fd9e9ebad71e044ec222a1be

[WIP] Reanimate gradient scaling API with original scale update heuristic (#33366) Summary: Also, windows memory failures responsible for the earlier reversion have been fixed. This PR (initially) contains 2 commits: * a revert of the revert * all changes to implement the original Apex scale update heuristic, squashed into a single commit for easier diff review Pull Request resolved: https://github.com/pytorch/pytorch/pull/33366 Differential Revision: D20099026 Pulled By: ngimel fbshipit-source-id: 339b9b6bd5134bf055057492cd1eedb7e4461529

view details

Martin Yuan

commit sha 758ad516f32708de243f194144ee0f7b9e0f5117

[Lite interpreter] Pass shared_ptr properly (#33667) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33667 Pass shared_ptr properly according to C++ guidances. Thank kimishpatel for pointing it out. Test Plan: Imported from OSS Differential Revision: D20111001 Pulled By: iseeyuan fbshipit-source-id: 213a0f950a7f3b9199d789dc0155911f6102d77a

view details

Ahmad Salim Al-Sibahi

commit sha 24659d28a1c92fed181f4971ec6d2883fbc06046

Feature/vonmises upstream (#33418) Summary: Third try of https://github.com/pytorch/pytorch/issues/33177 😄 Pull Request resolved: https://github.com/pytorch/pytorch/pull/33418 Differential Revision: D20069683 Pulled By: ezyang fbshipit-source-id: f58e45e91b672bfde2e41a4480215ba4c613f9de

view details

Xiao Wang

commit sha c1dd70688a28d2bc91ac1e10dcda62d4c7bbebce

Fix deprecated python "add" calls (#33428) Summary: This PR fixed those python "add" calls using deprecated signature `add(Scalar, Tensor)`. The alternative signature `add(Tensor, alpha = Scalar)` is used. cc csarofeen zasdfgbnm ptrblck ngimel Pull Request resolved: https://github.com/pytorch/pytorch/pull/33428 Differential Revision: D20002534 Pulled By: vincentqb fbshipit-source-id: 81f2dd6170a47a9b53a17e5817c26e70d8afa130

view details

Hong Xu

commit sha f87b0b2515bc19bc2df33d2a7a3b37695f73ea08

Remove the use of macros in defining binary ops for base Vec256 (#33733) Summary: This greatly improves readability and maintainability (e.g., debugging) Pull Request resolved: https://github.com/pytorch/pytorch/pull/33733 Differential Revision: D20103187 Pulled By: ezyang fbshipit-source-id: e539e46f5d378a2b01da7ecaa6b850655e0fa866

view details

Peter Bell

commit sha 2eb95d8f4a1717cc93b1c63a02d1d51117bfde88

Migrate `fmod` and `fmod_` from TH to ATen (CPU) (#33592) Summary: Closes https://github.com/pytorch/pytorch/issues/24701 Pull Request resolved: https://github.com/pytorch/pytorch/pull/33592 Differential Revision: D20043875 Pulled By: ezyang fbshipit-source-id: b8c0a4e73a3cef6e55e91bbd35f8aadca8114c56

view details

Daya Khudia

commit sha a8e7ed48f469e561d89f724fcbf290eeb3967eb5

[pt][quant] Parallelize quantize and dequantize (#33765) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33765 quantize and dequantize methods now make use of multiple threads. This makes use of shz0116's recent parallelization of quantize/dequantize routines in FBGEMM. Fixes: https://github.com/pytorch/pytorch/issues/32006 https://github.com/pytorch/FBGEMM/issues/142 Alternative to https://github.com/pytorch/pytorch/pull/30153 ``` #!/usr/bin/env python import time import torch import torch.nn as nn torch.set_num_threads(4) # print(torch.__config__.parallel_info()) W = torch.rand(1, 54, 54, 256) NITER = 1000 s = time.time() for i in range(NITER): W_q = torch.quantize_per_tensor(W, scale=1.0, zero_point = 0, dtype=torch.quint8) time_per_iter = (time.time() - s) / NITER print('quantize time per iter ms', time_per_iter * 1000) s = time.time() for i in range(NITER): W_deq = W_q.dequantize() time_per_iter = (time.time() - s) / NITER print('dequantize time per iter ms', time_per_iter * 1000) ``` ### With 1 thread quantize time per iter ms 0.22633790969848633 dequantize time per iter ms 0.6573665142059326 ### With 4 threads quantize time per iter ms 0.0905618667602539 dequantize time per iter ms 0.19511842727661133 ghstack-source-id: 98935895 Test Plan: python test/test_quantized.py Reviewed By: jspark1105 Differential Revision: D20098521 fbshipit-source-id: bd8c45761b4651fcd5b20b95759e3868a136c048

view details

Michela Paganini

commit sha b8f0acf50f97f130a637592dc9dccbfef204c42b

Fix examples with updated pruning naming convention (#33144) Summary: Fix in docs requested by vainaijr. Closes issue https://github.com/pytorch/pytorch/issues/32991 Pull Request resolved: https://github.com/pytorch/pytorch/pull/33144 Differential Revision: D20104640 Pulled By: albanD fbshipit-source-id: 9b1be2c1cbde1964967967a9581bb6932a305d81

view details

Barak Nehoran

commit sha f597ac6efc70431e66d945c16fa12b767989b032

Fix grid_sample gradients at image borders (#32829) Summary: Fixes https://github.com/pytorch/pytorch/issues/23925 This fixes the incorrect gradients returned by `F.grid_sample` at image borders under `"border"` and `"reflection"` padding modes. At nondifferentiable points, the choice of which gradient to return among its super- or subgradients is rather arbitrary and generally does not affect training. Before this change, however, a bug in the code meant that the gradient returned at the exact borders was not selected from among the super- or subgradients. The gradient is now set to zero at the borders, which is a defensible choice for both the `"border"` and `"reflection"` padding modes: * For `"border"` padding, this effectively means that the exact borders of the image are now considered out of bounds, and therefore receive zero gradient. * For `"reflection"` padding, this effectively treats the exact borders as extrema. Pull Request resolved: https://github.com/pytorch/pytorch/pull/32829 Differential Revision: D20118564 Pulled By: soumith fbshipit-source-id: ef8571ff585be35ab1b90a922af299f53ab9c095

view details

Gao, Xiang

commit sha 45e4b614d1c8b529967b7f039677853d73911076

Per channel quantization performance improvement (#33772) Summary: Benchmark: NVIDIA GTX 1650 + AMD Ryzen Threadripper 3970X ```python import torch print(torch.__version__) for i in range(1000): torch.randn(1024 * 128, device='cuda') def cuda(e): a = torch.randn(2 ** e, 32, device='cuda') s = torch.randn(32, device='cuda') z = torch.randn(32, device='cuda') torch.cuda.synchronize() %timeit torch.fake_quantize_per_channel_affine(a, s, z, 1, -999, 999); torch.cuda.synchronize() def cpu(e): a = torch.randn(2 ** e, 32, device='cpu') s = torch.randn(32, device='cpu') z = torch.randn(32, device='cpu') %timeit torch.fake_quantize_per_channel_affine(a, s, z, 1, -999, 999); for i in range(10, 24): cuda(i) print() for i in range(10, 32): cpu(i) ``` Before ``` 1.5.0a0+9bc922d 849 µs ± 44.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) 817 µs ± 30.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) 814 µs ± 2.93 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) 1.11 ms ± 1.32 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) 1.19 ms ± 4.19 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) 1.6 ms ± 5.58 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) 2.44 ms ± 14.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) 4.14 ms ± 2.55 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) 7.41 ms ± 2.46 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) 13.9 ms ± 2.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) 26.9 ms ± 254 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) 52.6 ms ± 260 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) 104 ms ± 176 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) 207 ms ± 1.24 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) 249 µs ± 158 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each) 420 µs ± 230 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each) 766 µs ± 391 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each) 1.45 ms ± 574 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each) 2.84 ms ± 34.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) 5.69 ms ± 83 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) 7.29 ms ± 2.58 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) 7.32 ms ± 13.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) 17.4 ms ± 38.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) 47.5 ms ± 264 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) 187 ms ± 1.19 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) 379 ms ± 5.05 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) 652 ms ± 11.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) 1.22 s ± 4.58 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) 2.34 s ± 8.77 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) 4.56 s ± 7.15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) 8.97 s ± 33.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) 17.8 s ± 32.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) 35.2 s ± 167 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) ``` After ``` 1.5.0a0+a7ec8cc 92.5 µs ± 2.03 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each) 97.7 µs ± 469 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 109 µs ± 4.73 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each) 119 µs ± 6.17 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each) 146 µs ± 1.84 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each) 211 µs ± 2.45 µs per loop (mean ± std. dev. 
of 7 runs, 1000 loops each) 347 µs ± 4.18 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) 624 µs ± 14.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) 1.17 ms ± 16.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) 2.25 ms ± 48.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) 4.43 ms ± 220 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) 8.51 ms ± 44.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) 16.9 ms ± 30.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) 33.7 ms ± 7.64 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) 201 µs ± 234 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each) 285 µs ± 465 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each) 287 µs ± 214 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each) 287 µs ± 221 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each) 287 µs ± 761 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each) 347 µs ± 399 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each) 675 µs ± 213 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each) 1.34 ms ± 643 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each) 4.82 ms ± 34.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) 10.7 ms ± 88.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) 20.3 ms ± 25.6 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) 39.4 ms ± 242 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) 78.8 ms ± 2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) 153 ms ± 786 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) 285 ms ± 911 µs per loop (mean ± std. dev. of 7 runs, 1 loop each) 541 ms ± 1.09 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) 1.03 s ± 1.67 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) 1.97 s ± 8.59 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) 3.81 s ± 10.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) ``` Fixes https://github.com/pytorch/pytorch/issues/33647 Pull Request resolved: https://github.com/pytorch/pytorch/pull/33772 Differential Revision: D20112531 Pulled By: ngimel fbshipit-source-id: f90e3ef1b5be8276851637f3e1251cb8f1af411f

view details

Eli Uriegas

commit sha 93e30c16cb4ae3723e550daf522a3b6cf19f6b4e

.circleci: Switch to using robot token for conda uploads (#33786) Summary: Thanks to pjh5 for continued use of his account to upload binaries but I think we can start using a bot account now for this. Just a draft until we can ensure the env variables get injected correctly and the token can actually upload Signed-off-by: Eli Uriegas <eliuriegas@fb.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/33786 Differential Revision: D20122423 Pulled By: seemethere fbshipit-source-id: 0444584831a40ae730325d258935f6d1b873961b

view details

Wojciech Baranowski

commit sha 8aa09de19e125f3ea165e0030dccc86c21583c69

build: set -DNDEBUG in Release (#32719) Summary: This might lead to silent undefined behaviour (e.g. with out-of-bound indices). This affects `test_multinomial_invalid_probs_cuda` which is now removed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/32719 Test Plan: * Build with VERBOSE=1 and manually inspect `less ndebug.build.log | grep 'c++' | grep -v -- -DNDEBUG` (only with nina on Linux) * CI Fixes https://github.com/pytorch/pytorch/issues/22745 Differential Revision: D20104340 Pulled By: yf225 fbshipit-source-id: 2ebfd7ddae632258a36316999eeb5c968fb7642c

view details

Will Feng

commit sha 5c33d98b0d0bf53ca8faa5f6f53b33462f75b72a

Add assert_tensor_equal and assert_tensor_not_equal to test/cpp/api/support.h (#30426) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/30426 This PR adds `assert_tensor_equal` and `assert_tensor_not_equal` to `test/cpp/api/support.h`, as better functions for testing whether two tensors are equal / not equal. Test Plan: Imported from OSS Differential Revision: D18695900 Pulled By: yf225 fbshipit-source-id: c19b9bc4c4e84d9f444015023649d27618fcbdf5

view details

Jerry Zhang

commit sha c32fa465a556a48958d9e226403dafaf5964db8a

Preserve Backward compatibility of models serialized before #31040 (#33796) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33796 Test Plan: Imported from OSS Differential Revision: D20109662 Pulled By: jerryzh168 fbshipit-source-id: 9bc936a59fd6dd1031fbf05eb90f98ae9677b936

view details

push time in a month

push event duncanriach/pytorch

Jerry Zhang

commit sha a13ee1898207dbbd4e89ddd300707f087a84d12c

[quant][graphmode] refactor nodeQuantizable (#33171) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33171 For better code reuse Test Plan: . Imported from OSS Differential Revision: D20087845 fbshipit-source-id: f88cffb410bd54a1b3f937786104f46bcd1190d3

view details

Jerry Zhang

commit sha 86673791335569fe851c49202b715767f8dc302f

[quant][graphmode][refactor] Factor out insertDequantCall (#33172) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33172 For code reuse Test Plan: . Imported from OSS Differential Revision: D20087842 fbshipit-source-id: 797868d31b96c4ff8640121ea4bee1396deb6b57

view details

Haixin Liu

commit sha 038ee01393734575f26eb225ea03461170800232

Disable printing of the histogram when dump (#33749) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33749 Disable printing of the histogram when dump to make the log cleaner. Test Plan: CI Reviewed By: amylittleyang Differential Revision: D20087735 fbshipit-source-id: 5421cd9d25c340d92f29ce63fed2a58aefef567d

view details

joerg-de

commit sha 5bac7febad036f8a9e124036f6c2509a540dd588

removed padding and dilation from LPPool2d Doc (#33714) Summary: removed padding and dilation from LPPool2d Doc as the function dose not support padding and dilation Pull Request resolved: https://github.com/pytorch/pytorch/pull/33714 Differential Revision: D20097021 Pulled By: ngimel fbshipit-source-id: fc1c2d918b32f4b45c7e6e6bd93f018e867a628f

view details

Stas Bekman

commit sha 9a5ea713804a18c927baeed4f6f93dce753166ae

pad_packed_sequence: doc improvement (#33768) Summary: pad_packed_sequence: 1. clarify that batch's order is restored to the original one 2. add example This is a follow up to https://github.com/pytorch/pytorch/issues/33746 Pull Request resolved: https://github.com/pytorch/pytorch/pull/33768 Differential Revision: D20102792 Pulled By: ngimel fbshipit-source-id: 5ef511e5e3833edcb85cc01af0e92568b6d7a3cf

view details

Emilio Castillo

commit sha a836c4ca78b72ecc8e0664e1b684af64ce83be42

Skip manual backward for `cdist` with case `p=2` (#31167) Summary: Fixes an issue with `cdist` backward calculation for large inputs for the euclidean case. The grid size when launching the kernel exceeded the 2^16 limit for the second dimension, resulting in `RuntimeError: CUDA error: invalid configuration argument` Code to reproduce: ``` h, w, d = 800, 1216, 12 n = 133 A = torch.randn(n, d).cuda() B = torch.randn(h, w, d).cuda() A.requires_grad = True B.requires_grad = True B = B.reshape(-1, d).contiguous() dist = torch.cdist(A, B) loss = dist.sum() loss.backward() ``` Thanks to tkerola for the bug report, reproduction and suggesting a solution. Pull Request resolved: https://github.com/pytorch/pytorch/pull/31167 Differential Revision: D20035605 Pulled By: ngimel fbshipit-source-id: ae28ba4b549ee07a8bd937bb1de2438dc24eaa17

view details

Michael Carilli

commit sha fc6a1536889cec61fd9e9ebad71e044ec222a1be

[WIP] Reanimate gradient scaling API with original scale update heuristic (#33366) Summary: Also, windows memory failures responsible for the earlier reversion have been fixed. This PR (initially) contains 2 commits: * a revert of the revert * all changes to implement the original Apex scale update heuristic, squashed into a single commit for easier diff review Pull Request resolved: https://github.com/pytorch/pytorch/pull/33366 Differential Revision: D20099026 Pulled By: ngimel fbshipit-source-id: 339b9b6bd5134bf055057492cd1eedb7e4461529

view details

Martin Yuan

commit sha 758ad516f32708de243f194144ee0f7b9e0f5117

[Lite interpreter] Pass shared_ptr properly (#33667) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33667 Pass shared_ptr properly according to C++ guidances. Thank kimishpatel for pointing it out. Test Plan: Imported from OSS Differential Revision: D20111001 Pulled By: iseeyuan fbshipit-source-id: 213a0f950a7f3b9199d789dc0155911f6102d77a

view details

Ahmad Salim Al-Sibahi

commit sha 24659d28a1c92fed181f4971ec6d2883fbc06046

Feature/vonmises upstream (#33418) Summary: Third try of https://github.com/pytorch/pytorch/issues/33177 😄 Pull Request resolved: https://github.com/pytorch/pytorch/pull/33418 Differential Revision: D20069683 Pulled By: ezyang fbshipit-source-id: f58e45e91b672bfde2e41a4480215ba4c613f9de

view details

Xiao Wang

commit sha c1dd70688a28d2bc91ac1e10dcda62d4c7bbebce

Fix deprecated python "add" calls (#33428) Summary: This PR fixed those python "add" calls using deprecated signature `add(Scalar, Tensor)`. The alternative signature `add(Tensor, alpha = Scalar)` is used. cc csarofeen zasdfgbnm ptrblck ngimel Pull Request resolved: https://github.com/pytorch/pytorch/pull/33428 Differential Revision: D20002534 Pulled By: vincentqb fbshipit-source-id: 81f2dd6170a47a9b53a17e5817c26e70d8afa130

view details

Hong Xu

commit sha f87b0b2515bc19bc2df33d2a7a3b37695f73ea08

Remove the use of macros in defining binary ops for base Vec256 (#33733) Summary: This greatly improves readability and maintainability (e.g., debugging) Pull Request resolved: https://github.com/pytorch/pytorch/pull/33733 Differential Revision: D20103187 Pulled By: ezyang fbshipit-source-id: e539e46f5d378a2b01da7ecaa6b850655e0fa866

view details

Peter Bell

commit sha 2eb95d8f4a1717cc93b1c63a02d1d51117bfde88

Migrate `fmod` and `fmod_` from TH to ATen (CPU) (#33592) Summary: Closes https://github.com/pytorch/pytorch/issues/24701 Pull Request resolved: https://github.com/pytorch/pytorch/pull/33592 Differential Revision: D20043875 Pulled By: ezyang fbshipit-source-id: b8c0a4e73a3cef6e55e91bbd35f8aadca8114c56

view details

Daya Khudia

commit sha a8e7ed48f469e561d89f724fcbf290eeb3967eb5

[pt][quant] Parallelize quantize and dequantize (#33765) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33765 quantize and dequantize methods now make use of multiple threads. This makes use of shz0116's recent parallelization of quantize/dequantize routines in FBGEMM. Fixes: https://github.com/pytorch/pytorch/issues/32006 https://github.com/pytorch/FBGEMM/issues/142 Alternative to https://github.com/pytorch/pytorch/pull/30153
```
#!/usr/bin/env python
import time
import torch
import torch.nn as nn

torch.set_num_threads(4)
# print(torch.__config__.parallel_info())

W = torch.rand(1, 54, 54, 256)
NITER = 1000

s = time.time()
for i in range(NITER):
    W_q = torch.quantize_per_tensor(W, scale=1.0, zero_point=0, dtype=torch.quint8)
time_per_iter = (time.time() - s) / NITER
print('quantize time per iter ms', time_per_iter * 1000)

s = time.time()
for i in range(NITER):
    W_deq = W_q.dequantize()
time_per_iter = (time.time() - s) / NITER
print('dequantize time per iter ms', time_per_iter * 1000)
```
### With 1 thread
quantize time per iter ms 0.22633790969848633
dequantize time per iter ms 0.6573665142059326
### With 4 threads
quantize time per iter ms 0.0905618667602539
dequantize time per iter ms 0.19511842727661133
ghstack-source-id: 98935895 Test Plan: python test/test_quantized.py Reviewed By: jspark1105 Differential Revision: D20098521 fbshipit-source-id: bd8c45761b4651fcd5b20b95759e3868a136c048

view details

Michela Paganini

commit sha b8f0acf50f97f130a637592dc9dccbfef204c42b

Fix examples with updated pruning naming convention (#33144) Summary: Fix in docs requested by vainaijr. Closes issue https://github.com/pytorch/pytorch/issues/32991 Pull Request resolved: https://github.com/pytorch/pytorch/pull/33144 Differential Revision: D20104640 Pulled By: albanD fbshipit-source-id: 9b1be2c1cbde1964967967a9581bb6932a305d81

view details

Barak Nehoran

commit sha f597ac6efc70431e66d945c16fa12b767989b032

Fix grid_sample gradients at image borders (#32829) Summary: Fixes https://github.com/pytorch/pytorch/issues/23925 This fixes the incorrect gradients returned by `F.grid_sample` at image borders under `"border"` and `"reflection"` padding modes. At nondifferentiable points, the choice of which gradient to return among its super- or subgradients is rather arbitrary and generally does not affect training. Before this change, however, a bug in the code meant that the gradient returned at the exact borders was not selected from among the super- or subgradients. The gradient is now set to zero at the borders, which is a defensible choice for both the `"border"` and `"reflection"` padding modes:
* For `"border"` padding, this effectively means that the exact borders of the image are now considered out of bounds, and therefore receive zero gradient.
* For `"reflection"` padding, this effectively treats the exact borders as extrema.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32829 Differential Revision: D20118564 Pulled By: soumith fbshipit-source-id: ef8571ff585be35ab1b90a922af299f53ab9c095

view details

Gao, Xiang

commit sha 45e4b614d1c8b529967b7f039677853d73911076

Per channel quantization performance improvement (#33772) Summary: Benchmark: NVIDIA GTX 1650 + AMD Ryzen Threadripper 3970X
```python
import torch
print(torch.__version__)

for i in range(1000):
    torch.randn(1024 * 128, device='cuda')

def cuda(e):
    a = torch.randn(2 ** e, 32, device='cuda')
    s = torch.randn(32, device='cuda')
    z = torch.randn(32, device='cuda')
    torch.cuda.synchronize()
    %timeit torch.fake_quantize_per_channel_affine(a, s, z, 1, -999, 999); torch.cuda.synchronize()

def cpu(e):
    a = torch.randn(2 ** e, 32, device='cpu')
    s = torch.randn(32, device='cpu')
    z = torch.randn(32, device='cpu')
    %timeit torch.fake_quantize_per_channel_affine(a, s, z, 1, -999, 999);

for i in range(10, 24):
    cuda(i)
print()
for i in range(10, 32):
    cpu(i)
```
Before
```
1.5.0a0+9bc922d
849 µs ± 44.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
817 µs ± 30.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
814 µs ± 2.93 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.11 ms ± 1.32 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.19 ms ± 4.19 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.6 ms ± 5.58 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
2.44 ms ± 14.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
4.14 ms ± 2.55 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
7.41 ms ± 2.46 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
13.9 ms ± 2.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
26.9 ms ± 254 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
52.6 ms ± 260 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
104 ms ± 176 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
207 ms ± 1.24 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
249 µs ± 158 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
420 µs ± 230 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
766 µs ± 391 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.45 ms ± 574 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
2.84 ms ± 34.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
5.69 ms ± 83 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
7.29 ms ± 2.58 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
7.32 ms ± 13.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
17.4 ms ± 38.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
47.5 ms ± 264 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
187 ms ± 1.19 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
379 ms ± 5.05 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
652 ms ± 11.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.22 s ± 4.58 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
2.34 s ± 8.77 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
4.56 s ± 7.15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
8.97 s ± 33.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
17.8 s ± 32.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
35.2 s ± 167 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
After
```
1.5.0a0+a7ec8cc
92.5 µs ± 2.03 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
97.7 µs ± 469 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
109 µs ± 4.73 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
119 µs ± 6.17 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
146 µs ± 1.84 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
211 µs ± 2.45 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
347 µs ± 4.18 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
624 µs ± 14.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.17 ms ± 16.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
2.25 ms ± 48.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
4.43 ms ± 220 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
8.51 ms ± 44.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
16.9 ms ± 30.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
33.7 ms ± 7.64 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
201 µs ± 234 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
285 µs ± 465 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
287 µs ± 214 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
287 µs ± 221 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
287 µs ± 761 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
347 µs ± 399 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
675 µs ± 213 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.34 ms ± 643 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
4.82 ms ± 34.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
10.7 ms ± 88.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
20.3 ms ± 25.6 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
39.4 ms ± 242 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
78.8 ms ± 2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
153 ms ± 786 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
285 ms ± 911 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
541 ms ± 1.09 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.03 s ± 1.67 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.97 s ± 8.59 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
3.81 s ± 10.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
Fixes https://github.com/pytorch/pytorch/issues/33647 Pull Request resolved: https://github.com/pytorch/pytorch/pull/33772 Differential Revision: D20112531 Pulled By: ngimel fbshipit-source-id: f90e3ef1b5be8276851637f3e1251cb8f1af411f

view details

Eli Uriegas

commit sha 93e30c16cb4ae3723e550daf522a3b6cf19f6b4e

.circleci: Switch to using robot token for conda uploads (#33786) Summary: Thanks to pjh5 for continued use of his account to upload binaries but I think we can start using a bot account now for this. Just a draft until we can ensure the env variables get injected correctly and the token can actually upload Signed-off-by: Eli Uriegas <eliuriegas@fb.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/33786 Differential Revision: D20122423 Pulled By: seemethere fbshipit-source-id: 0444584831a40ae730325d258935f6d1b873961b

view details

Wojciech Baranowski

commit sha 8aa09de19e125f3ea165e0030dccc86c21583c69

build: set -DNDEBUG in Release (#32719) Summary: This might lead to silent undefined behaviour (e.g. with out-of-bound indices). This affects `test_multinomial_invalid_probs_cuda` which is now removed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/32719 Test Plan: * Build with VERBOSE=1 and manually inspect `less ndebug.build.log | grep 'c++' | grep -v -- -DNDEBUG` (only with nina on Linux) * CI Fixes https://github.com/pytorch/pytorch/issues/22745 Differential Revision: D20104340 Pulled By: yf225 fbshipit-source-id: 2ebfd7ddae632258a36316999eeb5c968fb7642c

view details

Will Feng

commit sha 5c33d98b0d0bf53ca8faa5f6f53b33462f75b72a

Add assert_tensor_equal and assert_tensor_not_equal to test/cpp/api/support.h (#30426) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/30426 This PR adds `assert_tensor_equal` and `assert_tensor_not_equal` to `test/cpp/api/support.h`, as better functions for testing whether two tensors are equal / not equal. Test Plan: Imported from OSS Differential Revision: D18695900 Pulled By: yf225 fbshipit-source-id: c19b9bc4c4e84d9f444015023649d27618fcbdf5

view details

Jerry Zhang

commit sha c32fa465a556a48958d9e226403dafaf5964db8a

Preserve Backward compatibility of models serialized before #31040 (#33796) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33796 Test Plan: Imported from OSS Differential Revision: D20109662 Pulled By: jerryzh168 fbshipit-source-id: 9bc936a59fd6dd1031fbf05eb90f98ae9677b936

view details

push time in a month

push event duncanriach/pytorch

Duncan Riach

commit sha 5d41967fdbcee2630065d606cd521a75095a1e93

Add info about nondeterminism on CUDA backend for various ops

view details

push time in a month

push event duncanriach/pytorch

Duncan Riach

commit sha 8fde4dc9acc553d2dc05083d47d54f4ec177a50d

Add info about nondeterminism on CUDA backend for various ops

view details

push time in a month

push event duncanriach/pytorch

Duncan Riach

commit sha 782c7007672864faad5523eb8a325b0c00bf4aeb

Add info about nondeterminism on CUDA backend for various ops

view details

push time in a month

push event duncanriach/pytorch

Duncan Riach

commit sha 7218a9b24d9b425f0acc76c5564cdeb68f8db4d8

Add info about nondeterminism on CUDA backend for various ops

view details

push time in a month

push event duncanriach/pytorch

Duncan Riach

commit sha e07eed926b9ac9164ab237dc958804d02d5fd12f

Add info about non-determinism on CUDA backend for various ops

view details

push time in a month

delete branch NVIDIA/tensorflow-determinism

delete branch : doc

delete time in a month

push event NVIDIA/tensorflow-determinism

Duncan Riach

commit sha 53ad7e763a9d0f9a18bbce5f699cca9ff1db4ce4

Add info about GPU-deterministic bilinear resizing

view details

push time in a month

push event NVIDIA/tensorflow-determinism

Duncan Riach

commit sha 620a8bbfbca51e173318582572f6b8d69acf259b

Add info about GPU-deterministic bilinear resizing

view details

push time in a month

create branch NVIDIA/tensorflow-determinism

branch : doc

created branch time in a month

issue comment NVIDIA/tensorflow-determinism

How can I get deterministic, back-propagatable upsampling?

Thanks for the follow-up, @louislbc. tf.keras.layers.UpSampling2D uses tf.image.resize, so both of these exhibit the same issue. Thanks for sharing that Conv2DTranspose can be used as a work-around. That matches what I would expect.

In TF 2.x, tf.image.resize with method=ResizeMethod.BILINEAR, which is the default, will have deterministic back-prop in NGC (NVIDIA GPU Cloud) TensorFlow container version 20.03, releasing in March 2020. I will also be upstreaming this solution to stock TensorFlow as soon as possible. I'm not sure if I will be able to get this into stock TF 2.2, but it will most likely be in stock TF 2.3. For others who might be reading this, and who want to continue using the TF1 API, this feature will also be available in the TF1 version of the container, via tf.image.resize_bilinear.

Deterministic back-prop with tf.keras.layers.UpSampling2D will then also be available when interpolation='bilinear' is selected.
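
In the meantime, the work-around might look like this (a minimal sketch with hypothetical layer shapes):

```python
import tensorflow as tf

inputs = tf.keras.Input(shape=(32, 32, 64))

# Non-deterministic back-prop today (routes through tf.image.resize):
# x = tf.keras.layers.UpSampling2D(size=2, interpolation='bilinear')(inputs)

# Work-around: a strided transposed convolution that learns its own 2x
# upsampling; its cuDNN kernels can be made deterministic via
# TF_DETERMINISTIC_OPS.
x = tf.keras.layers.Conv2DTranspose(
    filters=64, kernel_size=3, strides=2, padding='same')(inputs)

model = tf.keras.Model(inputs, x)
print(model.output_shape)  # (None, 64, 64, 64)
```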

louislbc

comment created time in a month

pull request comment pytorch/pytorch

Enhance reproducibility documentation

Thanks. I'll add a commit to this PR for the documentation and continue simplifying a reproducer.

duncanriach

comment created time in a month

pull request comment pytorch/pytorch

Enhance reproducibility documentation

@ngimel and @kurtamohler, I think I've made a mistake regarding BCE Loss. I don't think that it's actually the source of the non-determinism that I'm seeing. Would you still like to take the other modifications suggested in the pull request?

Also, while you're here, it now appears that the non-determinism is being introduced by torch.repeat_interleave. Is it possible that the back-prop reduction through the broadcast is using atomicAdd? I'm seeing non-determinism when repeats is greater than 4 for the innermost dimension, and when repeats is greater than 16 for the other dimensions.
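
Here's the kind of check that surfaces this (a minimal sketch with hypothetical sizes):

```python
import torch

# Back-prop through torch.repeat_interleave twice on identical data and
# compare the resulting gradients bitwise. On CUDA, a scatter-add back-prop
# implemented with atomicAdd could make the two gradients differ.
torch.manual_seed(0)
x = torch.randn(1024, 8, device='cuda', requires_grad=True)
w = torch.randn(1024, 64, device='cuda')  # fixed weights, so addends differ

grads = []
for _ in range(2):
    x.grad = None
    y = x.repeat_interleave(8, dim=1)  # repeats > 4 into the innermost dim
    (y * w).sum().backward()
    grads.append(x.grad.clone())

print(torch.equal(grads[0], grads[1]))  # False indicates non-determinism
```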

duncanriach

comment created time in a month

pull request comment pytorch/pytorch

Enhance reproducibility documentation

@ngimel and @kurtamohler, I could not find the use of atomic operations either. I can get back to you with a reproducer.

duncanriach

comment created time in a month

delete branch NVIDIA/tensorflow-determinism

delete branch : doc

delete time in a month

push event NVIDIA/tensorflow-determinism

Duncan Riach

commit sha 45816e7ffcffab806d1f89f1a566a025f3152796

Update for deterministic XLA reductions

view details

push time in a month

push event duncanriach/tensorflow

A. Unique TensorFlower

commit sha a90fa384cc031905420daa81a553bda6db4cc7bd

[TF2:XLA] Lower/UpperBound ops for tf.searchsorted. PiperOrigin-RevId: 294774880 Change-Id: If6137584ba86507912d0581d611eff01bb327ebd

view details

Gunhan Gulsoy

commit sha f04915d2c83bc708c6b1ed33f8f7dfc391e0d2dd

Implement GetTempFilename for windows. Instead of an ad-hoc temp filename creation logic in gcs_filesystem, use the one provided in platform/path. PiperOrigin-RevId: 294778438 Change-Id: Ib4dfb32c76bda697f9ccde12d9fdadb42a3e6e3e

view details

A. Unique TensorFlower

commit sha 52281ba252094fc201d2dbcb49c9c1fa9d17ad03

Introduce a memory leak detection utility. PiperOrigin-RevId: 294780979 Change-Id: I27b18224dbb49535beaa7ce81906f5686cebb7ef

view details

Jiho Choi

commit sha 76e77cf61c5e3fab34f1ff119f5fe4fa77be590d

Change DerivedXLineBuilder to accept the vector of event metadata and maintain dependency with other lines. PiperOrigin-RevId: 294781084 Change-Id: Ied1b11a4cdbb33a0b16282b867174e2048fd6904

view details

TensorFlower Gardener

commit sha d74394747a2e253f9e42ed9c5e0e44f628fd200d

Merge of b75a6222b82bb556f63f7a5a04cab45212ed30c6 PiperOrigin-RevId: 294781398 Change-Id: I3cc915dd058a9b1414a7885794e4b95522ea910c

view details

Yanhui Liang

commit sha 13655728cda68ce4d8eefb92124b3b2191991dce

Fix the timeout of nasnet test. PiperOrigin-RevId: 294782893 Change-Id: Iffc97ba7ecb2072fd6d42ba7d9923952b157452d

view details

George Karpenkov

commit sha 4f2afc07b26748f80a0de768fac81c7816410b44

[XLA/GPU] Adapt tree reduction pass to the new column reduction algorithm Use the fact that we now can reduce up to 4096 items deterministically without atomics in a single kernel launch. PiperOrigin-RevId: 294783585 Change-Id: Ie941c5adc990d130104f9cf924e97859695ce0eb

view details

Rick Chao

commit sha fc36231b872e10793163cb62eef907dfcf0e7cff

Fix multi_worker_callback_tf2_test test target by only running it with CPU. The test is not using GPU anyway. PiperOrigin-RevId: 294784694 Change-Id: I9e9d2f8db05160799ef64b43f0cc8b1a927637e0

view details

Andrew Audibert

commit sha 47940211fdf68f9422f93a0c0c08382d03bdd438

Enable op-level dataset determinism configuration for ParallelMap Users can control determinism at a per-op level by specifying `deterministic` when calling map(). The `deterministic` argument takes higher priority than the `experimental_deterministic` dataset option. PiperOrigin-RevId: 294786773 Change-Id: If89f87dbe2adb51aad79791aa3f18072132e74c6

view details
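
For context, the per-op control described in the commit above looks like this in use (a minimal sketch; TF 2.2+, and the pipeline itself is illustrative):

```python
import tensorflow as tf

# The `deterministic` argument to map() takes priority over the
# dataset-level `experimental_deterministic` option, for this
# transformation only.
ds = tf.data.Dataset.range(100)
ds = ds.map(
    lambda x: x * 2,
    num_parallel_calls=tf.data.experimental.AUTOTUNE,
    deterministic=True)  # preserve element order for this op only
```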

A. Unique TensorFlower

commit sha 2522a14a11f20a49cc473c4587fb3bef55403be5

Update ops-related pbtxt files. PiperOrigin-RevId: 294789924 Change-Id: I26834db79554b0d7de02db7f669b784bed5e711f

view details

A. Unique TensorFlower

commit sha 10a29d7a5029207c739ba51502b2a00e8067b01b

Go: Update generated wrapper functions for TensorFlow ops. PiperOrigin-RevId: 294790443 Change-Id: Idf262e9b1e54f3e1b9cb7c8a55df86866f56f5d6

view details

A. Unique TensorFlower

commit sha ee6c34b5a9d3e743c0f9f00fa2a6b18555ee2981

Automated rollback of commit a90fa384cc031905420daa81a553bda6db4cc7bd PiperOrigin-RevId: 294792062 Change-Id: I56c1915922822e2ddf2fd445fafd1e6590acba04

view details

Timon Van Overveldt

commit sha 596090262a846253f1c0a66fe81119de52141078

Revert to fnmatch-based FileSystem::Match on mobile, avoids APK size increase from RE2 dep. PiperOrigin-RevId: 294796880 Change-Id: I7f43104cbf4e261187204d91c41cfadb0098d5ed

view details

A. Unique TensorFlower

commit sha 76de04167b3871fd4e5c109f118fa536aea56337

give RemoteCallOp a unique name in TraceString (the function name). PiperOrigin-RevId: 294799569 Change-Id: I233a474fd92c2590e6ad450df0465664ac9fc815

view details

Frank Chen

commit sha f508aca7555e86f7c15b1caba19241ee8b9af426

Add absl/base/attributes.h to slow_operation_alarm.h as ABSL_MUST_USE_RESULT is used PiperOrigin-RevId: 294806806 Change-Id: Icae8bd5fb3a549ac544d62e2065baa4c55f9afd6

view details

Zhenyu Tan

commit sha cb70d1216ceb47d46a2226eb89c6e9de915c9759

API doc fix for model_to_estimator. PiperOrigin-RevId: 294808243 Change-Id: I337c7e07fdeb1cab8478b27424806c4188e2e419

view details

A. Unique TensorFlower

commit sha 568ee1a547505c443a81df984e0fb9b70a3488f0

Update ops-related pbtxt files. PiperOrigin-RevId: 294808525 Change-Id: Id1021e66cab2e56c6e6ddb5714496fc481618651

view details

George Karpenkov

commit sha ad526e38168afe1097191014a7ea045ee62bc767

[TF/XLA] Make an assumption that reference variables cannot occur inside tf.function Fixes #35874 PiperOrigin-RevId: 294810237 Change-Id: I322cfebb1e915c967dd0c73d52c5e3b9d0f9030b

view details

Mark Daoust

commit sha f00437d8118a28576f9d74eb3c74e09783014c57

Show raw ops pages, but hide them from search. PiperOrigin-RevId: 294811206 Change-Id: Ia37cb4337e5370aa68832fe02596da87988ce684

view details

Gunhan Gulsoy

commit sha 9afba0fc0d43972177fa7ddf250d348b5d17b829

Remove dependence on rand_r in xla/tests:reduce_test rand_r is not available on windows PiperOrigin-RevId: 294813530 Change-Id: I4ff19ac7b5831b54d0825ebff81c2d3ec80b3e16

view details

push time in a month

push event NVIDIA/tensorflow-determinism

Duncan Riach

commit sha 5a8253f621ef40456a04d62da518cea7ab6ada1d

Update for deterministic XLA reductions

view details

push time in a month

push event NVIDIA/tensorflow-determinism

Duncan Riach

commit sha 4d5f35a7afe67b2e3bcf6be4bd2a95fe273dbeaa

Update for deterministic XLA reductions

view details

push time in a month

push event NVIDIA/tensorflow-determinism

Duncan Riach

commit sha 0e7e2ebe2fd9a9c1a90a4fc9d32dd65594a94a1a

Update for deterministic XLA reductions

view details

push time in a month

push event NVIDIA/tensorflow-determinism

Duncan Riach

commit sha a46c275f4faccc9c177ef634f18ead32cdf7c66f

Update for deterministic XLA reductions

view details

push time in a month

push event NVIDIA/tensorflow-determinism

Duncan Riach

commit sha 7a16e6efe447f67b61cd28311eaed53bfdaed2ae

Update for deterministic XLA reductions

view details

push time in a month

create branch NVIDIA/tensorflow-determinism

branch : doc

created branch time in a month

delete branch duncanriach/tensorflow

delete branch : bias_add_test_eager_mode

delete time in a month

delete branch NVIDIA/tensorflow-determinism

delete branch : doc

delete time in a month

push event NVIDIA/tensorflow-determinism

Duncan Riach

commit sha 32bd04361cbc2c3b25e4d9e741b63e985d7d601d

Add reference to PyTorch PR 33795

view details

push time in a month

push event NVIDIA/tensorflow-determinism

Duncan Riach

commit sha 77fe6ae3ff3ebd3de464073cf6202b8bd8f5a0e1

Add reference to PyTorch PR 33795

view details

push time in a month

push event NVIDIA/tensorflow-determinism

Duncan Riach

commit sha 17f47ee3b183b1c9b11a6d3effa2e4a10311ac4e

Add reference to PyTorch PR 33795

view details

push time in a month

create branch NVIDIA/tensorflow-determinism

branch : doc

created branch time in a month

PR opened pytorch/pytorch

Enhance reproducibility documentation.

Improves explanation of non-determinism when running on GPUs. Adds info about torch.nn.BCELoss operating non-deterministically on GPUs.
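
For background, the guidance being enhanced revolves around settings like these; a sketch of the commonly recommended configuration, not the PR's diff itself:

```python
import random

import numpy as np
import torch

# Seed all of the relevant random number generators.
random.seed(0)
np.random.seed(0)
torch.manual_seed(0)

# Select deterministic cuDNN kernels and disable non-deterministic
# autotuning of convolution algorithms.
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

# Note: even with these settings, some ops (e.g., those whose CUDA
# back-prop uses atomicAdd) may still be non-deterministic.
```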

+26 -9

0 comment

1 changed file

pr created time in a month

create branch duncanriach/pytorch

branch : enhance-reproducibility-documentation

created branch time in a month

fork duncanriach/pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration

https://pytorch.org

fork in a month

Pull request review comment tensorflow/tensorflow

Add info about TF_DETERMINISTIC_OPS to version 2.1 release notes

 TensorFlow 2.1 will be the last TF release supporting Python 2. Python 2 support
   * Changes rebatching for `tf.data datasets` + distribution strategies for better performance. Note that the dataset also behaves slightly differently, in that the rebatched dataset cardinality will always be a multiple of the number of replicas.
 * `TensorRT`
   * [TensorRT 6.0](https://developer.nvidia.com/tensorrt#tensorrt-whats-new) is now supported and enabled by default. This adds support for more TensorFlow ops including Conv3D, Conv3DBackpropInputV2, AvgPool3D, MaxPool3D, ResizeBilinear, and ResizeNearestNeighbor. In addition, the TensorFlow-TensorRT python conversion API is exported as `tf.experimental.tensorrt.Converter`.
+  * Environment variable `TF_DETERMINISTIC_OPS` added. When set to "true" or "1", this environment variable makes `tf.nn.bias_add` operate deterministically (i.e. reproducibly) when XLA JIT compilation is *not* enabled. It also makes cuDNN convolution and max-pooling operate deterministically. This makes Keras Conv*D and MaxPool*D layers operate deterministically in both the forward and backward directions when running on a CUDA-enabled GPU.

Nice. Thanks so much, @cheshire!

duncanriach

comment created time in 2 months

Pull request review comment tensorflow/tensorflow

Add info about TF_DETERMINISTIC_OPS to version 2.1 release notes

 TensorFlow 2.1 will be the last TF release supporting Python 2. Python 2 support
   * Changes rebatching for `tf.data datasets` + distribution strategies for better performance. Note that the dataset also behaves slightly differently, in that the rebatched dataset cardinality will always be a multiple of the number of replicas.
 * `TensorRT`
   * [TensorRT 6.0](https://developer.nvidia.com/tensorrt#tensorrt-whats-new) is now supported and enabled by default. This adds support for more TensorFlow ops including Conv3D, Conv3DBackpropInputV2, AvgPool3D, MaxPool3D, ResizeBilinear, and ResizeNearestNeighbor. In addition, the TensorFlow-TensorRT python conversion API is exported as `tf.experimental.tensorrt.Converter`.
+  * Environment variable `TF_DETERMINISTIC_OPS` added. When set to "true" or "1", this environment variable makes `tf.nn.bias_add` operate deterministically (i.e. reproducibly) when XLA JIT compilation is *not* enabled. It also makes cuDNN convolution and max-pooling operate deterministically. This makes Keras Conv*D and MaxPool*D layers operate deterministically in both the forward and backward directions when running on a CUDA-enabled GPU.

Awesome! Thanks for doing this and for letting me know, @cheshire. I plan to do the following (a usage sketch follows the list):

  • Update [tensorflow-determinism](https://github.com/NVIDIA/tensorflow-determinism) to reflect this feature enhancement (coming in TF 2.2).
  • Add info about this to the TF 2.2 release notes when that snaps.
  • Submit a PR (hopefully before the 2.2 snap) to (1) make TF_DETERMINISTIC_OPS cached and sticky, to exactly match its functionality elsewhere, and (2) enable the deterministic bias_add [kernel test](https://github.com/tensorflow/tensorflow/blob/v2.1.0/tensorflow/python/kernel_tests/BUILD#L1680) for XLA.
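
For anyone landing here, enabling the behavior described above looks like this (a minimal sketch):

```python
import os

# Select the deterministic op implementations by setting the environment
# variable before TensorFlow runs any of the affected ops.
os.environ['TF_DETERMINISTIC_OPS'] = '1'

import tensorflow as tf  # imported after setting the variable, to be safe

# With this set, cuDNN convolution and max-pooling (and therefore Keras
# Conv*D and MaxPool*D layers) and tf.nn.bias_add run deterministically
# on a CUDA-enabled GPU, in both the forward and backward directions.
```
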
duncanriach

comment created time in 2 months

push event NVIDIA/tensorflow-determinism

Duncan Riach

commit sha 34f08c613244df504fadae817c566a73af12c35a

Adjust formatting

view details

push time in 2 months

push event NVIDIA/tensorflow-determinism

Duncan Riach

commit sha 3dd181942068f27e6daf043686234ad2f363abac

Update for stock TensorFlow PR 33803 merged

view details

push time in 2 months

issue comment NVIDIA/tensorflow-determinism

Getting seq2seq to operate reproducibly

Hey @atebbifakhr, sorry, I have not gotten to this yet. I'll get to it as soon as I can and then get back to you.

atebbifakhr

comment created time in 2 months

delete branch duncanriach/tensorflow

delete branch : multi-algorithm-deterministic-cudnn-convolutions

delete time in 2 months

pull request comment tensorflow/tensorflow

Enable tf.nn.bias_add python op tests to work in eager mode (as well as graph mode)

Hi @sanjoy, this PR has been ready to pull since January 10 (two months ago). Is there any way it can be moved closer to merge?

duncanriach

comment created time in 2 months

push event NVIDIA/tensorflow-determinism

Duncan Riach

commit sha 22b2e1cb051d7adfd4f42d1d31a82701e92e35ed

Update patch docstring

view details

push time in 2 months

issue closed NVIDIA/tensorflow-determinism

Raise more specific exceptions

Some of the exceptions thrown should be more specific to enable specific exception handling. For example, if no patch is available, a NotImplementedError could be thrown here:

https://github.com/NVIDIA/tensorflow-determinism/blob/25d4b51006c765fbcec845baa50c59bbe8d14c01/tfdeterminism/patch.py#L75-L76

This would let users avoid linter messages such as

Catching too general exception Exception
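
With a more specific type, callers could then handle just the no-patch case; a minimal sketch, using the NotImplementedError suggested above:

```python
import tensorflow as tf
from tfdeterminism import patch

# Catch only the "no patch available" case instead of a blanket Exception
# (which trips linters).
try:
    patch()
except NotImplementedError:
    print("No determinism patch available for TensorFlow %s" % tf.__version__)
```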

closed time in 2 months

bersbersbers

issue comment NVIDIA/tensorflow-determinism

Raise more specific exceptions

Closing.

bersbersbers

comment created time in 2 months

issue comment NVIDIA/tensorflow-determinism

Raise more specific exceptions

@bersbersbers, your request has been fulfilled by this commit.

I chose to use TypeError, rather than NotImplementedError, for both current cases. When no patch is applied for a given version of TensorFlow, that might be either because that version doesn't need a patch or because a patch has not yet been developed for it. The patch code cannot know which is which up-front.
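
To illustrate (a sketch only, not the actual tfdeterminism/patch.py source; the supported-version set and function name are hypothetical):

```python
def _check_patchable(tf_version):
    # Both failure modes raise TypeError, because the code cannot tell
    # "doesn't need a patch" apart from "no patch has been developed yet".
    if not isinstance(tf_version, str):
        raise TypeError("tf_version must be a string")
    if tf_version not in ('1.14.0', '1.15.0'):  # hypothetical supported set
        raise TypeError(
            "No patch available for TensorFlow version %s; this version may "
            "not need a patch, or a patch may not have been developed yet" %
            tf_version)
```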

bersbersbers

comment created time in 2 months
