Deven Desai (deven-amd) - Software Developer, AMD Deep Learning Frameworks Group - AMD, Boxborough, MA

deven-amd/.emacs.d

emacs setup directory

deven-amd/benchmarks

A benchmark framework for Tensorflow

deven-amd/deep-learning

Repo for the Deep Learning Nanodegree Foundations program.

deven-amd/dlaicourse

Notebooks for learning deep learning

deven-amd/MIOpen

AMD's Machine Intelligence Library

deven-amd/mlir

"Multi-Level Intermediate Representation" Compiler Infrastructure

create branch ROCmSoftwarePlatform/tensorflow-upstream

branch: r1.15-rocm-deven-AMP

created branch time in 6 days

Pull request review comment tensorflow/benchmarks

Add a command line option to enable automatic mixed precision.

                     'different test runs. This flag is designed for human '
                     'consumption, and does not have any impact within the '
                     'system.')
+flags.DEFINE_boolean('auto_mixed_precision', False,

You are right.

I had incorrectly assumed that setting that flag would also enable dynamic loss scaling. I have pushed out a commit with the requested change; please review.

thanks

deven-amd

comment created time in 6 days

push event deven-amd/benchmarks

Deven Desai

commit sha b0df038d58e46ef043f370ca2ebc173ddfeb1fef

enabling dynamic loss scaling when the option to enable automatic mixed precision is specified

view details

push time in 6 days

Pull request review comment tensorflow/benchmarks

Add a command line option to enable automatic mixed precision.

                     'different test runs. This flag is designed for human '
                     'consumption, and does not have any impact within the '
                     'system.')
+flags.DEFINE_boolean('auto_mixed_precision', False,

As per the documentation, dynamic loss scaling is also automatically applied when that option is enabled. From the page: https://www.tensorflow.org/api_docs/python/tf/train/experimental/enable_mixed_precision_graph_rewrite

Calling enable_mixed_precision_graph_rewrite(opt) enables the graph rewrite operation before computing gradients. The function additionally returns an Optimizer (opt) wrapped with a LossScaleOptimizer. This prevents underflow in the float16 tensors during the backward pass.

deven-amd

comment created time in 7 days

PR opened tensorflow/benchmarks

Add a command line option to enable automatic mixed precision.

This PR/commit adds a command line option auto_mixed_precision to enable AMP (automatic mixed precision) for tf_cnn_benchmarks, based on the information given here: https://www.tensorflow.org/api_docs/python/tf/train/experimental/enable_mixed_precision_graph_rewrite

+6 -0

0 comment

1 changed file

pr created time in 7 days

push event deven-amd/benchmarks

Deven Desai

commit sha cce606c0cc5fac8bc800c4ed069a2c436882982b

Add a command line option to enable automatic mixed precision.

This PR/commit adds a command line option `auto_mixed_precision` to enable AMP (automatic mixed precision) for tf_cnn_benchmarks, based on the information given here: https://www.tensorflow.org/api_docs/python/tf/train/experimental/enable_mixed_precision_graph_rewrite

view details

push time in 7 days

fork deven-amd/benchmarks

A benchmark framework for Tensorflow

fork in 7 days

push event deven-amd/scripts

Deven Desai

commit sha c8e1180a8ed7553f047082362b45d491083013a1

dlfwv20-1 update

view details

push time in 7 days

PR opened tensorflow/tensorflow

[ROCm] re-enable the test //tensorflow/python:auto_mixed_precision_test_gpu on ROCm

This PR is to re-enable the AMP unit test on the ROCm platform.


/cc @whchung @chsigg @nvining-work

+16 -7

0 comment

2 changed files

pr created time in 8 days

push event ROCmSoftwarePlatform/tensorflow-upstream

frreiss

commit sha b6c5a9fb8e12e42681f4317f72ce52872653497c

Add soft-link to pylintrc to project root

view details

ANSHUMAN TRIPATHY

commit sha 7375586f215385d1c29a6a9538ae658f5ae7b936

Contradicting comments removed

view details

ANSHUMAN TRIPATHY

commit sha 447c1420b8c2a7d9baf19248c97880b8ff832e0f

[1] Review comments handled

view details

TensorFlower Gardener

commit sha b83cc0ac8eb83c85b78778bbe8a0c96f323d747e

Merge pull request #22231 from MichaelKonobeev:sparse-xent-op-hessian

PiperOrigin-RevId: 260802377

view details

MichaelKonobeev

commit sha ea809e3ad7c0d8a1fc1170dec6c782c7feac299b

Implement IsZero in eager mode

view details

Koan-Sin Tan

commit sha e305ac4b75a9523bf047fdaef75159f13bd04b86

[tflite] add int8 input/output to label_image

More and more models, such as MobilenetV3's EdgeTPU ones, are using post-training full integer quantization. With this patch, I can get reasonable results.

./label_image_int8 -m mobilenet_edgetpu_224_1.0_int8.tflite
Loaded model mobilenet_edgetpu_224_1.0_int8.tflite
resolved reporter
INFO: Initialized TensorFlow Lite runtime.
invoked
average time: 15.363 ms
0.867188: 653 military uniform
0.0390625: 835 suit
0.015625: 458 bow tie
0.0078125: 907 Windsor tie
0.00390625: 716 pickelhaube

view details

Koan-Sin Tan

commit sha b698e34e97ba49aa2d562a42804476ab5a024ab0

clean up

view details

Koan-Sin Tan

commit sha ec72fed7066f44d09172b3cfa358a299b7e5ec12

address review comments 1. add explicit cast back 2. change int to TfLiteType

view details

Koan-Sin Tan

commit sha ae2e9865a1ddfc782e9a41d89b59d4a7783c3f30

[tflite] bump SPLIT op ver from 1 to 3 in NNAPI delegate

I need SPLIT op version 3. Since it's supported by TFLite and NNAPI 1.2, it should be safe to bump the op version so that I can delegate SPLIT ops to accelerators.

view details

Duncan Riach

commit sha ed955df9438a4e13f33a439338c92cbc029a713d

Change bias_op tests to work in eager mode (as well as graph mode)

view details

Duncan Riach

commit sha c227f00a33de7ed10ade9bfb0ddce2833110a0e6

Fix Ubuntu Sanity CI fail due to pylint error

view details

Anuj Rawat

commit sha 5670f0f29c50dc2427f8cb4386aaeab6094083f1

Fixing test that fails on AVX512

The operation categorical_crossentropy requires taking log as an intermediate step. Due to the rank (2) and shape (3, 3) of the tensors used in this example, on AVX2 and older builds the log operation uses plog, Eigen's packet log method, whereas on AVX512 builds the log operation is not vectorized and ends up using std::log. Due to the precision mismatch between std::log and Eigen's plog, the results do not match exactly. The loss values come out to be [0.10536055 0.8046685 0.06187541] instead of [0.10536055 0.8046684 0.06187541]. This is an expected mismatch and should not fail the test.

The absolutely correct way to test would be to compare hex values and make sure that the results are within the expected range of the ULP error. An easier fix would be to reduce the precision of the test to account for such mismatches between the implementations of operators in the underlying math libraries. We are taking the second approach and will compare results after rounding to 5 decimal places.

view details

MichaelKonobeev

commit sha 6fe6391ea937a3c20308b3986f7232967e6f0268

Unconditionally tag zero tensors

view details

MichaelKonobeev

commit sha b187faf53c68ff9b0c711b246116fb81660ad4c7

Remove expired forward compatibility check

view details

MichaelKonobeev

commit sha cb9ce8a40c41d35725900f0f0e12a934e28ba837

Merge branch 'master' into sparse-xent-op-hessian

view details

Arvind Sundararajan

commit sha d28af41cf90fb85c91e09cddbb08b7ad43bf30d9

Handle indexed slice empty shapes in IndexedSlices gradients correctly.

view details

RichardXiao13

commit sha f8a15ce2b6f48523effe2dd42e7844ea7ef1d97a

Add usage example to math.poly_val

view details

RichardXiao13

commit sha 37b8d190b935d128c260c8a6acb871cd64748736

Update math_ops.py

view details

Richard Xiao

commit sha 3a63696e3b417603830131f989865a6f5b141482

Update math_ops.py

view details

William-Yin123

commit sha 691b55ff4d66e27f5a669d6955268f86627454af

committing updated docs

view details

push time in 8 days

push event deven-amd/scripts

Deven Desai

commit sha 80b70720c645419b6536eec876eb2b79e6336cd2

ixt-rack-04 update

view details

Deven Desai

commit sha 9f1d5bd89e4f1e38bdeb1f59df8c9ddc2d8137c4

Merge branch 'master' of https://github.com/deven-amd/scripts

view details

push time in 8 days

push event ROCmSoftwarePlatform/tensorflow-upstream

frreiss

commit sha b6c5a9fb8e12e42681f4317f72ce52872653497c

Add soft-link to pylintrc to project root

view details

ANSHUMAN TRIPATHY

commit sha 7375586f215385d1c29a6a9538ae658f5ae7b936

Contradicting comments removed

view details

ANSHUMAN TRIPATHY

commit sha 447c1420b8c2a7d9baf19248c97880b8ff832e0f

[1] Review comments handled

view details

Koan-Sin Tan

commit sha ae2e9865a1ddfc782e9a41d89b59d4a7783c3f30

[tflite] bump SPLIT op ver from 1 to 3 in NNAPI delegate

I need SPLIT op version 3. Since it's supported by TFLite and NNAPI 1.2, it should be safe to bump the op version so that I can delegate SPLIT ops to accelerators.

view details

Duncan Riach

commit sha ed955df9438a4e13f33a439338c92cbc029a713d

Change bias_op tests to work in eager mode (as well as graph mode)

view details

Duncan Riach

commit sha c227f00a33de7ed10ade9bfb0ddce2833110a0e6

Fix Ubuntu Sanity CI fail due to pylint error

view details

Anuj Rawat

commit sha 5670f0f29c50dc2427f8cb4386aaeab6094083f1

Fixing test that fails on AVX512

The operation categorical_crossentropy requires taking log as an intermediate step. Due to the rank (2) and shape (3, 3) of the tensors used in this example, on AVX2 and older builds the log operation uses plog, Eigen's packet log method, whereas on AVX512 builds the log operation is not vectorized and ends up using std::log. Due to the precision mismatch between std::log and Eigen's plog, the results do not match exactly. The loss values come out to be [0.10536055 0.8046685 0.06187541] instead of [0.10536055 0.8046684 0.06187541]. This is an expected mismatch and should not fail the test.

The absolutely correct way to test would be to compare hex values and make sure that the results are within the expected range of the ULP error. An easier fix would be to reduce the precision of the test to account for such mismatches between the implementations of operators in the underlying math libraries. We are taking the second approach and will compare results after rounding to 5 decimal places.

view details

Arvind Sundararajan

commit sha d28af41cf90fb85c91e09cddbb08b7ad43bf30d9

Handle indexed slice empty shapes in IndexedSlices gradients correctly.

view details

Lamar

commit sha 71dd20a99530f22c86a987088484db8f4f227e52

fixed static sized arrays with variable length

Using const int or int for the size of an array implies that it has variable length (ill-formed, https://en.cppreference.com/w/cpp/language/ub); static arrays' lengths should be constexpr or a macro constant.

view details

Duncan Riach

commit sha 4ea10c4bcc1ca3d98e34c6742220c2c8fe9df946

Fix Ubuntu Sanity CI fail due to pylint error

view details

Eugene Kuznetsov

commit sha af54994072bda083229fd11cb2b1d58e2cd38ab0

Implementing GpuManagedAllocator for ROCm

Enabling several common runtime unit tests for ROCm

view details

Eugene Kuznetsov

commit sha ae0e325a9fd53f2981bc569a2e3f8699c72a2ddc

Fixing ROCm LSTM and GRU v2 test

view details

Koan-Sin Tan

commit sha d768d147870b202559878c610c366a0ac536a748

[tflite] enable INT8 for Java binding

Some models created by full-integer post-training quantization, e.g., the mobilenet v3 edgetpu one [1], have INT8 input and output tensors.

[1] https://storage.cloud.google.com/mobilenet_edgetpu/checkpoints/mobilenet_edgetpu_224_1.0.tgz

view details

Gaurav Singh

commit sha 86374c2b0623ae3295ac3eb5d89f0fc95ba80bd0

[lite] pass array_names by const ref

view details

Tamas Bela Feher

commit sha 9cd49359bc3179172bfe5a7d1a50635fe550862c

Set input_shapes attribute for TrtEngineOp

view details

Tamas Bela Feher

commit sha 2f26032e77b6d0e4ac3ae632a079a3aafe4307c4

Enable tensors with unknown dimension in explicit_batch_mode

view details

Tamas Bela Feher

commit sha 0857ed43b101850eec7a4e57e8f95dbb76cd0ee1

Use input_shapes attribute in TrtEngineOp

view details

Tamas Bela Feher

commit sha 2afac1fd2f696f1a1bfc2772f4e6639240c92ca8

Test engine creation with dynamic input shapes

view details

Tamas Bela Feher

commit sha 63e14a845c4a174fc80c58b786379d36849d345c

Add input/output mask option to BuildParams

view details

Tamas Bela Feher

commit sha b62263abc46ed567e9e4794b9cb52c2631edb0da

Update explicit_batch_test test for unknown shapes

view details

push time in 8 days

push event ROCmSoftwarePlatform/tensorflow-upstream

Deven Desai

commit sha 001195f341efa80ed7bdc3b450ca08a36fe15ccb

remove the no_rocm tag from //tensorflow/python:auto_mixed_precision_test_gpu

view details

Deven Desai

commit sha 9569f41eb1c8b6a5e8d3fba62d66ad936dddea88

Merge pull request #866 from ROCmSoftwarePlatform/develop-upstream-deven-amp

remove the no_rocm tag from //tensorflow/python:auto_mixed_precision_test_gpu

view details

push time in 9 days

PR opened tensorflow/tensorflow

[ROCm] Workaround for a known CPU/GPU kernel interface args passing bug

I am working on submitting an update to Eigen to enable fp16 packet optimization for the ROCm platform. When testing those updates with TF, they trigger a known bug in CPU/GPU kernel interface argument passing.

Basically, the CPU code (.cc files) is compiled with an old version of gcc (v5.4, i.e. the one that comes by default with Ubuntu 16.04) and the GPU code (.cu.cc files) is compiled with HCC (which is clang 10 based). This results in the GPU kernel arguments sometimes getting corrupted or malformed. We do not have a fix for the issue, short of using a newer gcc version which does not have this "bug". We have discovered that packing all the arguments within a struct seems to work around the bug, and that is what this PR does.
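To illustrate the pattern (a minimal, hypothetical sketch; the kernel and names below are for illustration only, not the actual TF code): rather than passing each scalar/pointer argument individually across the gcc/HCC ABI boundary, the kernel takes a single POD struct by value.

// All kernel arguments packed into one plain-old-data struct.
struct ScaleArgs {
  const float* in;   // device pointer to input
  float* out;        // device pointer to output
  int n;             // element count
  float scale;       // multiplier
};

// HIP/CUDA-style kernel receiving the single struct argument.
__global__ void ScaleKernel(ScaleArgs args) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < args.n) args.out[i] = args.in[i] * args.scale;
}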


/cc @whchung @chsigg @nvining-work

+17 -9

0 comment

3 changed files

pr created time in 11 days

pull request comment ROCmSoftwarePlatform/tensorflow-upstream

Updating CI scripts to run "large" tests, and re-classing some tests as "large"

upstream PR : https://github.com/tensorflow/tensorflow/pull/36762

deven-amd

comment created time in 11 days

PR opened tensorflow/tensorflow

[ROCm] Updating ROCm CI scripts to include large tests

This PR updates the ROCm CI scripts to include large tests.

On the ROCm platform, the CI scripts run tests with sharding disabled, in order to prevent the same test from running more than one subtest concurrently on the same GPU. Note that tests are still run in parallel (the number of tests run in parallel equals the number of GPUs available on the testing machine), just not sharded across individual GPUs.

Because sharding is disabled (and all subtests within a given test execute serially on the GPU), tests that have a lot of subtests take much longer to run. Some of these tests were getting timed out, and need to be re-classified as large (to raise their timeout limit).

There does not seem to be a way to conditionally specify the value of the size arg (i.e. large for ROCm, but medium for everybody else), hence the change to large is unconditional.


/cc @whchung @chsigg @nvining-work

+15 -9

0 comment

7 changed files

pr created time in 11 days

push event ROCmSoftwarePlatform/tensorflow-upstream

Deven Desai

commit sha c97e075b13c28d3a61c28abcd3e6c59c227124d5

re-classing the long running tests as "large"

view details

Deven Desai

commit sha 7d6a6ba25174338aca99295c847496fa44e2d7e8

Adding no_rocm tag to "large" tests that are failing on the ROCm platform. Note that these are not regressions per se. These tests were not being run as part of ROCm CI tests, as only `small` and `medium` tests were being run by default. Now that we are in the process of adding `large` tests to the ROCm CI run, we have uncovered these failures. The failures will be analysed and fixed if necessary in a separate PR.

view details

push time in 11 days

push event deven-amd/scripts

Deven Desai

commit sha 79f87c8ae90ea3f6a0bb75921dfddc3a82e98d2f

prj47-rack-15 update

view details

push time in 11 days

push event deven-amd/scripts

Deven Desai

commit sha 077ec1a2455b81de9b39fd795cce111407ac829b

prj47-rack-15 update

view details

Deven Desai

commit sha 5b718d5b534d897adddc7dfe9aa18cba1c1250df

Merge branch 'master' of https://github.com/deven-amd/scripts

view details

push time in 12 days

Pull request review comment ROCmSoftwarePlatform/tensorflow-upstream

[Draft] Fused multiply add ops

+/* Copyright 2016 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#if GOOGLE_CUDA || TENSORFLOW_USE_ROCM
+
+#include "tensorflow/core/kernels/cwise_op_fma.h"
+#include "tensorflow/core/kernels/cwise_ops_gpu_common.cu.h"
+#include "tensorflow/core/util/gpu_kernel_helper.h"
+
+namespace tensorflow {
+
+typedef Eigen::GpuDevice GPUDevice;
+
+template <FMAType Type, typename T> __device__ T fma_op(T m1, T m2)
+{
+    if (Type==FMAType_Add)
+      return m1 + m2;
+    else if (Type==FMAType_Sub)
+      return m1 - m2;
+    else
+      return m2 - m1;
+}
+
+//
+// Fast pathway: each tensor is either full-size or 1-element.
+//
+
+template <typename T, int N, FMAType Type>

ah...got it

ekuznetsov139

comment created time in 12 days

Pull request review comment ROCmSoftwarePlatform/tensorflow-upstream

[Draft] Fused multiply add ops

 bool ROCmFusionOpAddRelu::IsFusionEligible(const Node* relu, FusionOpData* d) {
   return is_eligible;
 }

+bool ROCmFusionOpFMA::IsFusionEligible(const Node* node, FusionOpData* d) {
+//  bool is_eligible = false;
+  auto tstr = node->type_string();
+  if(tstr != "Add"
+    && tstr != "AddV2"
+    && tstr != "Sub")
+    return false;
+  VLOG(1)<<"Trying "<<tstr<<" "<<node->in_edges().size();
+
+  if(node->in_edges().size()!=2)
+    return false;
+   // todo: can we reject if node output is 7+ dim?
+
+  DataType dtype;
+  TF_CHECK_OK(GetNodeAttr(node->def(), kAttr_T, &dtype));
+  if(!(dtype==DT_HALF || dtype==DT_FLOAT || dtype==DT_DOUBLE))
+    return false;
+
+  Node* b, *c;
+  TF_CHECK_OK(node->input_node(0, &b));
+  TF_CHECK_OK(node->input_node(1, &c));
+
+  if(!areAssignedToSameGpu({node,b,c}))
+    return false;
+
+  bool add = (tstr!="Sub");
+
+  VLOG(1)<<b->type_string()<<" "<<b->in_edges().size()<<" "<<c->type_string()<<" "<<c->in_edges().size();
+  bool can_absorb_b = (b->type_string()=="Mul" && b->in_edges().size()==2);
+  // !IsInPreserveSet(*b) && (NumNonControlOutputs(*b, *ctx().node_map) == 1));
+  bool can_absorb_c = (c->type_string()=="Mul" && c->in_edges().size()==2);
+  //!IsInPreserveSet(*c) && (NumNonControlOutputs(*c, *ctx().node_map) == 1));
+
+  if(can_absorb_b && can_absorb_c) {
+      d->op_type = add ? "_FusedMulAdd2" : "_FusedMulSub2";
+      d->op_name = strings::StrCat(b->name(), c->name());
+      d->fusion_type = d->op_type;
+      d->nodes.push_back(node);
+      d->nodes.push_back(b);
+      d->nodes.push_back(c);
+
+      std::vector<const Edge*> b_input_edges;
+      TF_CHECK_OK(b->input_edges(&b_input_edges));
+
+      std::vector<const Edge*> c_input_edges;
+      TF_CHECK_OK(c->input_edges(&c_input_edges));
+
+      d->add_data_input(0, b_input_edges[0]->src(), b_input_edges[0]->src_output());
+      d->add_data_input(1, b_input_edges[1]->src(), b_input_edges[1]->src_output());
+      d->add_data_input(2, c_input_edges[0]->src(), c_input_edges[0]->src_output());
+      d->add_data_input(3, c_input_edges[1]->src(), c_input_edges[1]->src_output());
+
+      // populate the input control edges
+      for (const Edge* e : b->in_edges()) {
+        if (e->IsControlEdge()) {
+          d->control_inputs.push_back(e->src());
+        }
+      }
+      for (const Edge* e : c->in_edges()) {
+        if (e->IsControlEdge()) {
+          d->control_inputs.push_back(e->src());
+        }
+      }
+

void FusionOpData::add_input_control(Node* node) too

ekuznetsov139

comment created time in 12 days

Pull request review comment ROCmSoftwarePlatform/tensorflow-upstream

[Draft] Fused multiply add ops

 bool ROCmFusionOpAddRelu::IsFusionEligible(const Node* relu, FusionOpData* d) {
   return is_eligible;
 }

+bool ROCmFusionOpFMA::IsFusionEligible(const Node* node, FusionOpData* d) {
+//  bool is_eligible = false;
+  auto tstr = node->type_string();
+  if(tstr != "Add"
+    && tstr != "AddV2"
+    && tstr != "Sub")
+    return false;
+  VLOG(1)<<"Trying "<<tstr<<" "<<node->in_edges().size();
+
+  if(node->in_edges().size()!=2)
+    return false;
+   // todo: can we reject if node output is 7+ dim?
+
+  DataType dtype;
+  TF_CHECK_OK(GetNodeAttr(node->def(), kAttr_T, &dtype));
+  if(!(dtype==DT_HALF || dtype==DT_FLOAT || dtype==DT_DOUBLE))
+    return false;
+
+  Node* b, *c;
+  TF_CHECK_OK(node->input_node(0, &b));
+  TF_CHECK_OK(node->input_node(1, &c));
+
+  if(!areAssignedToSameGpu({node,b,c}))
+    return false;
+
+  bool add = (tstr!="Sub");
+
+  VLOG(1)<<b->type_string()<<" "<<b->in_edges().size()<<" "<<c->type_string()<<" "<<c->in_edges().size();
+  bool can_absorb_b = (b->type_string()=="Mul" && b->in_edges().size()==2);
+  // !IsInPreserveSet(*b) && (NumNonControlOutputs(*b, *ctx().node_map) == 1));
+  bool can_absorb_c = (c->type_string()=="Mul" && c->in_edges().size()==2);
+  //!IsInPreserveSet(*c) && (NumNonControlOutputs(*c, *ctx().node_map) == 1));
+
+  if(can_absorb_b && can_absorb_c) {
+      d->op_type = add ? "_FusedMulAdd2" : "_FusedMulSub2";
+      d->op_name = strings::StrCat(b->name(), c->name());
+      d->fusion_type = d->op_type;
+      d->nodes.push_back(node);
+      d->nodes.push_back(b);
+      d->nodes.push_back(c);
+
+      std::vector<const Edge*> b_input_edges;
+      TF_CHECK_OK(b->input_edges(&b_input_edges));
+
+      std::vector<const Edge*> c_input_edges;
+      TF_CHECK_OK(c->input_edges(&c_input_edges));
+
+      d->add_data_input(0, b_input_edges[0]->src(), b_input_edges[0]->src_output());
+      d->add_data_input(1, b_input_edges[1]->src(), b_input_edges[1]->src_output());
+      d->add_data_input(2, c_input_edges[0]->src(), c_input_edges[0]->src_output());
+      d->add_data_input(3, c_input_edges[1]->src(), c_input_edges[1]->src_output());
+
+      // populate the input control edges
+      for (const Edge* e : b->in_edges()) {
+        if (e->IsControlEdge()) {
+          d->control_inputs.push_back(e->src());
+        }
+      }
+      for (const Edge* e : c->in_edges()) {
+        if (e->IsControlEdge()) {
+          d->control_inputs.push_back(e->src());
+        }
+      }
+

yes. see example here https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/blob/develop-upstream/tensorflow/core/graph/gpu_fusion_pass.cc#L897

we probably want to add a void FusionOpData::add_output_control(Node* node) routine, and add calls to it everywhere in this file

ekuznetsov139

comment created time in 12 days

Pull request review comment ROCmSoftwarePlatform/tensorflow-upstream

[Draft] Fused multiply add ops

+/* Copyright 2016 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#if GOOGLE_CUDA || TENSORFLOW_USE_ROCM
+
+#include "tensorflow/core/kernels/cwise_op_fma.h"
+#include "tensorflow/core/kernels/cwise_ops_gpu_common.cu.h"
+#include "tensorflow/core/util/gpu_kernel_helper.h"
+
+namespace tensorflow {
+
+typedef Eigen::GpuDevice GPUDevice;
+
+template <FMAType Type, typename T> __device__ T fma_op(T m1, T m2)
+{
+    if (Type==FMAType_Add)
+      return m1 + m2;
+    else if (Type==FMAType_Sub)
+      return m1 - m2;
+    else
+      return m2 - m1;
+}
+
+//
+// Fast pathway: each tensor is either full-size or 1-element.
+//
+
+template <typename T, int N, FMAType Type>

that is strange...don't the template params need to be known at compile time by definition?

ekuznetsov139

comment created time in 12 days

push event ROCmSoftwarePlatform/tensorflow-upstream

Deven Desai

commit sha 98149e6198f5faf72181b38a2516400f65a66474

updating ROCm CI scripts to include "large" tests

view details

Deven Desai

commit sha afabf2fae17bbb7313c0216283cb74a34a5c6dcc

re-classing the long running save/restore tests as "large"

view details

Deven Desai

commit sha 5ed6721ca0a46977ecbcabea22c6d2a24ad6213c

Adding no_rocm tag to "large" tests that are failing on the ROCm platform. Note that these are not regressions per se. These tests were not being run as part of ROCm CI tests, as only `small` and `medium` tests were being run by default. Now that we are in the process of adding `large` tests to the ROCm CI run, we have uncovered these failures. The failures will be analysed and fixed if necessary in a separate PR.

view details

Deven Desai

commit sha 2e335dbb0f84465dc3014f7813e2c4f60b09867c

Merge pull request #862 from ROCmSoftwarePlatform/develop-upstream-deven-add-ci-large

Updating CI scripts to run "large" tests, and re-classing some tests as "large"

view details

push time in 12 days

Pull request review comment ROCmSoftwarePlatform/tensorflow-upstream

[Draft] Fused multiply add ops

+/* Copyright 2016 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#if GOOGLE_CUDA || TENSORFLOW_USE_ROCM
+
+#include "tensorflow/core/kernels/cwise_op_fma.h"
+#include "tensorflow/core/kernels/cwise_ops_gpu_common.cu.h"
+#include "tensorflow/core/util/gpu_kernel_helper.h"
+
+namespace tensorflow {
+
+typedef Eigen::GpuDevice GPUDevice;
+
+template <FMAType Type, typename T> __device__ T fma_op(T m1, T m2)
+{
+    if (Type==FMAType_Add)
+      return m1 + m2;
+    else if (Type==FMAType_Sub)
+      return m1 - m2;
+    else
+      return m2 - m1;
+}
+
+//
+// Fast pathway: each tensor is either full-size or 1-element.
+//
+
+template <typename T, int N, FMAType Type>

can't you eliminate the execute routine altogether? i.e. wouldn't the following work:

template<typename T, bool broadcast_x1, bool broadcast_y1, bool broadcast_x2, FMAType Type>
__global__ void CwiseFusedMulAddKernel(GpuLaunchConfig cfg, T* out, const T* x1,
                                       const T* y1, const T* x2) {
  GPU_1D_KERNEL_LOOP(i, cfg.virtual_thread_count) {
    out[i]=fma_op<Type>(x1[broadcast_x1 ? 0 : i] * y1[broadcast_y1 ? 0 : i],
             x2[broadcast_x2 ? 0 : i]);
  }
}

template <typename T, FMAType Type>
void LaunchFusedMulAddOp<GPUDevice, T, Type>::operator()(
    const GPUDevice& device, T* out, const T* x1, const T* y1, const T* x2,
    uint64 elements, bool broadcast_x1, bool broadcast_y1, bool broadcast_x2) {
  auto config = GetGpuLaunchConfig(elements, device);
  TF_CHECK_OK(GpuLaunchKernel(CwiseFusedMulAddKernel<T, broadcast_x1, broadcast_y1, broadcast_x2, Type>,
                              config.block_count, config.thread_per_block, 0,
                              device.stream(), config, out, x1, y1, x2));
}
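One caveat with the sketch above: template arguments must be compile-time constants, so the runtime bools would still need an explicit dispatch to the matching instantiation, roughly along these lines (helper name hypothetical; only one of the three flags shown, the full version expands to eight branches):

template <typename T, FMAType Type>
void DispatchOnBroadcastX1(const GPUDevice& device, GpuLaunchConfig config,
                           T* out, const T* x1, const T* y1, const T* x2,
                           bool broadcast_x1) {
  // Bridge the runtime flag to a compile-time template argument.
  if (broadcast_x1) {
    TF_CHECK_OK(GpuLaunchKernel(
        CwiseFusedMulAddKernel<T, true, false, false, Type>,
        config.block_count, config.thread_per_block, 0, device.stream(),
        config, out, x1, y1, x2));
  } else {
    TF_CHECK_OK(GpuLaunchKernel(
        CwiseFusedMulAddKernel<T, false, false, false, Type>,
        config.block_count, config.thread_per_block, 0, device.stream(),
        config, out, x1, y1, x2));
  }
}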
ekuznetsov139

comment created time in 12 days

Pull request review comment ROCmSoftwarePlatform/tensorflow-upstream

[Draft] Fused multiply add ops

 class ROCmFusionOpAddNReluGrad : public ROCmFusionOpBase {

 //----------------------------------------------------------------------

+class ROCmFusionOpFMA : public ROCmFusionOpBase {
+ public:
+  ROCmFusionOpFMA(Graph* g) : ROCmFusionOpBase(g) {}
+
+ protected:
+  bool IsFusionEligible(const Node* n, FusionOpData* d) override;
+};
+
+//----------------------------------------------------------------------
+
 Status ROCmFusionPass::Run(const GraphOptimizationPassOptions& options) {
   // enable the fusion pass if the env var TF_ROCM_FUSION_ENABLE is set
-  if (ReadBoolFromEnvVar("TF_ROCM_FUSION_ENABLE")) {
+  if (!ReadBoolFromEnvVar("TF_ROCM_FUSION_DISABLE")) {

Fusions are disabled by default because Conv*-based fusions are not guaranteed to work. We do not know the conv parameters during this optimization, and the MIOpen Fusion API (which is used to implement those fusions) can and will fail to "compile" certain fusions (at least that was the case 18 months ago, when we implemented this). So the decision was to disable fusions by default, and enable them manually for cases where we knew we would not run into MIOpen failures.

ekuznetsov139

comment created time in 12 days

Pull request review comment ROCmSoftwarePlatform/tensorflow-upstream

[Draft] Fused multiply add ops

 bool ROCmFusionOpAddRelu::IsFusionEligible(const Node* relu, FusionOpData* d) {
   return is_eligible;
 }

+bool ROCmFusionOpFMA::IsFusionEligible(const Node* node, FusionOpData* d) {
+//  bool is_eligible = false;
+  auto tstr = node->type_string();
+  if(tstr != "Add"
+    && tstr != "AddV2"
+    && tstr != "Sub")
+    return false;
+  VLOG(1)<<"Trying "<<tstr<<" "<<node->in_edges().size();
+
+  if(node->in_edges().size()!=2)
+    return false;
+   // todo: can we reject if node output is 7+ dim?
+
+  DataType dtype;
+  TF_CHECK_OK(GetNodeAttr(node->def(), kAttr_T, &dtype));
+  if(!(dtype==DT_HALF || dtype==DT_FLOAT || dtype==DT_DOUBLE))
+    return false;
+
+  Node* b, *c;
+  TF_CHECK_OK(node->input_node(0, &b));
+  TF_CHECK_OK(node->input_node(1, &c));
+
+  if(!areAssignedToSameGpu({node,b,c}))
+    return false;
+
+  bool add = (tstr!="Sub");
+
+  VLOG(1)<<b->type_string()<<" "<<b->in_edges().size()<<" "<<c->type_string()<<" "<<c->in_edges().size();
+  bool can_absorb_b = (b->type_string()=="Mul" && b->in_edges().size()==2);
+  // !IsInPreserveSet(*b) && (NumNonControlOutputs(*b, *ctx().node_map) == 1));
+  bool can_absorb_c = (c->type_string()=="Mul" && c->in_edges().size()==2);
+  //!IsInPreserveSet(*c) && (NumNonControlOutputs(*c, *ctx().node_map) == 1));
+
+  if(can_absorb_b && can_absorb_c) {
+      d->op_type = add ? "_FusedMulAdd2" : "_FusedMulSub2";
+      d->op_name = strings::StrCat(b->name(), c->name());
+      d->fusion_type = d->op_type;
+      d->nodes.push_back(node);
+      d->nodes.push_back(b);
+      d->nodes.push_back(c);
+
+      std::vector<const Edge*> b_input_edges;
+      TF_CHECK_OK(b->input_edges(&b_input_edges));
+
+      std::vector<const Edge*> c_input_edges;
+      TF_CHECK_OK(c->input_edges(&c_input_edges));
+
+      d->add_data_input(0, b_input_edges[0]->src(), b_input_edges[0]->src_output());
+      d->add_data_input(1, b_input_edges[1]->src(), b_input_edges[1]->src_output());
+      d->add_data_input(2, c_input_edges[0]->src(), c_input_edges[0]->src_output());
+      d->add_data_input(3, c_input_edges[1]->src(), c_input_edges[1]->src_output());
+
+      // populate the input control edges
+      for (const Edge* e : b->in_edges()) {
+        if (e->IsControlEdge()) {
+          d->control_inputs.push_back(e->src());
+        }
+      }
+      for (const Edge* e : c->in_edges()) {
+        if (e->IsControlEdge()) {
+          d->control_inputs.push_back(e->src());
+        }
+      }
+

I think control edges between nodes are used to establish an execution-order dependency (even when there is no data dependency). So if you have any control edges from either of the nodes being fused, you want to preserve them by adding them to the fused op.
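A minimal sketch of what such a helper could look like (the control_outputs member is assumed by analogy with the existing control_inputs; treat this as illustrative, not the actual implementation):

void FusionOpData::add_output_control(Node* node) {
  // Preserve execution-order dependencies: any control edge leaving the
  // node being fused is re-attached to the fused op.
  for (const Edge* e : node->out_edges()) {
    if (e->IsControlEdge()) {
      control_outputs.push_back(e->dst());
    }
  }
}

The add_input_control(Node*) routine suggested earlier in this thread would be symmetric, iterating over node->in_edges() and pushing e->src() into control_inputs.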

ekuznetsov139

comment created time in 12 days

pull request comment ROCmSoftwarePlatform/tensorflow-upstream

Updating CI scripts to run "large" tests, and re-classing some tests as "large"

re-based PR to resolve merge conflicts resulting from the weekly sync

deven-amd

comment created time in 12 days

push event ROCmSoftwarePlatform/tensorflow-upstream

Trent Lo

commit sha b33c788a2f479a4753f49b566f08079692c75af2

Implement horizontal fusion. - It reduces kernel launch overhead and increases launch dims by horizontally fusing independent computations.

view details

Trent Lo

commit sha cd68827e01d454937399bafcdb1eb4b9a116678a

Minor cleanup for horizontal fusion.

view details

Trent Lo

commit sha cb9ab8bee96530c9973d5e295b53d936cbf8ef72

Polishing coding style and comments.

view details

Trent Lo

commit sha 1876f2acc02dee840b3a8b6ab59f950b5a3bbf4f

Factor out lambdas in HorizontalFusionImpl.

view details

Trent Lo

commit sha 86bd5bf3e75cb5d14d24194a2d1e2d8f60753b03

Comment polishing.

view details

Trent Lo

commit sha 474e79985f722afa57d12447fb2f4dc30e890d06

Add some more unittests for horizontal fusion. In addition, we record the execution time of the tests here, showing the optimization effects of horizontal fusion, measured by --xla_hlo_profile. The accumulated kernel execution time in GradientDescentOptimizerLike is reduced from 2.39ms to 311us; the execution time in RMSProp is reduced from 980us to 112us.

Before horizontal fusion:
2019-12-10 22:05:45.215015: I tensorflow/compiler/xla/service/executable.cc:208] Execution profile for GradientDescentOptimizerLike: (2.39 ms @ f_nom)
2019-12-10 22:05:48.877372: I tensorflow/compiler/xla/service/executable.cc:208] Execution profile for RMSPropLike: (980 us @ f_nom)

After horizontal fusion:
2019-12-10 22:05:03.831600: I tensorflow/compiler/xla/service/executable.cc:208] Execution profile for GradientDescentOptimizerLike: (311 us @ f_nom)
2019-12-10 22:05:13.513901: I tensorflow/compiler/xla/service/executable.cc:208] Execution profile for RMSPropLike: (112 us @ f_nom)

view details

Trent Lo

commit sha a629a452bff5b7f7f2688086483d7eb8d3d02420

Polishing comments and coding styles.

view details

Stephan Uphoff

commit sha 7813cb00f35d6fc6d8ad8421021c1535f3e8c029

lite/micro: Add feature buffer to micro_speech example.

This fixes #35117. Accumulate feature slices in a separate buffer. The input tensor is not suitable for keeping state across inferences as it has limited lifetime and the buffer space may be reused.

view details

Trent Lo

commit sha 7abde726e4706df2fa83c2ec3c89ef9fb5c99228

Polish coding styles and comments based on review feedback. In addition, use hlo_matcher to verify resultant DAGs instead of LLVM filecheck.

view details

Trent Lo

commit sha 5f5aa78f86a43d073663cc0f96acb3926d621e42

Merge branch 'upstream_master_dec19' into horizontal_fusion_github

view details

Eugene Kuznetsov

commit sha 968a674ecb6db34e5d2e09068a8d9ca5ca4e3e24

Enable //tensorflow/python:stateful_random_ops_test

view details

Eugene Kuznetsov

commit sha f7b28191777b6ae86c0dbdab7a74b8370e53eaa8

Fix for //tensorflow/python:stateful_random_ops_test: Pack arguments of UpdateVariableAndFill_Philox into a struct

view details

Eugene Kuznetsov

commit sha eee5851777b842945b12937600b005a58aae0f2c

Fix for //tensorflow/python:stateful_random_ops_test: Move the thread counter into the global namespace

view details

Trent Lo

commit sha 47ba0995d9838e5f9aa634abc59f4569c4a37375

Fix a buildifier format issue.

view details

Trent Lo

commit sha eab6c5e84d44afbfd4e2b80c5dd59a6b090ed3bf

Do not fuse fusions with different output types. It is forbidden as the concatenate (inserted for horizontal fusion) requires the input operands to have the same type.

view details

Trent Lo

commit sha e09da4f3dc39efda5a8e68539bea894c88831143

Minor polishing.

view details

Trent Lo

commit sha e2989a34af44c39624cffa36cf319c66615c2483

Add a helper function GetUniqueOutputTypeOfFusion. It is safer and clearer for getting the unique output type of fusion.

view details

Trent Lo

commit sha 88521dae35d13c7b25b826e5dd4da2f2d26d6013

Minor error message polishing.

view details

Trent Lo

commit sha 30b1943d43ac83fde73069d8546ab6a5c1e68372

Minor comment polishing.

view details

Harry Slatyer

commit sha 2aaa53e21f5d12c6de74a7d73525f9fc227b13bb

Support Identity in tensor_util.constant_value. This looks just the same as StopGradient, since for the purposes of forward-propagated values the two are identical.

view details

push time in 12 days

Pull request review comment ROCmSoftwarePlatform/tensorflow-upstream

[Draft] Fused multiply add ops

+/* Copyright 2016 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#if GOOGLE_CUDA || TENSORFLOW_USE_ROCM
+
+#include "tensorflow/core/kernels/cwise_op_fma.h"
+#include "tensorflow/core/kernels/cwise_ops_gpu_common.cu.h"
+#include "tensorflow/core/util/gpu_kernel_helper.h"
+
+namespace tensorflow {
+
+typedef Eigen::GpuDevice GPUDevice;
+
+template <FMAType Type, typename T> __device__ T fma_op(T m1, T m2)
+{
+    if (Type==FMAType_Add)
+      return m1 + m2;
+    else if (Type==FMAType_Sub)
+      return m1 - m2;
+    else
+      return m2 - m1;
+}
+
+//
+// Fast pathway: each tensor is either full-size or 1-element.
+//
+
+template <typename T, int N, FMAType Type>

would it be cleaner to have the three bools as template params instead? I think that would make the code more readable and avoid the need to convert back and forth from N

something like the following here

template<typename T, bool broadcast_x1, bool broadcast_y1, bool broadcast_x2, FMAType Type>
...

and then call it like

   TF_CHECK_OK(GpuLaunchKernel(CwiseFusedMulAddKernel<T, broadcast_x1, broadcast_y1, broadcast_x2, Type>, 
...
ekuznetsov139

comment created time in 12 days

Pull request review comment ROCmSoftwarePlatform/tensorflow-upstream

[Draft] Fused multiply add ops

 Status ArithmeticOptimizer::SimplifyArithmeticOps(bool can_use_shapes) {
     pipeline.AddStage<RemoveStackStridedSliceSameAxis>(ctx, ctx_ext);
   if (options_.fuse_squared_diff)
     pipeline.AddStage<FuseSquaredDiffStage>(ctx, ctx_ext);
+//  if (options_.fuse_mul_add)
+//    pipeline.AddStage<FuseMulAddStage>(ctx, ctx_ext);

since we are no longer using them, please drop all the changes in this file.

ekuznetsov139

comment created time in 12 days

Pull request review comment ROCmSoftwarePlatform/tensorflow-upstream

[Draft] Fused multiply add ops

 bool ROCmFusionOpAddRelu::IsFusionEligible(const Node* relu, FusionOpData* d) {
   return is_eligible;
 }

+bool ROCmFusionOpFMA::IsFusionEligible(const Node* node, FusionOpData* d) {
+//  bool is_eligible = false;
+  auto tstr = node->type_string();
+  if(tstr != "Add"
+    && tstr != "AddV2"
+    && tstr != "Sub")
+    return false;
+  VLOG(1)<<"Trying "<<tstr<<" "<<node->in_edges().size();
+
+  if(node->in_edges().size()!=2)
+    return false;
+   // todo: can we reject if node output is 7+ dim?
+
+  DataType dtype;
+  TF_CHECK_OK(GetNodeAttr(node->def(), kAttr_T, &dtype));
+  if(!(dtype==DT_HALF || dtype==DT_FLOAT || dtype==DT_DOUBLE))
+    return false;
+
+  Node* b, *c;

nit: x1, y1, or perhaps input_1, input_2 instead of b, c ?

ekuznetsov139

comment created time in 12 days

Pull request review comment ROCmSoftwarePlatform/tensorflow-upstream

[Draft] Fused multiply add ops

 bool ROCmFusionOpAddRelu::IsFusionEligible(const Node* relu, FusionOpData* d) {
   return is_eligible;
 }

+bool ROCmFusionOpFMA::IsFusionEligible(const Node* node, FusionOpData* d) {
+//  bool is_eligible = false;
+  auto tstr = node->type_string();
+  if(tstr != "Add"

please add/use the isOp* methods; it is trivial, but it helps to have the code here follow similar code in this file

ekuznetsov139

comment created time in 12 days

Pull request review comment ROCmSoftwarePlatform/tensorflow-upstream

[Draft] Fused multiply add ops

 bool ROCmFusionOpAddRelu::IsFusionEligible(const Node* relu, FusionOpData* d) {
   return is_eligible;
 }

+bool ROCmFusionOpFMA::IsFusionEligible(const Node* node, FusionOpData* d) {
+//  bool is_eligible = false;
+  auto tstr = node->type_string();
+  if(tstr != "Add"
+    && tstr != "AddV2"
+    && tstr != "Sub")
+    return false;
+  VLOG(1)<<"Trying "<<tstr<<" "<<node->in_edges().size();

please use VLOG(kVlogLevel) instead.

It comes in very handy when trying to debug when things go wrong with fusion. You can set kVlogLevel to -1, and that will result in only the VLOG messages in this file being shown.
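For illustration, the requested pattern looks roughly like this (the constant's actual value in the file is assumed here):

// File-local verbosity level for this pass; set it to -1 while debugging so
// that only this file's VLOG messages show up.
const int kVlogLevel = 2;  // value assumed for illustration

VLOG(kVlogLevel) << "Trying " << tstr << " " << node->in_edges().size();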

ekuznetsov139

comment created time in 12 days

Pull request review comment ROCmSoftwarePlatform/tensorflow-upstream

[Draft] Fused multiply add ops

 class ROCmFusionOpAddNReluGrad : public ROCmFusionOpBase {

 //----------------------------------------------------------------------

+class ROCmFusionOpFMA : public ROCmFusionOpBase {
+ public:
+  ROCmFusionOpFMA(Graph* g) : ROCmFusionOpBase(g) {}
+
+ protected:
+  bool IsFusionEligible(const Node* n, FusionOpData* d) override;
+};
+
+//----------------------------------------------------------------------
+
 Status ROCmFusionPass::Run(const GraphOptimizationPassOptions& options) {
   // enable the fusion pass if the env var TF_ROCM_FUSION_ENABLE is set
-  if (ReadBoolFromEnvVar("TF_ROCM_FUSION_ENABLE")) {
+  if (!ReadBoolFromEnvVar("TF_ROCM_FUSION_DISABLE")) {

guessing the enable-by-default behaviour is for testing only? or is the intent to enable all fusions by default?

ekuznetsov139

comment created time in 12 days

Pull request review comment ROCmSoftwarePlatform/tensorflow-upstream

[Draft] Fused multiply add ops

 bool ReadBoolFromEnvVar(const char* env_var_name) {

 // graph pass grouping for this fusion pass
 const OptimizationPassRegistry::Grouping kROCmFusionPassGrouping =
-    ReadBoolFromEnvVar("TF_ROCM_FUSION_PASS_POST_PARTITIONING")
-        ? OptimizationPassRegistry::POST_PARTITIONING
-        : OptimizationPassRegistry::POST_PLACEMENT;
+    OptimizationPassRegistry::POST_PARTITIONING; // necessary for FMA fusion

please add a comment here indicating "why" POST_PARTITIONING is necessary for FMA fusion

ekuznetsov139

comment created time in 12 days

Pull request review comment ROCmSoftwarePlatform/tensorflow-upstream

[Draft] Fused multiply add ops

 bool ROCmFusionOpAddRelu::IsFusionEligible(const Node* relu, FusionOpData* d) {
   return is_eligible;
 }

+bool ROCmFusionOpFMA::IsFusionEligible(const Node* node, FusionOpData* d) {
+//  bool is_eligible = false;
+  auto tstr = node->type_string();
+  if(tstr != "Add"
+    && tstr != "AddV2"
+    && tstr != "Sub")
+    return false;
+  VLOG(1)<<"Trying "<<tstr<<" "<<node->in_edges().size();
+
+  if(node->in_edges().size()!=2)
+    return false;
+   // todo: can we reject if node output is 7+ dim?
+
+  DataType dtype;
+  TF_CHECK_OK(GetNodeAttr(node->def(), kAttr_T, &dtype));
+  if(!(dtype==DT_HALF || dtype==DT_FLOAT || dtype==DT_DOUBLE))
+    return false;
+
+  Node* b, *c;
+  TF_CHECK_OK(node->input_node(0, &b));
+  TF_CHECK_OK(node->input_node(1, &c));
+
+  if(!areAssignedToSameGpu({node,b,c}))
+    return false;
+
+  bool add = (tstr!="Sub");
+
+  VLOG(1)<<b->type_string()<<" "<<b->in_edges().size()<<" "<<c->type_string()<<" "<<c->in_edges().size();
+  bool can_absorb_b = (b->type_string()=="Mul" && b->in_edges().size()==2);
+  // !IsInPreserveSet(*b) && (NumNonControlOutputs(*b, *ctx().node_map) == 1));
+  bool can_absorb_c = (c->type_string()=="Mul" && c->in_edges().size()==2);
+  //!IsInPreserveSet(*c) && (NumNonControlOutputs(*c, *ctx().node_map) == 1));
+
+  if(can_absorb_b && can_absorb_c) {
+      d->op_type = add ? "_FusedMulAdd2" : "_FusedMulSub2";
+      d->op_name = strings::StrCat(b->name(), c->name());
+      d->fusion_type = d->op_type;
+      d->nodes.push_back(node);
+      d->nodes.push_back(b);
+      d->nodes.push_back(c);
+
+      std::vector<const Edge*> b_input_edges;
+      TF_CHECK_OK(b->input_edges(&b_input_edges));
+
+      std::vector<const Edge*> c_input_edges;
+      TF_CHECK_OK(c->input_edges(&c_input_edges));
+
+      d->add_data_input(0, b_input_edges[0]->src(), b_input_edges[0]->src_output());
+      d->add_data_input(1, b_input_edges[1]->src(), b_input_edges[1]->src_output());
+      d->add_data_input(2, c_input_edges[0]->src(), c_input_edges[0]->src_output());
+      d->add_data_input(3, c_input_edges[1]->src(), c_input_edges[1]->src_output());
+
+      // populate the input control edges
+      for (const Edge* e : b->in_edges()) {
+        if (e->IsControlEdge()) {
+          d->control_inputs.push_back(e->src());
+        }
+      }
+      for (const Edge* e : c->in_edges()) {
+        if (e->IsControlEdge()) {
+          d->control_inputs.push_back(e->src());
+        }
+      }
+

add output control edges (coming out) from b and c?

ekuznetsov139

comment created time in 13 days

push event ROCmSoftwarePlatform/tensorflow-upstream

Deven Desai

commit sha 49ba0198f5e816dfe76e6568e0c5826083185933

Adding no_rocm tag to "large" tests that are failing on the ROCm platform. Note that these are not regressions per se. These tests were not being run as part of ROCm CI tests, as only `small` and `medium` tests were being run by default. Now that we are in the process of adding `large` tests to the ROCm CI run, we have uncovered these failures. The failures will be analysed and fixed if necessary in a separate PR.

view details

push time in 13 days

pull request comment ROCmSoftwarePlatform/tensorflow-upstream

Re-enabling some unit-tests that were no_rocm'd after the 200129 weekly sync

upstream PR https://github.com/tensorflow/tensorflow/pull/36698 filed for the rocm_solver fix.

upstream PR https://github.com/tensorflow/tensorflow/pull/36341 updated with the other commits

deven-amd

comment created time in 13 days

push event ROCmSoftwarePlatform/tensorflow-upstream

Deven Desai

commit sha bc54976882890031919bc277b1658650e06de63e

dropping the no_rocm tag for tests that will be fixed by PR 36698

view details

Deven Desai

commit sha b9f54fc1d3cb8f45ecef8372d96f03c04ad351d2

skipping subtests within //tensorflow/python/keras/layers:wrappers_test which test features that are currently not supported by the MIOpen RNN implementation. Removing the no_rocm tag from the test, since it will pass once the failing subtests are skipped.

view details

push time in 13 days

push event ROCmSoftwarePlatform/tensorflow-upstream

Deven Desai

commit sha 08d58a06af26c3be0c3ee84aef198901b7f2dab8

fixing a bug in the rocm_solver implementation, and removing the no_rocm tag from tests that pass as a result of the fix

view details

Deven Desai

commit sha 2435b149a61f924ef4bf1ddae2b1959789df8cd0

skipping subtests within //tensorflow/python/keras/layers:wrappers_test which test features that are currently not supported by the MIOpen RNN implementation. Removing the no_rocm tag from the test, since it will pass once the failing subtests are skipped.

view details

Deven Desai

commit sha 854b76be2ee7ff870e2df7f1d17e2e2c595c3c55

Merge pull request #861 from ROCmSoftwarePlatform/develop-upstream-deven-unit-test-fix-200211

Re-enabling some unit-tests that were no_rocm'd after the 200129 weekly sync

view details

push time in 13 days

pull request comment ROCmSoftwarePlatform/tensorflow-upstream

Re-enabling some unit-tests that were no_rocm'd after the 200129 weekly sync

jenkins : retest rocm-xla please

deven-amd

comment created time in 14 days

pull request comment ROCmSoftwarePlatform/tensorflow-upstream

Re-enabling some unit-tests that were no_rocm'd after the 200129 weekly sync

the flaky test //tensorflow/compiler/xla/tests:dynamic_ops_test_gpu strikes again in rocm-xla

http://ml-ci.amd.com:21096/job/develop-upstream-unit-tests/job/tensorflow-upstream-unit-tests-rocm-xla/1116/console

deven-amd

comment created time in 14 days

PR opened tensorflow/tensorflow

[ROCm] Fix for a test failure on the ROCm platform - 200211 - 1

The IR emitted by the AMDGPU backend for the atomic compare-and-swap operation seems to have changed recently (perhaps due to a new LLVM version being picked up by the recent frequent updates to the LLVM repo commit pointer).

The new IR results in a failure of a couple of subtests within the test //tensorflow/compiler/xla/service/gpu/tests:gpu_kernel_tiling_test.

This PR updates the regexp within the FileCheck pattern to account for the updated IR generation.


/cc @whchung @jerryyin @cheshire

+6 -6

0 comment

1 changed file

pr created time in 14 days

pull request comment ROCmSoftwarePlatform/tensorflow-upstream

Develop upstream sync 200210

@jerryyin let me take a peek

jerryyin

comment created time in 14 days

push event ROCmSoftwarePlatform/tensorflow-upstream

Trent Lo

commit sha b33c788a2f479a4753f49b566f08079692c75af2

Implement horizontal fusion. - It reduces kernel launch overhead and increases launch dims by horizontally fusing independent computations.

view details

Trent Lo

commit sha cd68827e01d454937399bafcdb1eb4b9a116678a

Minor cleanup for horizontal fusion.

view details

Trent Lo

commit sha cb9ab8bee96530c9973d5e295b53d936cbf8ef72

Polishing coding style and comments.

view details

Trent Lo

commit sha 1876f2acc02dee840b3a8b6ab59f950b5a3bbf4f

Factor out lambdas in HorizontalFusionImpl.

view details

Trent Lo

commit sha 86bd5bf3e75cb5d14d24194a2d1e2d8f60753b03

Comment polishing.

view details

Trent Lo

commit sha 474e79985f722afa57d12447fb2f4dc30e890d06

Add some more unittests for horizontal fusion. In addition, we record the execution time of the tests here, showing the optimization effects of horizontal fusion, measured by --xla_hlo_profile. The accumulated kernel execution time in GradientDescentOptimizerLike is reduced from 2.39ms to 311us; the execution time in RMSProp is reduced from 980us to 112us.

Before horizontal fusion:
2019-12-10 22:05:45.215015: I tensorflow/compiler/xla/service/executable.cc:208] Execution profile for GradientDescentOptimizerLike: (2.39 ms @ f_nom)
2019-12-10 22:05:48.877372: I tensorflow/compiler/xla/service/executable.cc:208] Execution profile for RMSPropLike: (980 us @ f_nom)

After horizontal fusion:
2019-12-10 22:05:03.831600: I tensorflow/compiler/xla/service/executable.cc:208] Execution profile for GradientDescentOptimizerLike: (311 us @ f_nom)
2019-12-10 22:05:13.513901: I tensorflow/compiler/xla/service/executable.cc:208] Execution profile for RMSPropLike: (112 us @ f_nom)

view details

Trent Lo

commit sha a629a452bff5b7f7f2688086483d7eb8d3d02420

Polishing comments and coding styles.

view details

Stephan Uphoff

commit sha 7813cb00f35d6fc6d8ad8421021c1535f3e8c029

lite/micro: Add feature buffer to micro_speech example.

This fixes #35117. Accumulate feature slices in a separate buffer. The input tensor is not suitable for keeping state across inferences as it has limited lifetime and the buffer space may be reused.

view details

Trent Lo

commit sha 7abde726e4706df2fa83c2ec3c89ef9fb5c99228

Polish coding styles and comments based on review feedback. In addition, use hlo_matcher to verify resultant DAGs instead of LLVM filecheck.

view details

Trent Lo

commit sha 5f5aa78f86a43d073663cc0f96acb3926d621e42

Merge branch 'upstream_master_dec19' into horizontal_fusion_github

view details

Eugene Kuznetsov

commit sha 968a674ecb6db34e5d2e09068a8d9ca5ca4e3e24

Enable //tensorflow/python:stateful_random_ops_test

view details

Eugene Kuznetsov

commit sha f7b28191777b6ae86c0dbdab7a74b8370e53eaa8

Fix for //tensorflow/python:stateful_random_ops_test: Pack arguments of UpdateVariableAndFill_Philox into a struct

view details

Eugene Kuznetsov

commit sha eee5851777b842945b12937600b005a58aae0f2c

Fix for //tensorflow/python:stateful_random_ops_test: Move the thread counter into the global namespace

view details

Trent Lo

commit sha 47ba0995d9838e5f9aa634abc59f4569c4a37375

Fix a buildifier format issue.

view details

Trent Lo

commit sha eab6c5e84d44afbfd4e2b80c5dd59a6b090ed3bf

Do not fuse fusions with different output types. It is forbidden as the concatenate (inserted for horizontal fusion) requires the input operands to have the same type.

view details

Trent Lo

commit sha e09da4f3dc39efda5a8e68539bea894c88831143

Minor polishing.

view details

Trent Lo

commit sha e2989a34af44c39624cffa36cf319c66615c2483

Add a helper function GetUniqueOutputTypeOfFusion. It is safer and clearer for getting the unique output type of fusion.

view details

Trent Lo

commit sha 88521dae35d13c7b25b826e5dd4da2f2d26d6013

Minor error message polishing.

view details

Trent Lo

commit sha 30b1943d43ac83fde73069d8546ab6a5c1e68372

Minor comment polishing.

view details

Harry Slatyer

commit sha 2aaa53e21f5d12c6de74a7d73525f9fc227b13bb

Support Identity in tensor_util.constant_value. This looks just the same as StopGradient, since for the purposes of forward-propagated values the two are identical.

view details

push time in 14 days

push event deven-amd/scripts

Deven Desai

commit sha b7c21912eb793cb48837d3f0cc391304cfe6144d

ixt-rack-04 update

view details

push time in 14 days

push event ROCmSoftwarePlatform/tensorflow-upstream

Trent Lo

commit sha b33c788a2f479a4753f49b566f08079692c75af2

Implement horizontal fusion. - It reduces kernel launch overhead and increases launch dims by horizontally fusing independent computations.

view details

Trent Lo

commit sha cd68827e01d454937399bafcdb1eb4b9a116678a

Minor cleanup for horizontal fusion.

view details

Trent Lo

commit sha cb9ab8bee96530c9973d5e295b53d936cbf8ef72

Polishing coding style and comments.

view details

Trent Lo

commit sha 1876f2acc02dee840b3a8b6ab59f950b5a3bbf4f

Factor out lambdas in HorizontalFusionImpl.

view details

Trent Lo

commit sha 86bd5bf3e75cb5d14d24194a2d1e2d8f60753b03

Comment polishing.

view details

Trent Lo

commit sha 474e79985f722afa57d12447fb2f4dc30e890d06

Add some more unittests for horizontal fusion. In addition, we record the execution time of the tests here, showing the optimization effects of horizontal fusion, measured by --xla_hlo_profile. The accumulated kernel execution time in GradientDescentOptimizerLike is reduced from 2.39ms to 311us; the execution time in RMSProp is reduced from 980us to 112us.

Before horizontal fusion:
2019-12-10 22:05:45.215015: I tensorflow/compiler/xla/service/executable.cc:208] Execution profile for GradientDescentOptimizerLike: (2.39 ms @ f_nom)
2019-12-10 22:05:48.877372: I tensorflow/compiler/xla/service/executable.cc:208] Execution profile for RMSPropLike: (980 us @ f_nom)

After horizontal fusion:
2019-12-10 22:05:03.831600: I tensorflow/compiler/xla/service/executable.cc:208] Execution profile for GradientDescentOptimizerLike: (311 us @ f_nom)
2019-12-10 22:05:13.513901: I tensorflow/compiler/xla/service/executable.cc:208] Execution profile for RMSPropLike: (112 us @ f_nom)

view details

Trent Lo

commit sha a629a452bff5b7f7f2688086483d7eb8d3d02420

Polishing comments and coding styles.

view details

Stephan Uphoff

commit sha 7813cb00f35d6fc6d8ad8421021c1535f3e8c029

lite/micro: Add feature buffer to micro_speech example.

This fixes #35117. Accumulate feature slices in a separate buffer. The input tensor is not suitable for keeping state across inferences as it has limited lifetime and the buffer space may be reused.

view details

Trent Lo

commit sha 7abde726e4706df2fa83c2ec3c89ef9fb5c99228

Polish coding styles and comments based on review feedback. In addition, use hlo_matcher to verify resultant DAGs instead of LLVM filecheck.

view details

Trent Lo

commit sha 5f5aa78f86a43d073663cc0f96acb3926d621e42

Merge branch 'upstream_master_dec19' into horizontal_fusion_github

view details

Eugene Kuznetsov

commit sha 968a674ecb6db34e5d2e09068a8d9ca5ca4e3e24

Enable //tensorflow/python:stateful_random_ops_test

view details

Eugene Kuznetsov

commit sha f7b28191777b6ae86c0dbdab7a74b8370e53eaa8

Fix for //tensorflow/python:stateful_random_ops_test: Pack arguments of UpdateVariableAndFill_Philox into a struct

view details

Eugene Kuznetsov

commit sha eee5851777b842945b12937600b005a58aae0f2c

Fix for //tensorflow/python:stateful_random_ops_test: Move the thread counter into the global namespace

view details

Trent Lo

commit sha 47ba0995d9838e5f9aa634abc59f4569c4a37375

Fix a buildifier format issue.

view details

Trent Lo

commit sha eab6c5e84d44afbfd4e2b80c5dd59a6b090ed3bf

Do not fuse fusions with different output types. It is forbidden as the concatenate (inserted for horizontal fusion) requires the input operands to have the same type.

view details

Trent Lo

commit sha e09da4f3dc39efda5a8e68539bea894c88831143

Minor polishing.

view details

Trent Lo

commit sha e2989a34af44c39624cffa36cf319c66615c2483

Add a help function GetUniqueOutputTypeOfFusion. It is safer and clearer for getting the unique output type of fusion.

view details

Trent Lo

commit sha 88521dae35d13c7b25b826e5dd4da2f2d26d6013

Minor error message polishing.

view details

Trent Lo

commit sha 30b1943d43ac83fde73069d8546ab6a5c1e68372

Minor comment polishing.

view details

Harry Slatyer

commit sha 2aaa53e21f5d12c6de74a7d73525f9fc227b13bb

Support Identity in tensor_util.constant_value. This looks just the same as StopGradient, since for the purposes of forward-propagated values the two are identical.

view details

push time in 14 days

push eventROCmSoftwarePlatform/tensorflow-upstream

Trent Lo

commit sha b33c788a2f479a4753f49b566f08079692c75af2

Implement horizontal fusion. - It reduces kernel launch overhead and increases lauch dims by horizontally fusing indepedent computations.

view details

Trent Lo

commit sha cd68827e01d454937399bafcdb1eb4b9a116678a

Minor cleanup for horizontal fusion.

view details

Trent Lo

commit sha cb9ab8bee96530c9973d5e295b53d936cbf8ef72

Polishing coding style and comments.

view details

Trent Lo

commit sha 1876f2acc02dee840b3a8b6ab59f950b5a3bbf4f

Factor out lambdas in HorizontalFusionImpl.

view details

Trent Lo

commit sha 86bd5bf3e75cb5d14d24194a2d1e2d8f60753b03

Comment polishing.

view details

Trent Lo

commit sha 474e79985f722afa57d12447fb2f4dc30e890d06

Add some more unittests for horizontal fusion. In addition, we record the execution time of the tests here, showing the optimization effects of horizontal fusion, measured by --xla_hlo_profile. The accumulated kernel execution time in GradientDescentOptimizerLike is reduced from 2.39ms to 311us; the execution time in RMSProp is reduced from 980us to 112us. Before horizontal fusion: 2019-12-10 22:05:45.215015: I tensorflow/compiler/xla/service/executable.cc:208] Execution profile for GradientDescentOptimizerLike: (2.39 ms @ f_nom) 2019-12-10 22:05:48.877372: I tensorflow/compiler/xla/service/executable.cc:208] Execution profile for RMSPropLike: (980 us @ f_nom) After horizontal fusion: 2019-12-10 22:05:03.831600: I tensorflow/compiler/xla/service/executable.cc:208] Execution profile for GradientDescentOptimizerLike: (311 us @ f_nom) 2019-12-10 22:05:13.513901: I tensorflow/compiler/xla/service/executable.cc:208] Execution profile for RMSPropLike: (112 us @ f_nom)

view details

Trent Lo

commit sha a629a452bff5b7f7f2688086483d7eb8d3d02420

Polishing comments and coding styles.

view details

Stephan Uphoff

commit sha 7813cb00f35d6fc6d8ad8421021c1535f3e8c029

lite/micro: Add feature buffer to micro_speech example. This fixes #35117 Accumulate feature slices in separate buffer. The input tensor is not suitable for keeping state across interference as it has limited lifetime and the buffer space may be reused.

view details

Trent Lo

commit sha 7abde726e4706df2fa83c2ec3c89ef9fb5c99228

Polish coding styles and comments based on review feedback. In addition, use hlo_matcher to verify resultant DAGs instead of LLVM filecheck.

view details

Trent Lo

commit sha 5f5aa78f86a43d073663cc0f96acb3926d621e42

Merge branch 'upstream_master_dec19' into horizontal_fusion_github

view details

Eugene Kuznetsov

commit sha 968a674ecb6db34e5d2e09068a8d9ca5ca4e3e24

Enable //tensorflow/python:stateful_random_ops_test

view details

Eugene Kuznetsov

commit sha f7b28191777b6ae86c0dbdab7a74b8370e53eaa8

Fix for //tensorflow/python:stateful_random_ops_test: Pack arguments of UpdateVariableAndFill_Philox into a struct

view details

Eugene Kuznetsov

commit sha eee5851777b842945b12937600b005a58aae0f2c

Fix for //tensorflow/python:stateful_random_ops_test: Move the thread counter into the global namespace

view details

Trent Lo

commit sha 47ba0995d9838e5f9aa634abc59f4569c4a37375

Fix a buildifier format issue.

view details

Trent Lo

commit sha eab6c5e84d44afbfd4e2b80c5dd59a6b090ed3bf

Do not fuse fusions with different output types. It is forbidden as the concatenate (inserted for horizontal fusion) requires the input operands to have the same type.

view details

Trent Lo

commit sha e09da4f3dc39efda5a8e68539bea894c88831143

Minor polishing.

view details

Trent Lo

commit sha e2989a34af44c39624cffa36cf319c66615c2483

Add a help function GetUniqueOutputTypeOfFusion. It is safer and clearer for getting the unique output type of fusion.

view details

Trent Lo

commit sha 88521dae35d13c7b25b826e5dd4da2f2d26d6013

Minor error message polishing.

view details

Trent Lo

commit sha 30b1943d43ac83fde73069d8546ab6a5c1e68372

Minor comment polishing.

view details

Harry Slatyer

commit sha 2aaa53e21f5d12c6de74a7d73525f9fc227b13bb

Support Identity in tensor_util.constant_value. This looks just the same as StopGradient, since for the purposes of forward-propagated values the two are identical.

view details

push time in 14 days

push eventROCmSoftwarePlatform/tensorflow-upstream

Trent Lo

commit sha b33c788a2f479a4753f49b566f08079692c75af2

Implement horizontal fusion. - It reduces kernel launch overhead and increases lauch dims by horizontally fusing indepedent computations.

view details

Trent Lo

commit sha cd68827e01d454937399bafcdb1eb4b9a116678a

Minor cleanup for horizontal fusion.

view details

Trent Lo

commit sha cb9ab8bee96530c9973d5e295b53d936cbf8ef72

Polishing coding style and comments.

view details

Trent Lo

commit sha 1876f2acc02dee840b3a8b6ab59f950b5a3bbf4f

Factor out lambdas in HorizontalFusionImpl.

view details

Trent Lo

commit sha 86bd5bf3e75cb5d14d24194a2d1e2d8f60753b03

Comment polishing.

view details

Trent Lo

commit sha 474e79985f722afa57d12447fb2f4dc30e890d06

Add some more unittests for horizontal fusion. In addition, we record the execution time of the tests here, showing the optimization effects of horizontal fusion, measured by --xla_hlo_profile. The accumulated kernel execution time in GradientDescentOptimizerLike is reduced from 2.39ms to 311us; the execution time in RMSProp is reduced from 980us to 112us. Before horizontal fusion: 2019-12-10 22:05:45.215015: I tensorflow/compiler/xla/service/executable.cc:208] Execution profile for GradientDescentOptimizerLike: (2.39 ms @ f_nom) 2019-12-10 22:05:48.877372: I tensorflow/compiler/xla/service/executable.cc:208] Execution profile for RMSPropLike: (980 us @ f_nom) After horizontal fusion: 2019-12-10 22:05:03.831600: I tensorflow/compiler/xla/service/executable.cc:208] Execution profile for GradientDescentOptimizerLike: (311 us @ f_nom) 2019-12-10 22:05:13.513901: I tensorflow/compiler/xla/service/executable.cc:208] Execution profile for RMSPropLike: (112 us @ f_nom)

view details

Trent Lo

commit sha a629a452bff5b7f7f2688086483d7eb8d3d02420

Polishing comments and coding styles.

view details

Stephan Uphoff

commit sha 7813cb00f35d6fc6d8ad8421021c1535f3e8c029

lite/micro: Add feature buffer to micro_speech example. This fixes #35117 Accumulate feature slices in separate buffer. The input tensor is not suitable for keeping state across interference as it has limited lifetime and the buffer space may be reused.

view details

Trent Lo

commit sha 7abde726e4706df2fa83c2ec3c89ef9fb5c99228

Polish coding styles and comments based on review feedback. In addition, use hlo_matcher to verify resultant DAGs instead of LLVM filecheck.

view details

Trent Lo

commit sha 5f5aa78f86a43d073663cc0f96acb3926d621e42

Merge branch 'upstream_master_dec19' into horizontal_fusion_github

view details

Eugene Kuznetsov

commit sha 968a674ecb6db34e5d2e09068a8d9ca5ca4e3e24

Enable //tensorflow/python:stateful_random_ops_test

view details

Eugene Kuznetsov

commit sha f7b28191777b6ae86c0dbdab7a74b8370e53eaa8

Fix for //tensorflow/python:stateful_random_ops_test: Pack arguments of UpdateVariableAndFill_Philox into a struct

view details

Eugene Kuznetsov

commit sha eee5851777b842945b12937600b005a58aae0f2c

Fix for //tensorflow/python:stateful_random_ops_test: Move the thread counter into the global namespace

view details

Trent Lo

commit sha 47ba0995d9838e5f9aa634abc59f4569c4a37375

Fix a buildifier format issue.

view details

Trent Lo

commit sha eab6c5e84d44afbfd4e2b80c5dd59a6b090ed3bf

Do not fuse fusions with different output types. It is forbidden as the concatenate (inserted for horizontal fusion) requires the input operands to have the same type.

view details

Trent Lo

commit sha e09da4f3dc39efda5a8e68539bea894c88831143

Minor polishing.

view details

Trent Lo

commit sha e2989a34af44c39624cffa36cf319c66615c2483

Add a help function GetUniqueOutputTypeOfFusion. It is safer and clearer for getting the unique output type of fusion.

view details

Trent Lo

commit sha 88521dae35d13c7b25b826e5dd4da2f2d26d6013

Minor error message polishing.

view details

Trent Lo

commit sha 30b1943d43ac83fde73069d8546ab6a5c1e68372

Minor comment polishing.

view details

Harry Slatyer

commit sha 2aaa53e21f5d12c6de74a7d73525f9fc227b13bb

Support Identity in tensor_util.constant_value. This looks just the same as StopGradient, since for the purposes of forward-propagated values the two are identical.

view details

push time in 14 days

push eventROCmSoftwarePlatform/tensorflow-upstream

Trent Lo

commit sha b33c788a2f479a4753f49b566f08079692c75af2

Implement horizontal fusion. - It reduces kernel launch overhead and increases lauch dims by horizontally fusing indepedent computations.

view details

Trent Lo

commit sha cd68827e01d454937399bafcdb1eb4b9a116678a

Minor cleanup for horizontal fusion.

view details

Trent Lo

commit sha cb9ab8bee96530c9973d5e295b53d936cbf8ef72

Polishing coding style and comments.

view details

Trent Lo

commit sha 1876f2acc02dee840b3a8b6ab59f950b5a3bbf4f

Factor out lambdas in HorizontalFusionImpl.

view details

Trent Lo

commit sha 86bd5bf3e75cb5d14d24194a2d1e2d8f60753b03

Comment polishing.

view details

Trent Lo

commit sha 474e79985f722afa57d12447fb2f4dc30e890d06

Add some more unittests for horizontal fusion. In addition, we record the execution time of the tests here, showing the optimization effects of horizontal fusion, measured by --xla_hlo_profile. The accumulated kernel execution time in GradientDescentOptimizerLike is reduced from 2.39ms to 311us; the execution time in RMSProp is reduced from 980us to 112us. Before horizontal fusion: 2019-12-10 22:05:45.215015: I tensorflow/compiler/xla/service/executable.cc:208] Execution profile for GradientDescentOptimizerLike: (2.39 ms @ f_nom) 2019-12-10 22:05:48.877372: I tensorflow/compiler/xla/service/executable.cc:208] Execution profile for RMSPropLike: (980 us @ f_nom) After horizontal fusion: 2019-12-10 22:05:03.831600: I tensorflow/compiler/xla/service/executable.cc:208] Execution profile for GradientDescentOptimizerLike: (311 us @ f_nom) 2019-12-10 22:05:13.513901: I tensorflow/compiler/xla/service/executable.cc:208] Execution profile for RMSPropLike: (112 us @ f_nom)

view details

Trent Lo

commit sha a629a452bff5b7f7f2688086483d7eb8d3d02420

Polishing comments and coding styles.

view details

Stephan Uphoff

commit sha 7813cb00f35d6fc6d8ad8421021c1535f3e8c029

lite/micro: Add feature buffer to micro_speech example. This fixes #35117 Accumulate feature slices in separate buffer. The input tensor is not suitable for keeping state across interference as it has limited lifetime and the buffer space may be reused.

view details

Trent Lo

commit sha 7abde726e4706df2fa83c2ec3c89ef9fb5c99228

Polish coding styles and comments based on review feedback. In addition, use hlo_matcher to verify resultant DAGs instead of LLVM filecheck.

view details

Trent Lo

commit sha 5f5aa78f86a43d073663cc0f96acb3926d621e42

Merge branch 'upstream_master_dec19' into horizontal_fusion_github

view details

Eugene Kuznetsov

commit sha 968a674ecb6db34e5d2e09068a8d9ca5ca4e3e24

Enable //tensorflow/python:stateful_random_ops_test

view details

Eugene Kuznetsov

commit sha f7b28191777b6ae86c0dbdab7a74b8370e53eaa8

Fix for //tensorflow/python:stateful_random_ops_test: Pack arguments of UpdateVariableAndFill_Philox into a struct

view details

Eugene Kuznetsov

commit sha eee5851777b842945b12937600b005a58aae0f2c

Fix for //tensorflow/python:stateful_random_ops_test: Move the thread counter into the global namespace

view details

Trent Lo

commit sha 47ba0995d9838e5f9aa634abc59f4569c4a37375

Fix a buildifier format issue.

view details

Trent Lo

commit sha eab6c5e84d44afbfd4e2b80c5dd59a6b090ed3bf

Do not fuse fusions with different output types. It is forbidden as the concatenate (inserted for horizontal fusion) requires the input operands to have the same type.

view details

Trent Lo

commit sha e09da4f3dc39efda5a8e68539bea894c88831143

Minor polishing.

view details

Trent Lo

commit sha e2989a34af44c39624cffa36cf319c66615c2483

Add a help function GetUniqueOutputTypeOfFusion. It is safer and clearer for getting the unique output type of fusion.

view details

Trent Lo

commit sha 88521dae35d13c7b25b826e5dd4da2f2d26d6013

Minor error message polishing.

view details

Trent Lo

commit sha 30b1943d43ac83fde73069d8546ab6a5c1e68372

Minor comment polishing.

view details

Harry Slatyer

commit sha 2aaa53e21f5d12c6de74a7d73525f9fc227b13bb

Support Identity in tensor_util.constant_value. This looks just the same as StopGradient, since for the purposes of forward-propagated values the two are identical.

view details

push time in 14 days

push event ROCmSoftwarePlatform/tensorflow-upstream

Deven Desai

commit sha 6e38ee4a1d7ac3222f894edfba3fd622a073ec41

Porting bfloat16 support to master-rocm-enhanced

view details

push time in 14 days

push event ROCmSoftwarePlatform/tensorflow-upstream

TensorFlower Gardener

commit sha b83cc0ac8eb83c85b78778bbe8a0c96f323d747e

Merge pull request #22231 from MichaelKonobeev:sparse-xent-op-hessian
PiperOrigin-RevId: 260802377

view details

MichaelKonobeev

commit sha ea809e3ad7c0d8a1fc1170dec6c782c7feac299b

Implement IsZero in eager mode

view details

Koan-Sin Tan

commit sha e305ac4b75a9523bf047fdaef75159f13bd04b86

[tflite] add int8 input/output to label_image. More and more models, such as MobilenetV3's EdgeTPU ones, are using post-training full integer quantization. With this patch, I can get reasonable results.
./label_image_int8 -m mobilenet_edgetpu_224_1.0_int8.tflite
Loaded model mobilenet_edgetpu_224_1.0_int8.tflite
resolved reporter
INFO: Initialized TensorFlow Lite runtime.
invoked
average time: 15.363 ms
0.867188: 653 military uniform
0.0390625: 835 suit
0.015625: 458 bow tie
0.0078125: 907 Windsor tie
0.00390625: 716 pickelhaube

view details

Koan-Sin Tan

commit sha b698e34e97ba49aa2d562a42804476ab5a024ab0

clean up

view details

Koan-Sin Tan

commit sha ec72fed7066f44d09172b3cfa358a299b7e5ec12

address review comments
1. add explicit cast back
2. change int to TfLiteType

view details

MichaelKonobeev

commit sha 6fe6391ea937a3c20308b3986f7232967e6f0268

Unconditionally tag zero tensors

view details

MichaelKonobeev

commit sha b187faf53c68ff9b0c711b246116fb81660ad4c7

Remove expired forward compatibility check

view details

MichaelKonobeev

commit sha cb9ce8a40c41d35725900f0f0e12a934e28ba837

Merge branch 'master' into sparse-xent-op-hessian

view details

RichardXiao13

commit sha f8a15ce2b6f48523effe2dd42e7844ea7ef1d97a

Add usage example to math.poly_val

view details

RichardXiao13

commit sha 37b8d190b935d128c260c8a6acb871cd64748736

Update math_ops.py

view details

Richard Xiao

commit sha 3a63696e3b417603830131f989865a6f5b141482

Update math_ops.py

view details

William-Yin123

commit sha 691b55ff4d66e27f5a669d6955268f86627454af

committing updated docs

view details

Richard Xiao

commit sha 5bb949619601f1b9d49a2d73b7d918103a1d421d

Update math_ops.py

view details

William-Yin123

commit sha 5fa409d7fb78d7ee91699808591f2532eac4bdbd

test

view details

William-Yin123

commit sha bf4433f8c784a29c091dbc23bc45c7f74c47e6c9

committing updated docs

view details

William-Yin123

commit sha eed98d69ac410750fd6789557544e27344af4638

removed ids and shortened lines

view details

William-Yin123

commit sha 8ceebbaa72df62b43201597a3bde6d1d4ca2077f

fixed string_hash_bucket usage example

view details

Richard Xiao

commit sha b0724d98219ae4c44cea51afab3ca004175b8f3f

Update math_ops.py

view details

William-Yin123

commit sha 52bba1f2e009354dbfd9a41882910b61b0b16fd7

fixed incorrect doctests

view details

MichaelKonobeev

commit sha 7be1d208ffbe4573f91d6d51d6dc3b88de0fe61d

Use make_decorator for tag_zeros_tensor

view details

push time in 14 days

pull request comment tensorflow/tensorflow

[ROCm] Support of GRU and LSTM

@qlzh727, please re-approve. The previous CI run ran into pylint errors. I have pushed out a commit to fix them. Thanks.

ekuznetsov139

comment created time in 15 days

push event ROCmSoftwarePlatform/tensorflow-upstream

Deven Desai

commit sha 5f8828b16ed35dfdc448975443175829eb599ca3

fixing pylint errors

view details

push time in 15 days

pull request comment tensorflow/tensorflow

[ROCm] Unit-test updates for the ROCm platform.

@gbaned @chsigg gentle ping

deven-amd

comment created time in 15 days

pull request comment tensorflow/tensorflow

[ROCm] Fix the ROCm CSB breakage - 200110

@rthadur , gentle ping

deven-amd

comment created time in 15 days

pull request comment tensorflow/tensorflow

[ROCm] Reverting ROCm to use MIOpen Find Mode APIs (be default) for convolution

@gbaned, gentle ping

deven-amd

comment created time in 15 days

Pull request review comment tensorflow/tensorflow

[ROCm] Support of GRU and LSTM

 def lstm_with_backend_selection(inputs, init_h, init_c, kernel,
     sequence_lengths: The lengths of all sequences coming from a variable length
       input, such as ragged tensors. If the input has a fixed timestep size,
       this should be None.
-    zero_output_for_mask: Boolean, whether to output zero for masked timestep.
+
+    put_for_mask: Boolean, whether to output zero for masked timestep.

that was by accident; the change has been dropped

ekuznetsov139

comment created time in 15 days

Pull request review comment tensorflow/tensorflow

[ROCm] Support of GRU and LSTM

 def lstm_with_backend_selection(inputs, init_h, init_c, kernel,
   }

   def gpu_lstm_with_fallback(inputs, init_h, init_c, kernel, recurrent_kernel,
-                             bias, mask, time_major, go_backwards, activation,
-                             recurrent_activation, sequence_lengths):
+                               bias, mask, time_major, go_backwards, activation,

indentation fixed

ekuznetsov139

comment created time in 15 days

Pull request review comment tensorflow/tensorflow

[ROCm] Support of GRU and LSTM

 def gru_with_backend_selection(inputs, init_h, kernel, recurrent_kernel, bias,
   }

   def gpu_gru_with_fallback(inputs, init_h, kernel, recurrent_kernel, bias,
-                            mask, time_major, go_backwards, activation,
-                            recurrent_activation, sequence_lengths):
+                              mask, time_major, go_backwards, activation,

indentation fixed

ekuznetsov139

comment created time in 15 days

pull request comment tensorflow/tensorflow

[ROCm] Support of GRU and LSTM

@googlebot I consent

ekuznetsov139

comment created time in 15 days

push event ROCmSoftwarePlatform/tensorflow-upstream

Deven Desai

commit sha d26ee9801c8117f7fd6297a05a82eab98023a2c3

bug fix in the ROCm python implementation for gpu_lstm op

view details

Deven Desai

commit sha eb713b7448f61c610850ded6113ee7eace764fd3

addressing code-review comments

view details

Deven Desai

commit sha f026e707d34db8c8f06e48eba699a2cb6bf2ecde

Skipping unsupported subtests within the tests lstm_test and gru_test

view details

push time in 15 days

pull request comment tensorflow/tensorflow

Fix concatenation_test

@jerryyin

I should have given you a heads-up on this one.

I filed an upstream PR last week for this issue: https://github.com/tensorflow/tensorflow/pull/36558

If only the TF guys were prompt in their PR handling, it would never have made it downstream to us!

jerryyin

comment created time in 15 days

PR opened tensorflow/tensorflow

[ROCm] Fix for a test failure on the ROCm platform - 200210 - 1

The test //tensorflow/python/distribute:mirrored_strategy_test_gpu currently fails on the ROCm platform. It has one failing subtest (testFuctionPreservesAutoGraph in __main__.FunctionTest), which fails with the following error:

Traceback (most recent call last):
  File "/root/.cache/bazel/_bazel_root/efb88f6336d9c4a18216fb94287b8d97/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/python/distribute/mirrored_strategy_test_gpu.runfiles/org_tensorflow/tensorflow/python/distribute/mirrored_strategy_test.py", line 1369, in testFuctionPreservesAutoGraph
    [context.LogicalDeviceConfiguration()] * 2)
  File "/root/.cache/bazel/_bazel_root/efb88f6336d9c4a18216fb94287b8d97/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/python/distribute/mirrored_strategy_test_gpu.runfiles/org_tensorflow/tensorflow/python/framework/config.py", line 609, in set_logical_device_configuration
    context.context().set_logical_device_configuration(device, logical_devices)
  File "/root/.cache/bazel/_bazel_root/efb88f6336d9c4a18216fb94287b8d97/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/python/distribute/mirrored_strategy_test_gpu.runfiles/org_tensorflow/tensorflow/python/eager/context.py", line 1326, in set_logical_device_configuration
    "Virtual devices cannot be modified after being initialized")
RuntimeError: Virtual devices cannot be modified after being initialized

The failure does not seem to be ROCm specific, and the error seems legit, so I am not sure how to go about "fixing" it.

Skipping the subtest on the ROCm platform for now.
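
For context, a minimal sketch (not the exact upstream patch) of how such a subtest is typically skipped on ROCm; test.is_built_with_rocm() is the same predicate used by the other ROCm skips in this series, and the subtest body is elided:

from tensorflow.python.platform import test


class FunctionTest(test.TestCase):

  def testFuctionPreservesAutoGraph(self):
    if test.is_built_with_rocm():
      # Reconfiguring logical (virtual) devices after the GPU has been
      # initialized raises the RuntimeError shown above; skip on ROCm.
      self.skipTest('virtual device reconfiguration not supported here')
    # ... original subtest body ...


if __name__ == '__main__':
  test.main()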


/cc @whchung @chsigg

+6 -0

0 comment

1 changed file

pr created time in 15 days

Pull request review comment tensorflow/tensorflow

[ROCm] Fix for a test regression on the ROCm platform - 200207 - 2

 def test_profile(self):
     profile_pb.ParseFromString(profile_result)
     devices = frozenset(device.name for device in profile_pb.devices.values())
     self.assertIn('/host:CPU', devices)
-    if config.list_physical_devices('GPU'):
+    if not test.is_built_with_rocm() and config.list_physical_devices('GPU'):
+      # device tracing is not yet supported on the ROCm platform
       self.assertIn('/device:GPU:0', devices)
     events = frozenset(event.name for event in profile_pb.trace_events)
     self.assertIn('three_times_five', events)

@rthadur

the error you have mentioned

AssertionError: 'Mul:Mul' not found in frozenset({'ExecutorState::Process', 'three_times_five', 'mul:Mul', 'SessionRun', 'ExecutorDoneCallback'})

seems to be unrelated to the change in this PR.

The change in this PR merely skips the check on line 51/53 on the ROCm platform.

The assert that is firing is on line 56/58, and the reason seems to be a change in the event name (mul:Mul vs Mul:Mul). That does not seem related to the change in this PR.

deven-amd

comment created time in 15 days

PR opened tensorflow/tensorflow

[ROCm] Fix for a test regression on the ROCm platform - 200207 - 2

The following commit introduces a test regression on the ROCm platform: https://github.com/tensorflow/tensorflow/commit/7a931a2349591f4e2250ac2d3b6c3ca66538b740

That commit adds an explicit check for GPU device in the profiler output (if a GPU is present in the list of physical devices).

Since the ROCm platform does not yet support device tracing, this test now fails on the ROCm platform.

The "fix" (until ROCm adds support for device tracing) is to disable that check on the ROCm platform.


/cc @whchung @chsigg

+3 -1

0 comment

1 changed file

pr created time in 18 days

PR opened tensorflow/tensorflow

[ROCm] Fix for a test regression on the ROCm platform

The following commit introduces a test regression on the ROCm platform: https://github.com/tensorflow/tensorflow/pull/36058/commits/a2aa5e3f045a5916b20a63b58f824ed59710845a

After the commit, the test fails to build with the following error.

ERROR: /root/tensorflow/tensorflow/lite/kernels/BUILD:846:1: Couldn't build file tensorflow/lite/kernels/_objs/concatenation_test/concatenation_test.o: C++ compilation of rule '//tensorflow/lite/kernels:concatenation_test' failed (Exit 1)
tensorflow/lite/kernels/concatenation_test.cc:276:26: error: expected ';' at end of member declaration
       std::is_same<Type, int16_t>::value ? TensorType_INT16 : TensorType_INT8;
                          ^
tensorflow/lite/kernels/concatenation_test.cc:276:33: error: expected unqualified-id before '>' token
       std::is_same<Type, int16_t>::value ? TensorType_INT16 : TensorType_INT8;
                                 ^
tensorflow/lite/kernels/concatenation_test.cc:276:20: error: wrong number of template arguments (1, should be 2)
       std::is_same<Type, int16_t>::value ? TensorType_INT16 : TensorType_INT8;
                    ^
In file included from /usr/include/c++/5/bits/move.h:57:0,
                 from /usr/include/c++/5/bits/stl_pair.h:59,
                 from /usr/include/c++/5/bits/stl_algobase.h:64,
                 from /usr/include/c++/5/memory:62,
                 from external/com_google_googletest/googletest/include/gtest/gtest.h:56,
                 from tensorflow/lite/kernels/concatenation_test.cc:17:
/usr/include/c++/5/type_traits:958:12: note: provided for 'template<class, class> struct std::is_same'
     struct is_same;
            ^

The fix is to put parens around the RHS expression, to fix what seems to be a parsing error. I don't think this error is ROCm specific.


/cc @whchung @cheshire @chsigg

+1 -1

0 comment

1 changed file

pr created time in 18 days

push event ROCmSoftwarePlatform/tensorflow-upstream

Trent Lo

commit sha b33c788a2f479a4753f49b566f08079692c75af2

Implement horizontal fusion. - It reduces kernel launch overhead and increases launch dims by horizontally fusing independent computations.

view details

Trent Lo

commit sha cd68827e01d454937399bafcdb1eb4b9a116678a

Minor cleanup for horizontal fusion.

view details

Trent Lo

commit sha cb9ab8bee96530c9973d5e295b53d936cbf8ef72

Polishing coding style and comments.

view details

Trent Lo

commit sha 1876f2acc02dee840b3a8b6ab59f950b5a3bbf4f

Factor out lambdas in HorizontalFusionImpl.

view details

Trent Lo

commit sha 86bd5bf3e75cb5d14d24194a2d1e2d8f60753b03

Comment polishing.

view details

Trent Lo

commit sha 474e79985f722afa57d12447fb2f4dc30e890d06

Add some more unittests for horizontal fusion. In addition, we record the execution time of the tests here, showing the optimization effects of horizontal fusion, measured by --xla_hlo_profile. The accumulated kernel execution time in GradientDescentOptimizerLike is reduced from 2.39ms to 311us; the execution time in RMSProp is reduced from 980us to 112us.
Before horizontal fusion:
2019-12-10 22:05:45.215015: I tensorflow/compiler/xla/service/executable.cc:208] Execution profile for GradientDescentOptimizerLike: (2.39 ms @ f_nom)
2019-12-10 22:05:48.877372: I tensorflow/compiler/xla/service/executable.cc:208] Execution profile for RMSPropLike: (980 us @ f_nom)
After horizontal fusion:
2019-12-10 22:05:03.831600: I tensorflow/compiler/xla/service/executable.cc:208] Execution profile for GradientDescentOptimizerLike: (311 us @ f_nom)
2019-12-10 22:05:13.513901: I tensorflow/compiler/xla/service/executable.cc:208] Execution profile for RMSPropLike: (112 us @ f_nom)

view details

Trent Lo

commit sha a629a452bff5b7f7f2688086483d7eb8d3d02420

Polishing comments and coding styles.

view details

Stephan Uphoff

commit sha 7813cb00f35d6fc6d8ad8421021c1535f3e8c029

lite/micro: Add feature buffer to micro_speech example. This fixes #35117. Accumulate feature slices in a separate buffer. The input tensor is not suitable for keeping state across inference calls, as it has a limited lifetime and the buffer space may be reused.

view details

Trent Lo

commit sha 7abde726e4706df2fa83c2ec3c89ef9fb5c99228

Polish coding styles and comments based on review feedback. In addition, use hlo_matcher to verify resultant DAGs instead of LLVM filecheck.

view details

Trent Lo

commit sha 5f5aa78f86a43d073663cc0f96acb3926d621e42

Merge branch 'upstream_master_dec19' into horizontal_fusion_github

view details

Eugene Kuznetsov

commit sha 968a674ecb6db34e5d2e09068a8d9ca5ca4e3e24

Enable //tensorflow/python:stateful_random_ops_test

view details

Eugene Kuznetsov

commit sha f7b28191777b6ae86c0dbdab7a74b8370e53eaa8

Fix for //tensorflow/python:stateful_random_ops_test: Pack arguments of UpdateVariableAndFill_Philox into a struct

view details

Eugene Kuznetsov

commit sha eee5851777b842945b12937600b005a58aae0f2c

Fix for //tensorflow/python:stateful_random_ops_test: Move the thread counter into the global namespace

view details

Trent Lo

commit sha 47ba0995d9838e5f9aa634abc59f4569c4a37375

Fix a buildifier format issue.

view details

Trent Lo

commit sha eab6c5e84d44afbfd4e2b80c5dd59a6b090ed3bf

Do not fuse fusions with different output types. It is forbidden as the concatenate (inserted for horizontal fusion) requires the input operands to have the same type.

view details

Trent Lo

commit sha e09da4f3dc39efda5a8e68539bea894c88831143

Minor polishing.

view details

Trent Lo

commit sha e2989a34af44c39624cffa36cf319c66615c2483

Add a helper function GetUniqueOutputTypeOfFusion. It is safer and clearer for getting the unique output type of fusion.

view details

Trent Lo

commit sha 88521dae35d13c7b25b826e5dd4da2f2d26d6013

Minor error message polishing.

view details

Trent Lo

commit sha 30b1943d43ac83fde73069d8546ab6a5c1e68372

Minor comment polishing.

view details

Harry Slatyer

commit sha 2aaa53e21f5d12c6de74a7d73525f9fc227b13bb

Support Identity in tensor_util.constant_value. This looks just the same as StopGradient, since for the purposes of forward-propagated values the two are identical.

view details

push time in 18 days

push event deven-amd/scripts

Deven Desai

commit sha 14ca7203f9968cd8fc3cbd79d970206ba9ed3029

prj47-rack-15 update

view details

push time in 18 days

pull request comment ROCmSoftwarePlatform/tensorflow-upstream

[DO NOT MERGE] Enabling fp16 packet optimizations in Eigen

hmm, when I try out Eugene's example

the run with Find Mode

  • is hanging (3 hours and still in the first run) for fp16 mode.
  • took about 30 mins (for the first run) for fp32 mode.

If I switch to Immediate mode, I can see an improvement of about 10% with fp16. I get the same 10% improvement with and without the changes in this PR. That is kinda expected, since we have seen that Eigen kernels are typically memory bound, and enabling vector optimizations does not help their performance.

As for this PR, I first need to upstream the workaround for the LeakyRelu function. If that gets accepted, I will push out the PR to upstream the Eigen change. Once that gets accepted and shows up on the TF side, we can check in the change to drop the extra header includes.


A tangential observation on this PR, @daniellowell:

It turns out that the Immediate Mode rankings for the results returned are not necessarily accurate. In a fresh container (in which we have not yet run this test in Find DB mode), if we run with

  • immediate mode enabled (set TF_ROCM_USE_IMMEDIATE_MODE=1)
  • instruct the immediate mode API to return only the best algo (set TF_ROCM_RETURN_BEST_ALGO_ONLY=1)

then the fp32 run median time is approx 3.1 s.

Now if you let the immediate mode API return all algos and let the TF code pick the best one by profiling them (unset TF_ROCM_RETURN_BEST_ALGO_ONLY), then the fp32 run median time is approx 1.3 s. This implies that the best algo returned by immediate mode might not actually be the best.

Note that once you run with Find Mode enabled, this experiment will no longer reproduce the results above, since the Find Mode run will populate the DB with the correct best algo.
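
For anyone reproducing the experiment, a minimal sketch of the setup; the assumption here is that these TF_ROCM_* variables are read when TensorFlow initializes its ROCm/MIOpen backend, so they must be set before TensorFlow is imported:

import os

os.environ['TF_ROCM_USE_IMMEDIATE_MODE'] = '1'     # enable MIOpen immediate mode
os.environ['TF_ROCM_RETURN_BEST_ALGO_ONLY'] = '1'  # immediate mode returns only its top-ranked algo
# Variant: drop the second variable so the TF code profiles all returned algos itself.
# os.environ.pop('TF_ROCM_RETURN_BEST_ALGO_ONLY', None)

import tensorflow as tf  # import only after the environment is configured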

deven-amd

comment created time in 19 days

push event deven-amd/scripts

Deven Desai

commit sha 53b979752273d3b81e8c2d94923032262458d5ab

prj47-rack-15 update

view details

push time in 19 days

push event ROCmSoftwarePlatform/tensorflow-upstream

Deven Desai

commit sha ee94775a8d9d74711cc0b3fe084df8d15392ce14

PR 782 - [DO NOT MERGE] Porting ROCm Fusion support to master-rocm-enhanced

view details

Deven Desai

commit sha 8dce746e39e0ab237f68315f8edc8d47954f4d4f

PR 783 - [DO NOT MERGE] Porting ROCm bfloat16 support to master-rocm-enhanced

view details

Deven Desai

commit sha d4d1fa316c816850a1f5fe90cb841b7460452191

PR 789 - [DO NOT MERGE] Porting ROCm hipclang support to master-rocm-enhanced

view details

Deven Desai

commit sha c474ad1b84d6bec70b1ab375558ce39ffc303163

PR 790 - [DO NOT MERGE] Porting ROCm Dropout support to master-rocm-enhanced

view details

Deven Desai

commit sha c8d28d8a5fb4f76723cdef82630abc07eb021b71

PR 791 - [DO NOT MERGE] Porting ROCm githooks to the master-rocm-enhanced branch

view details

Deven Desai

commit sha 768665a0c89bf9523514bc5b0d36988ed4475638

PR 794 - [DO NOT MERGE] Porting ROCm docs to master-rocm-enhanced

view details

Deven Desai

commit sha 0fa2c852b7a594c9171731687a18c3bd310706d5

PR 795 - [DO NOT MERGE] Porting ROCm scripts to master-rocm-enhanced branch

view details

Deven Desai

commit sha c57be57c31efbe5ac47349099ca75ae2b92e9c27

PR 799 - [DO NOT MERGE] Porting ROCm batch_gemm support to master-rocm-enhanced

view details

Deven Desai

commit sha 7fb04f37eaf7562461bc7e27863470cb06f0c00d

PR 803 - [DO NOT MERGE] Porting 3d pooling support to master-rocm-enhanced

view details

push time in 19 days

push event ROCmSoftwarePlatform/tensorflow-upstream

BashirSbaiti

commit sha 403194f86321711d2c0800b47a77b0834d6238d3

Add usage example for tf.math.sigmoid. Add a brief example showing how to use the sigmoid function in tf.math.

view details

BashirSbaiti

commit sha e911c17326f11d507c7cfd05ae9fcaf385b3b597

Fix formatting. Line length fix.

view details

Deven Desai

commit sha 09e2eaf34227ef922d8e85b0caef2c0eb5749df5

[ROCm] Enabling vectorization/packet support for some cwise ops on the ROCm platform

view details

Deven Desai

commit sha d325b255ff7d0bf1ca04229880dffb0a37d52e2d

[ROCm] removing no_rocm tag from tests that are now passing on the ROCm platform. Also disabling one subtest within the //tensorflow/core:constant_folding_test. That test requires GPU support for the "topK" to work as intended. The ROCm platform currently does not have GPU support for the "topK" operator and hence the test fails on it.

view details

Deven Desai

commit sha 17b87f0b51ad290269f983a85b887ae838c2ebe2

[ROCm] disabling fast NHWC implementation for the ROCm platform.

view details

Deven Desai

commit sha 04fb568df083a1903dd4f061539b29b4a849fd18

[ROCm] Adding explicit error messages for DnnScratchAllocator failures

view details

BashirSbaiti

commit sha ad0424b0829cb100a8e42cb0888b808670bece2a

Doctest formatting

view details

Måns Nilsson

commit sha cd311a8656fd9827a8fd8abdfc99ea47fd0ead4e

Inherit CMSIS CCFLAGS when generating mbed project

view details

Peng Sun

commit sha 9e18cc421be4eebb4e68da3834775e7beaa69786

Add 16 bit activations support to kernel operator STRIDED_SLICE. Enable kernel STRIDED_SLICE support for int16 activations. Add typed test for int16 reference kernel.

view details

Ashutosh Hathidara

commit sha 2ebe69ec8bc752194a56b2fee6b91f05897420e0

untracked files

view details

Ashutosh Hathidara

commit sha 856175bc3afaa07f6e41ad781962a69e82ca7648

Merge remote-tracking branch 'upstream/master'

view details

Ashutosh Hathidara

commit sha b779a4737a16b1b04a1180c4193187625047b101

stack dump detached

view details

Ashutosh Hathidara

commit sha d0d5632b8d00e7d0e6285a1637d10ac50fa266d0

Gradient doc changed

view details

Ashutosh Hathidara

commit sha 5b8aadb94633f5c36d892236ba249d0dd4325723

Example added

view details

comet

commit sha 23ff63d546800cd7143172bb201d6bb8d6fd15fd

update

view details

comet

commit sha f8f33dfc6ed8e7b528b3c2b05a9b8100aaa0d037

update

view details

jerryyin

commit sha 1f4186c64f76854fe26335729022b7dea4dec941

Disable test that invokes rocBlas TRSM

view details

Frederic Bastien

commit sha e14bd919f4190cda821f616c15952fdbb36242dc

[XLA] Better default SM for unknown SM.

view details

Anthony Barbier

commit sha 34f4de78c58bd802237ac327cf6296493d39ca85

Clear caches on eager context reset. Clear the caches of the existing eager context before deleting it, as some of these caches are C++ statics (TFE_TensorHandleCache) and will therefore be inherited by future contexts with some out-of-date content, which might lead to segfaults.

view details

Ehsan Toosi

commit sha a97d62e64274d492460e7c863e74cb7aee2fcdc2

[MLIR][XLA] Add HLO FuncOp to LHLO FuncOp legalization.

view details

push time in 19 days

push event ROCmSoftwarePlatform/tensorflow-upstream

Pooya Davoodi

commit sha 98661e2a3af5b062aa7a7190b83b4e4c9c3d49bf

Change TrtConversionParams to a class from NamedTuple. This change preserves backward compatibility of the TF-TRT API.

view details

Pooya Davoodi

commit sha 8b9901e411ba9d3f7cf1fa4908452b4bfa51fe88

Remove unused import and do not use "is" for string comparison

view details

Pooya Davoodi

commit sha e8565092ff2ea40455c36a024f3220288f4c771d

Update golden API for TF-TRT backward compatibility test

view details

Pooya Davoodi

commit sha 3a378dbfd51f356326dfbac56ba1be4bc88f774e

Export ConversionParams as an API

view details

Pooya Davoodi

commit sha 38be61cc482912d01f85977ba5d7d3ad45c1795f

Let _replace() return a copy of the object instead of changing the object. The idea is to make DEFAULT_TRT_CONVERSION_PARAMS immutable. Although this doesn't make it completely immutable, it avoids changing the object via _replace(), which is the method everyone was using on DEFAULT_TRT_CONVERSION_PARAMS when it was a NamedTuple.

view details

Pooya Davoodi

commit sha 69a6150f752ba0824a0713f42bec5ef78d31df27

Delete argument allow_build_at_runtime. This was accidentally introduced. The argument will be added in https://github.com/tensorflow/tensorflow/pull/34919

view details

Pooya Davoodi

commit sha fbe5b54b7618ccc180fad718fc1353d2e6bcc684

Add __repr__ method to TrtConversionParams to have class properties in API pbtxt

view details

Elena Zhelezina

commit sha c3000c5ae3e547e6fe58608e759af75033c35402

Added 16/8 bit support to kernel operator CONCAT. Added support for 16-bit activations and 8-bit weights quantization using the symmetric quantization scheme with zero point at 0.

view details

Deven Desai

commit sha cd1a77fd5e6438b825cc90bd34706a8edeea809a

fixing a bug in the previous commit to the nccl_manager_test.cc file. Also updating the run_cc_core.sh file to not filter out the nccl_manager_test

view details

Elena Zhelezina

commit sha 102a07e4acb99c7d624adb19e70e25f29d8b1668

Reference kernel MUL, 16-bit: tests + implementation.

view details

Pooya Davoodi

commit sha ac5d19e1e195cb1dfe9a6e7932e3d41cda540b97

Add golden pbtxt file for TensorRT ConversionParams

view details

Pooya Davoodi

commit sha b1cd057295b081053428e791e99f4d259a0f9f95

Make TrtConversionParams immutable

view details

Pooya Davoodi

commit sha 7c4da2cd61ac14d888aaf5ee960d40aad5bb1779

Make conversion_params immutable by using None as its default value

view details

Pooya Davoodi

commit sha f86b8b15871555ab826196fd08e06b4114db24b1

Update goldens api of TRT converter after changing params default value

view details

Pooya Davoodi

commit sha 0a517b5f868d1637a24ca6a373ed42bbe3459f7a

Add necessary variables for trt_convert in windows build

view details

Elena Zhelezina

commit sha 8001b0bd556a44c9a008e85edf8af30b3aeb97ce

Addressed CheckLint error.

view details

Elena Zhelezina

commit sha a2aa5e3f045a5916b20a63b58f824ed59710845a

Addressed review comments on PR for CONCAT operation.

view details

Elena Zhelezina

commit sha 6e45db6b4a319e4364f8cf5bebe8ba31933d9bd6

Removed false changes.

view details

Gaurav Singh

commit sha 4edec1af39565bd512bd2dd31c79e96766bf60a0

[core] scatter_op: Remove duplicate check for IsScalar()

view details

Frederic Bastien

commit sha 9581b90a911ec52d370984bff746a8bdfc0c6652

Do not emit condition when all threads fully unroll.

view details

push time in 19 days

Pull request review comment ROCmSoftwarePlatform/tensorflow-upstream

[Draft] Fused multiply add ops

 Status ArithmeticOptimizer::SimplifyArithmeticOps(bool can_use_shapes) {
     pipeline.AddStage<RemoveStackStridedSliceSameAxis>(ctx, ctx_ext);
   if (options_.fuse_squared_diff)
     pipeline.AddStage<FuseSquaredDiffStage>(ctx, ctx_ext);
+  if (options_.fuse_mul_add)
+    pipeline.AddStage<FuseMulAddStage>(ctx, ctx_ext);

see here: https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/pull/782/files?file-filters%5B%5D=.cc#diff-99de50a7ff03ef1c38515c21b3a4a934R258-R264

ekuznetsov139

comment created time in 20 days

Pull request review comment ROCmSoftwarePlatform/tensorflow-upstream

[Draft] Fused multiply add ops

 Status ArithmeticOptimizer::SimplifyArithmeticOps(bool can_use_shapes) {
     pipeline.AddStage<RemoveStackStridedSliceSameAxis>(ctx, ctx_ext);
   if (options_.fuse_squared_diff)
     pipeline.AddStage<FuseSquaredDiffStage>(ctx, ctx_ext);
+  if (options_.fuse_mul_add)
+    pipeline.AddStage<FuseMulAddStage>(ctx, ctx_ext);

The other fused ops are inserted only in the non-XLA path. They are inserted in a pass which I believe runs quite late, after even the GPU placement is done for the nodes.

One option would be to switch the FMA node creation to that later pass, and thus bypass this matter altogether.

ekuznetsov139

comment created time in 20 days

pull request comment ROCmSoftwarePlatform/tensorflow-upstream

[Draft] Fused multiply add ops

Wouldn't implementing them in Eigen essentially defeat the purpose of fusing (since you'd be launching multiple GPU kernels inside each FMA call)?

I think it should be possible to have the Eigen implementation launch just one kernel per op. Otherwise, I agree there is no point to it.

ekuznetsov139

comment created time in 20 days
