George Karpenkov cheshire @google Cupertino, CA http://metaworld.me

cheshire/onumerical 8

Numerical Library for OCaml

cheshire/django 2

Official clone of the Subversion repository.

cheshire/antlr3-python-example 1

Simple calculator in python, using ANTLR V3

cheshire/django-css 1

django-css is a fork of django_compressor that makes it easy to use CSS compilers with your Django projects. CSS compilers extend CSS syntax to include more powerful features such as variables and nested blocks, and pretty much rock all around.

cheshire/django-whatever 1

Unobtrusive test models creation for django

cheshire/jstree 1

jquery tree plugin

cheshire/mongoengine 1

A Python Object-Document-Mapper for working with MongoDB

cheshire/opam-repository 1

Main public package repository for OPAM, the source package manager of OCaml.

pull request commenttensorflow/tensorflow

[XLA] Speed up row reduction on P100/V100 for float16 and [u]int8

I'm looking into why we get Windows failures; they seem unrelated.

nouiz

comment created time in 2 days

Pull request review commenttensorflow/tensorflow

Add info about TF_DETERMINISTIC_OPS to version 2.1 release notes

 TensorFlow 2.1 will be the last TF release supporting Python 2. Python 2 support
   * Changes rebatching for `tf.data datasets` + distribution strategies for better performance. Note that the dataset also behaves slightly differently, in that the rebatched dataset cardinality will always be a multiple of the number of replicas.
 * `TensorRT`
   * [TensorRT 6.0](https://developer.nvidia.com/tensorrt#tensorrt-whats-new) is now supported and enabled by default. This adds support for more TensorFlow ops including Conv3D, Conv3DBackpropInputV2, AvgPool3D, MaxPool3D, ResizeBilinear, and ResizeNearestNeighbor. In addition, the TensorFlow-TensorRT python conversion API is exported as `tf.experimental.tensorrt.Converter`.
+  * Environment variable `TF_DETERMINISTIC_OPS` added. When set to "true" or "1", this environment variable makes `tf.nn.bias_add` operate deterministically (i.e. reproducibly) when XLA JIT compilation is *not* enabled. It also makes cuDNN convolution and max-pooling operate deterministically. This makes Keras Conv*D and MaxPool*D layers operate deterministically in both the forward and backward directions when running on a CUDA-enabled GPU.

Cool. Once the performance is closer to the atomic version, I hope we can even switch to deterministic reductions by default.

duncanriach

comment created time in 3 days

pull request commenttensorflow/tensorflow

[XLA] Speed up row reduction on P100/V100 for float16 and [u]int8

Let's have a separate PR for changes to num_threads_x, and discuss the trade-offs there.

nouiz

comment created time in 3 days

Pull request review commenttensorflow/tensorflow

[XLA] Speed up row reduction on P100/V100 for float16 and [u]int8

 ReductionCodegenInfo IrEmitterUnnested::ComputeReductionCodegenInfo(
   int64 num_threads_y = reduction_dimensions.is_row_reduction ? 1 : kWarpSize;
   int64 num_threads_x = [&] {
     if (reduction_dimensions.is_row_reduction) {
+      int cc_major = 0, cc_minor = 0;
+      ir_emitter_context_->device_description().cuda_compute_capability(
+          &cc_major, &cc_minor);
+      int64 num_threads_x = kWarpSize * kWarpSize;
+      if (cc_major >= 6 && smallest_input_dtype_bits <= 16) {
+        num_threads_x = kWarpSize * 8;

So I could add an attribute inside the Reduce operations called "biggest_block" to prevent this slowdown

I am confused, what would this attribute capture?

nouiz

comment created time in 3 days

Pull request review commenttensorflow/tensorflow

[XLA] Speed up row reduction on P100/V100 for float16 and [u]int8

 ReductionCodegenInfo IrEmitterUnnested::ComputeReductionCodegenInfo(
   int64 num_threads_y = reduction_dimensions.is_row_reduction ? 1 : kWarpSize;
   int64 num_threads_x = [&] {
     if (reduction_dimensions.is_row_reduction) {
+      int cc_major = 0, cc_minor = 0;
+      ir_emitter_context_->device_description().cuda_compute_capability(
+          &cc_major, &cc_minor);
+      int64 num_threads_x = kWarpSize * kWarpSize;
+      if (cc_major >= 6 && smallest_input_dtype_bits <= 16) {
+        num_threads_x = kWarpSize * 8;

I can't. For float16 it would work: I would get a 2.5x speedup instead of 3.5x. But for u8, I get a 0.95x speedup

Could we still commit this? A 2.5x speedup for fp16 is very impressive, and it sounds well worth the trade-off of a 0.95x u8 slowdown.

Also do you know why changing the tiling without cutting down the number of threads creates the slowdown?

So I could add an attribute inside the Reduce operations called "biggest_block" to prevent this slowdown

nouiz

comment created time in 3 days

Pull request review commenttensorflow/tensorflow

Add info about TF_DETERMINISTIC_OPS to version 2.1 release notes

 TensorFlow 2.1 will be the last TF release supporting Python 2. Python 2 support
   * Changes rebatching for `tf.data datasets` + distribution strategies for better performance. Note that the dataset also behaves slightly differently, in that the rebatched dataset cardinality will always be a multiple of the number of replicas.
 * `TensorRT`
   * [TensorRT 6.0](https://developer.nvidia.com/tensorrt#tensorrt-whats-new) is now supported and enabled by default. This adds support for more TensorFlow ops including Conv3D, Conv3DBackpropInputV2, AvgPool3D, MaxPool3D, ResizeBilinear, and ResizeNearestNeighbor. In addition, the TensorFlow-TensorRT python conversion API is exported as `tf.experimental.tensorrt.Converter`.
+  * Environment variable `TF_DETERMINISTIC_OPS` added. When set to "true" or "1", this environment variable makes `tf.nn.bias_add` operate deterministically (i.e. reproducibly) when XLA JIT compilation is *not* enabled. It also makes cuDNN convolution and max-pooling operate deterministically. This makes Keras Conv*D and MaxPool*D layers operate deterministically in both the forward and backward directions when running on a CUDA-enabled GPU.

@duncanriach As of https://github.com/tensorflow/tensorflow/commit/8b7a3db0b6e09415b5640be4986fb4d7c6e5209a the reduction emitter respects TF_DETERMINISTIC_OPS. On our benchmark, its performance is within 0.9x of the emitter which uses atomics.

duncanriach

comment created time in 3 days

Pull request review commenttensorflow/tensorflow

[XLA] Speed up row reduction on P100/V100 for float16 and [u]int8

 ReductionCodegenInfo IrEmitterUnnested::ComputeReductionCodegenInfo(
   int64 num_threads_y = reduction_dimensions.is_row_reduction ? 1 : kWarpSize;
   int64 num_threads_x = [&] {
     if (reduction_dimensions.is_row_reduction) {
+      int cc_major = 0, cc_minor = 0;
+      ir_emitter_context_->device_description().cuda_compute_capability(
+          &cc_major, &cc_minor);
+      int64 num_threads_x = kWarpSize * kWarpSize;
+      if (cc_major >= 6 && smallest_input_dtype_bits <= 16) {
+        num_threads_x = kWarpSize * 8;

Sorry, this gets a bit complicated here: such a change would require a change here as well: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/compiler/xla/service/gpu/tree_reduction_rewriter.cc#L43, which would then require changes to the corresponding tests.

Could we just commit a tiling change here, and then have a separate PR with a discussion for changing the number of threads in a block?

nouiz

comment created time in 3 days

Pull request review commenttensorflow/tensorflow

[ROCm][XLA:GPU] Merging xla postlayout assignment optimization pass

 class GpuCompiler : public LLVMCompiler {
    virtual Status OptimizeHloPostLayoutAssignment(

Does it need to remain virtual?

jerryyin

comment created time in 6 days

Pull request review commenttensorflow/tensorflow

[ROCm][XLA:GPU] Merging xla postlayout assignment optimization pass

 Status GpuCompiler::PrepareHloModuleForIrEmitting(HloModule* hlo_module) {
   return pipeline.Run(hlo_module).status();
 }
 
+Status GpuCompiler::OptimizeHloPostLayoutAssignment(
+    HloModule* hlo_module, se::StreamExecutor* stream_exec,
+    se::DeviceMemoryAllocator* device_allocator) {
+  HloPassPipeline pipeline("post-layout_assignment");
+  /* TODO(b/117531509): Use LayoutAssignment::InstructionCanChangeLayout after
+   * fixing the ticket. */
+  pipeline.AddInvariantChecker<HloVerifier>(
+      /*layout_sensitive=*/true,
+      /*allow_mixed_precision=*/false,
+      LayoutAssignment::InstructionCanChangeLayout);
+
+  pipeline.AddPass<ReductionDegenerateDimRemover>();
+  pipeline.AddPass<ReductionLayoutNormalizer>();
+  pipeline.AddPass<ReductionDimensionGrouper>();
+
+  // The LayoutAssignment pass may leave behind kCopy instructions which are
+  // duplicate or NOPs, so remove them with algebraic simplification and CSE.
+  AlgebraicSimplifierOptions options;
+  options.set_is_layout_sensitive(true);
+  pipeline.AddPass<HloPassFix<AlgebraicSimplifier>>(options);
+
+  if (hlo_module->config().debug_options().xla_gpu_deterministic_reductions()) {
+    pipeline.AddPass<HloPassFix<GpuTreeReductionRewriter>>();
+  }
+
+  // Rewrite GEMMs into custom calls.
+  pipeline.AddPass<GemmRewriter>();
+
+  // Choose the fastest algorithm for each conv.
+  //
+  // We pick the algorithm before fusion so we can generate better HLO. After
+  // GpuConvRewriter, our convolutions are CustomCalls which return a
+  // tuple (conv_result, scratch_memory), and the each conv uses 0 bytes of
+  // scratch:
+  //
+  //   customcall = (f32[...], f32[0])
+  //   return gte(customcall, 0)
+  //
+  // The algorithm picker then chooses the best algorithm, and potentially
+  // increases the scratch space.  It replaces customcall with new_tuple,
+  // giving us the following:
+  //
+  //   new_customcall = (f32[...], f32[N])
+  //   new_tuple = tuple(gte(new_customcall, 0), constant f32[0])
+  //   return gte(new_tuple, 0)
+  //
+  // The new tuple and gte instructions then be simplified away, because
+  // nobody is expected to use the scratch value.
+  //
+  // However, if we were to run GpuConvAlgorithmPicker after fusion
+  // the gte(customcall, 0) would probably already be into a fusion node.  We
+  // can't simplify across HloComputation boundaries, so in this case we
+  // wouldn't be able to simplify away the new_tuple bits.
+

Extra newline

jerryyin

comment created time in 6 days

Pull request review commenttensorflow/tensorflow

[XLA] Speed up row reduction on P100/V100 for float16 and [u]int8

 ReductionCodegenInfo IrEmitterUnnested::ComputeReductionCodegenInfo(
   int64 num_threads_y = reduction_dimensions.is_row_reduction ? 1 : kWarpSize;
   int64 num_threads_x = [&] {
     if (reduction_dimensions.is_row_reduction) {
+      int cc_major = 0, cc_minor = 0;
+      ir_emitter_context_->device_description().cuda_compute_capability(
+          &cc_major, &cc_minor);
+      if (cc_major >= 6 && smallest_input_dtype_bits <= 16) {
+        return num_threads_x = kWarpSize * 4;

Does this still make sense? I thought that with intra-block communication working it would make sense to saturate the block all the way to kWarpSize * kWarpSize if the dimension is big enough?

nouiz

comment created time in 6 days

Pull request review commenttensorflow/tensorflow

[XLA] Speed up row reduction on P100/V100 for float16 and [u]int8

 ReductionCodegenInfo IrEmitterUnnested::ComputeReductionCodegenInfo(
   int64 num_threads_y = reduction_dimensions.is_row_reduction ? 1 : kWarpSize;
   int64 num_threads_x = kWarpSize;
+  if (reduction_dimensions.is_row_reduction) {
+    int cc_major = 0, cc_minor = 0;
+    ir_emitter_context_->device_description().cuda_compute_capability(
+        &cc_major, &cc_minor);
+    if (cc_major >= 6 && smallest_input_dtype_bits <= 16) {
+      num_threads_x = kWarpSize * 4;

Do you plan to change this code again soon?

No. Sorry again for not planning this better.

nouiz

comment created time in 6 days

Pull request review commenttensorflow/tensorflow

[XLA] Speed up row reduction on P100/V100 for float16 and [u]int8

 ReductionCodegenInfo IrEmitterUnnested::ComputeReductionCodegenInfo(
   int64 num_threads_y = reduction_dimensions.is_row_reduction ? 1 : kWarpSize;
   int64 num_threads_x = kWarpSize;
+  if (reduction_dimensions.is_row_reduction) {
+    int cc_major = 0, cc_minor = 0;
+    ir_emitter_context_->device_description().cuda_compute_capability(
+        &cc_major, &cc_minor);
+    if (cc_major >= 6 && smallest_input_dtype_bits <= 16) {
+      num_threads_x = kWarpSize * 4;

Sorry for colliding here: in the meantime I have committed https://github.com/tensorflow/tensorflow/commit/0c3feb1cadf79420b18cc0ce7731476ba45d00df which started using intra-block communication for row reduction. Could you check whether these numbers need updating?

nouiz

comment created time in 8 days

Pull request review commenttensorflow/tensorflow

[XLA] Speed up row reduction on P100/V100 for float16 and [u]int8

 ReductionCodegenInfo IrEmitterUnnested::ComputeReductionCodegenInfo(
            << " " << reduction_dimensions.dimensions[0] << " "
            << reduction_dimensions.dimensions[1] << " "
            << reduction_dimensions.dimensions[2];
+  auto get_dtype_bits = [](const HloInstruction *i) {
+    return primitive_util::BitWidth(i->shape().element_type());};
 
+  // For fusion with multiple inputs, use the smallest input dtype to
+  // select the reduction_tiling.
+  int smallest_input_dtype_bits = get_dtype_bits(first_reduce->operand(0));
+  for (auto input: unnested_hlo->operands()) {

Explicit type preferable
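For illustration, a minimal sketch of the requested spelling; the loop body here is an assumption, since the quoted diff cuts off before it:

  // Spell out the element type instead of `auto` so the reader can see what
  // the loop iterates over.
  for (const HloInstruction* input : unnested_hlo->operands()) {
    smallest_input_dtype_bits =
        std::min(smallest_input_dtype_bits, get_dtype_bits(input));
  }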

nouiz

comment created time in 8 days

issue closedtensorflow/tensorflow

Unify the code for GpuCompiler::OptimizeHloPostLayoutAssignment for AMD and NVidia in XLA

The code in GpuCompiler::OptimizeHloPostLayoutAssignment subclasses is essentially duplicated for AMD and NVidia in XLA. This already leads to subtle bugs: TreeReductionRewriterPass is applied to NVidia, but not to AMD.

Would it be possible to unify those, put the code in gpu_compiler.cc, and just check the platform to dynamically choose whether to apply NVidia-specific or AMD-specific passes?

closed time in 9 days

cheshire

issue openedtensorflow/tensorflow

Unify the code for GpuCompiler::OptimizeHloPostLayoutAssignment for AMD and NVidia in XLA

The code in GpuCompiler::OptimizeHloPostLayoutAssignment subclasses is essentially duplicated for AMD and NVidia in XLA. This already leads to subtle bugs: TreeReductionRewriterPass is applied to NVidia, but not to AMD.

Would it be possible to unify those, put the code in gpu_compiler.cc, and just check the platform to dynamically choose whether to apply NVidia-specific or AMD-specific passes?

created time in 9 days

pull request commenttensorflow/tensorflow

[XLA] Fix a latent bug.

Now this is obviously a bug, but have you checked the usages? Do they actually need a block ID zero? Any ideas on why it was working before?

nouiz

comment created time in 9 days

issue commenttensorflow/tensorflow

Executing XLA compiled function inside tf.GradientTape context leads to extraneous GPU kernels and D2D copies

Actually apologies, this was the incorrect commit, the proper fix has still not landed yet.

ifed-ucsd

comment created time in 10 days


issue commenttensorflow/tensorflow

Executing XLA compiled function inside tf.GradientTape context leads to extraneous GPU kernels and D2D copies

You would have to wait for it to land after it passes internal testing.

ifed-ucsd

comment created time in 11 days

issue commenttensorflow/tensorflow

Executing XLA compiled function inside tf.GradientTape context leads to extraneous GPU kernels and D2D copies

I'll mark the bug as fixed based on removing those extra copies. From what I understand, the fact that the kernel for gradient computation is launched regardless of whether derivative is used is just a property of eager runtime, and can not really be avoided (if you don't need the gradients => don't launch under the gradient tape).

ifed-ucsd

comment created time in 11 days

issue commenttensorflow/tensorflow

Executing XLA compiled function inside tf.GradientTape context leads to extraneous GPU kernels and D2D copies

Thanks for such a detailed bug report! I have a fix ready which removes the two extra D2D copies.

ifed-ucsd

comment created time in 11 days

issue commenttensorflow/tensorflow

device_lib.list_local_devices() InvalidArgumentError: Invalid device ordinal value (1). Valid range is [0, 0].

@betterze It's an environment variable: env TF_XLA_FLAGS='--tf_xla_enable_xla_devices=false' ./yourscript.py

shun-lin

comment created time in 11 days

issue commenttensorflow/tensorflow

device_lib.list_local_devices() InvalidArgumentError: Invalid device ordinal value (1). Valid range is [0, 0].

@betterze Have you tried running under the environment variable specified in the previous comment?

shun-lin

comment created time in 11 days

pull request commenttensorflow/tensorflow

[ROCm][XLA] Adding three passes to amdgpu compiler

That's strange, let me try again.

jerryyin

comment created time in 13 days

pull request commenttensorflow/tensorflow

[XLA] clean up generated LLVM reduction code.

@nouiz Actually I was wrong, sorry again. Something unexpected happened here; let me double-check what happened.

nouiz

comment created time in 17 days

pull request commenttensorflow/tensorflow

[XLA] clean up generated LLVM reduction code.

It is now merged

@nouiz No it's not, check out the "merge" message: "ArmageddonKnight pushed a commit to UofT-EcoSystem/tensorflow that referenced this pull request".

nouiz

comment created time in 17 days

issue commentpreservim/nerdcommenter

Broken code after commenting / uncommenting

I was able to use this as a workaround:

let g:NERDLPlace = "/*"
let g:NERDRPlace = "*/"
ki11roy

comment created time in 18 days

Pull request review commenttensorflow/tensorflow

[XLA] clean up generated LLVM reduction code.

 static void EmitTile(
   IrArray::Index source_idx = tile_origin_index.AddOffsetToDim(
       start_offset_x, KernelMappingScheme::DimX, b);
 
+  // True when all threads will always execute all instructions.
+  // So we do not need to emit condition.
+  bool always_full_tile =
+      mapping_scheme.GetDimsInElems()[2] % tile_size_x == 0;

Suppose we get

num_threads_x=32
tile_size_x=32
dims[2]=512

512 is divisible by 32, but not all threads will fit, as e.g. thread 15 will try to make accesses 480, 512, 544, 576, etc, and only the first two will be in bounds.

nouiz

comment created time in 19 days

Pull request review commenttensorflow/tensorflow

[XLA] clean up generated LLVM reduction code.

 Status KernelSupportLibrary::IfWithStatus(
     absl::string_view name, llvm::Value* condition,
     const std::function<Status()>& true_block_generator,
     const std::function<Status()>& false_block_generator) {
-  llvm_ir::LlvmIfData if_data = llvm_ir::EmitIfThenElse(condition, name, b_);
+  llvm_ir::LlvmIfData if_data = llvm_ir::EmitIfThenElse(
+      condition, name, b_, false_block_generator != nullptr);

Insert a comment saying which parameter you are populating, e.g. /*param_name=*/...
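For example, something along these lines; the parameter name emit_else is an assumption about the EmitIfThenElse signature, not confirmed by the diff:

  llvm_ir::LlvmIfData if_data = llvm_ir::EmitIfThenElse(
      condition, name, b_, /*emit_else=*/false_block_generator != nullptr);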

nouiz

comment created time in 19 days

Pull request review commenttensorflow/tensorflow

[XLA] clean up generated LLVM reduction code.

 static void EmitTile(
   IrArray::Index source_idx = tile_origin_index.AddOffsetToDim(
       start_offset_x, KernelMappingScheme::DimX, b);
 
+  // True when all threads will always execute all instructions.
+  // So we do not need to emit condition.
+  bool always_full_tile =
+      mapping_scheme.GetDimsInElems()[2] % tile_size_x == 0;

Note that tile_size_x is per thread block, wouldn't you need to divide by num_threads_x first?

nouiz

comment created time in 19 days

issue commenttensorflow/tensorflow

device_lib.list_local_devices() InvalidArgumentError: Invalid device ordinal value (1). Valid range is [0, 0].

Also could you say whether the crash is still there if you run under TF_XLA_FLAGS='--tf_xla_enable_xla_devices=false' ?

shun-lin

comment created time in a month

Pull request review commenttensorflow/tensorflow

[ROCm][XLA:GPU] Fixing Atomic CAS codegen in ir_emitter

 Status IrEmitter::EmitAtomicOperationUsingCAS(const HloComputation& computation,
         IntToPtr(atomic_memory_address, atomic_address_type);
     binop_output_address =
         Add(PtrToInt(cas_new_output_address, address_int_type), offset);
-    binop_output_address = IntToPtr(binop_output_address, element_address_type);
+    binop_output_address = IntToPtr(

@jerryyin This is another piece of infrastructure which is not working quite right: even though internally we build with -Werror, the external BUILD configuration is built with -w, which suppresses all warnings.

jerryyin

comment created time in a month

PR closed tensorflow/tensorflow

Fix invalid XLA_FLAGS documentation cla: yes comp:xla ready to pull size:XS

Setting XLA_FLAGS="--dump_hlo_as_text" is invalid and causes a catastrophic error; --xla_dump_hlo_as_text is the correct flag.

+1 -1

2 comments

1 changed file

kkroening

pr closed time in a month

pull request commenttensorflow/tensorflow

Fix invalid XLA_FLAGS documentation

Going even further, dump_hlo_as_text is not even required, since that's the default. For some reason, the system is having difficulties importing this commit. I've committed the change internally, referencing your name and this pull request.

kkroening

comment created time in a month

Pull request review commenttensorflow/tensorflow

[ROCm][XLA] Adding address space cast in ir_emitter

 limitations under the License.
 #include "tensorflow/compiler/xla/window_util.h"
 #include "tensorflow/core/lib/core/errors.h"
 
+// Convenient function to cast the provided llvm::Value* using IRBuilder
+// to default address space. This is useful in particular in generating
+// IR for AMDGPU target, as its kernel variables is in address space 5

"for generating", "are in address space", "instead of the default"

jerryyin

comment created time in a month

Pull request review commenttensorflow/tensorflow

[ROCm][XLA] Updating amdgpu compiler and enabling llvm compiler test

 limitations under the License.
 #include "tensorflow/stream_executor/stream_executor.h"
 
 namespace xla {
+namespace gpu {
+
+// Creating dummy data structure needed to initialize a GpuCompilerTest
+PLATFORM_DEFINE_ID(kDummyTestId);
+constexpr char kDummyTriple[] = "dummy-triple";
+constexpr char kDummyLayout[] = "e";
+
+// This class is is a dummy implementation of GpuCompiler and is targeted for
+// unit test only
+class GpuCompilerTest : public GpuCompiler {

*Test is perhaps not the best name, since it is reserved for actual tests.

jerryyin

comment created time in a month

Pull request review commenttensorflow/tensorflow

[ROCm][XLA] Updating amdgpu compiler and enabling llvm compiler test

 Status AMDGPUCompiler::OptimizeHloPostLayoutAssignment(
       /*allow_mixed_precision=*/false,
       LayoutAssignment::InstructionCanChangeLayout);
 
+  pipeline.AddPass<ReductionDegenerateDimRemover>();
+  pipeline.AddPass<ReductionLayoutNormalizer>();
+  pipeline.AddPass<ReductionDimensionGrouper>();

I guess sharing this registration between nvptx_compiler and amdgpu_compiler could be done in a separate PR?

jerryyin

comment created time in a month

pull request commenttensorflow/tensorflow

Fix invalid XLA_FLAGS documentation

Thanks!

kkroening

comment created time in a month

PR closed tensorflow/tensorflow

[ROCm] Fix for compile error in //tensorflow/compiler/xla/service:dynamic_padder_test cla: yes comp:xla ready to pull size:XS

On the ROCm platform, we currently get the following compile failure for the test //tensorflow/compiler/xla/service:dynamic_padder_test

...
INFO: Deleting stale sandbox base /root/.cache/bazel/_bazel_root/efb88f6336d9c4a18216fb94287b8d97/sandbox
ERROR: /root/tensorflow/tensorflow/compiler/xla/service/BUILD:2311:1: C++ compilation of rule '//tensorflow/compiler/xla/service:dynamic_padder_test_gpu' failed (Exit 1)
tensorflow/compiler/xla/service/dynamic_padder_test.cc: In member function 'virtual void xla::{anonymous}::ExecutionTest_ScatterUpdate_Test::TestBody()':
tensorflow/compiler/xla/service/dynamic_padder_test.cc:266:7: error: 'StrReplaceAll' is not a member of 'absl'
       absl::StrReplaceAll(hlo_text, {{"INDICES_BOUND", "2"}});
       ^
tensorflow/compiler/xla/service/dynamic_padder_test.cc:281:7: error: 'StrReplaceAll' is not a member of 'absl'
       absl::StrReplaceAll(hlo_text, {{"INDICES_BOUND", "4"}});
...

This fix resolves the compile error, and gets the test passing again on the ROCm platform

On the ROCm platform, this test is compiled via the following gcc compiler

root@ixt-rack-04:/root/tensorflow# gcc --version
gcc (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609

The crosstool setup / compile invocation on the ROCm platform is done via

  • https://github.com/tensorflow/tensorflow/blob/master/third_party/gpus/rocm_configure.bzl
  • https://github.com/tensorflow/tensorflow/blob/master/third_party/gpus/crosstool/clang/bin/crosstool_wrapper_driver_rocm.tpl

/cc @cheshire @whchung

+1 -0

2 comments

1 changed file

deven-amd

pr closed time in a month

pull request commenttensorflow/tensorflow

[ROCm] Fix for compile error in //tensorflow/compiler/xla/service:dynamic_padder_test

Should be fixed in https://github.com/tensorflow/tensorflow/commit/48a1f6cac71e216c076941e4fb449613bac59f05

deven-amd

comment created time in a month

PR closed tensorflow/tensorflow

[ROCm] Fix for compile error in //tensorflow/compiler/xla:refcounting_hash_map_test cla: yes comp:xla size:XS

On the ROCm platform, we currently get the following compile failure for the test //tensorflow/compiler/xla:refcounting_hash_map_test

...
ERROR: /root/tensorflow/tensorflow/compiler/xla/BUILD:928:1: C++ compilation of rule '//tensorflow/compiler/xla:refcounting_hash_map_test' failed (Exit 1)
tensorflow/compiler/xla/refcounting_hash_map_test.cc: In member function 'virtual void xla::{anonymous}::RefcountingHashMapTest_ForEachEmpty_Test::TestBody()':
tensorflow/compiler/xla/refcounting_hash_map_test.cc:80:3: error: 'int64' was not declared in this scope
   int64 count = 0;
...

This fix resolves the compile error, and gets the test passing again on the ROCm platform

On the ROCm platform, this test is compiled via the following gcc compiler

root@ixt-rack-04:/root/tensorflow# gcc --version
gcc (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609

The crosstool setup / compile invocation on the ROCm platform is done via

  • https://github.com/tensorflow/tensorflow/blob/master/third_party/gpus/rocm_configure.bzl
  • https://github.com/tensorflow/tensorflow/blob/master/third_party/gpus/crosstool/clang/bin/crosstool_wrapper_driver_rocm.tpl

/cc @cheshire @whchung

+1 -1

1 comment

1 changed file

deven-amd

pr closed time in a month

pull request commenttensorflow/tensorflow

[ROCm] Fix for compile error in //tensorflow/compiler/xla:refcounting_hash_map_test

Should be fixed in https://github.com/tensorflow/tensorflow/commit/1bb2e82cdffc62a363a3c68dbdaf31826ee3358a

deven-amd

comment created time in a month

Pull request review commenttensorflow/tensorflow

[ROCm][XLA] Updating amdgpu compiler and enabling llvm compiler test

 TEST_F(CpuCompilerTest, HooksTest) {
 }
 
 TEST_F(GpuCompilerTest, HooksTest) {
+#if TENSORFLOW_USE_ROCM
+  gpu::AMDGPUCompiler compiler;
+#elif GOOGLE_CUDA
   gpu::NVPTXCompiler compiler;
+#endif
   TestCompilerHooks(&compiler);

Sorry, that wasn't my question.

My question is, why is the method OptimizeHloPostLayoutAssignment distinct for NVPTXCompiler and AMDGPUCompiler in the first place?

jerryyin

comment created time in a month

Pull request review commenttensorflow/tensorflow

[ROCm][XLA] Adding address space cast in ir_emitter

 Status IrEmitter::EmitCallToNestedComputation(
     emitted_function = ir_emitter_nested.GetEmittedFunction();
   }
 
-  std::vector<llvm::Value*> arguments(operands.begin(), operands.end());
-  arguments.push_back(output);
+  // For AMDGPU target, may need to addrspacecast alloca variables from
+  // addrspace 5 to addrspace 0
+  std::vector<llvm::Value*> arguments;
+  for (auto& arg : operands) {
+    llvm::Value* casted_arg = MayAddrSpaceCastArg(arg, b_);
+    arguments.push_back(casted_arg);
+  }
+
+  llvm::Value* casted_output = MayAddrSpaceCastArg(output, b_);
+  arguments.push_back(casted_output);
+
+  // temp buffer base is always in addrspace 0 so it's not required to

Full sentence with proper capitalization and punctuation.

jerryyin

comment created time in a month

Pull request review commenttensorflow/tensorflow

[ROCm][XLA] Adding address space cast in ir_emitter

 limitations under the License.
 #include "tensorflow/compiler/xla/window_util.h"
 #include "tensorflow/core/lib/core/errors.h"
 
+namespace {
+
+static llvm::Value* MayAddrSpaceCastArg(llvm::Value* arg,

Also a better name seems to be AddrCastIfNecessary or something similar. Just AddrCastToDefault is also probably fine.

jerryyin

comment created time in a month

Pull request review commenttensorflow/tensorflow

[ROCm][XLA] Adding address space cast in ir_emitter

 std::unique_ptr<KernelThunk> IrEmitterUnnested::BuildKernelThunk(
     llvm::Type* int8_double_pointer =
         llvm::PointerType::get(b_.getInt8PtrTy(), /*AddressSpace=*/0);
     for (int64 idx : gte_index) {
-      loc = BitCast(loc, int8_double_pointer);
+      loc = PointerBitCastOrAddrSpaceCast(loc, int8_double_pointer);

Optional, but I would prefer keeping the builder.Create... rather than creating a mixin wrapper, as the latter breaks goto-definition and argument-lookup IDE functionality.
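As a sketch, the direct builder spelling would be something like the line below (llvm::IRBuilder provides CreatePointerBitCastOrAddrSpaceCast):

      loc = b_.CreatePointerBitCastOrAddrSpaceCast(loc, int8_double_pointer);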

jerryyin

comment created time in a month

Pull request review commenttensorflow/tensorflow

[ROCm][XLA] Adding address space cast in ir_emitter

 limitations under the License.
 #include "tensorflow/compiler/xla/window_util.h"
 #include "tensorflow/core/lib/core/errors.h"
 
+namespace {
+
+static llvm::Value* MayAddrSpaceCastArg(llvm::Value* arg,
+                                        llvm::IRBuilder<>& builder) {
+  llvm::Type* arg_type = arg->getType();
+  CHECK_EQ(true, arg_type->isPointerTy());

CHECK(arg_type->isPointerTy())

jerryyin

comment created time in a month

Pull request review commenttensorflow/tensorflow

[ROCm][XLA] Adding address space cast in ir_emitter

 Status IrEmitter::EmitCallToNestedComputation(
     emitted_function = ir_emitter_nested.GetEmittedFunction();
   }
 
-  std::vector<llvm::Value*> arguments(operands.begin(), operands.end());
-  arguments.push_back(output);
+  // For AMDGPU target, may need to addrspacecast alloca variables from
+  // addrspace 5 to addrspace 0
+  std::vector<llvm::Value*> arguments;
+  for (auto& arg : operands) {

This can be written somewhat more explicitly using absl::c_transform, e.g. https://github.com/tensorflow/tensorflow/blob/master/tensorflow/compiler/xla/client/xla_builder.cc#L721
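A rough sketch of that shape, assuming operands holds llvm::Value* as in the surrounding code (needs absl/algorithm/container.h and <iterator>):

  std::vector<llvm::Value*> arguments;
  absl::c_transform(operands, std::back_inserter(arguments),
                    [&](llvm::Value* arg) { return MayAddrSpaceCastArg(arg, b_); });
  arguments.push_back(MayAddrSpaceCastArg(output, b_));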

jerryyin

comment created time in a month

Pull request review commenttensorflow/tensorflow

[ROCm][XLA] Adding address space cast in ir_emitter

 limitations under the License.
 #include "tensorflow/compiler/xla/window_util.h"
 #include "tensorflow/core/lib/core/errors.h"
 
+namespace {
+
+static llvm::Value* MayAddrSpaceCastArg(llvm::Value* arg,
+                                        llvm::IRBuilder<>& builder) {

We usually just call those b

jerryyin

comment created time in a month

Pull request review commenttensorflow/tensorflow

[ROCm][XLA] Adding address space cast in ir_emitter

 limitations under the License.
 #include "tensorflow/compiler/xla/window_util.h"
 #include "tensorflow/core/lib/core/errors.h"
 
+namespace {

You don't need an anonymous namespace if the function is marked as "static".
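In other words, either form alone gives the helper internal linkage; using both is redundant (the name below reuses the AddrCastToDefault suggestion above and is illustrative):

  // Option 1: file-static function, no namespace needed.
  static llvm::Value* AddrCastToDefault(llvm::Value* arg, llvm::IRBuilder<>& b);

  // Option 2: anonymous namespace, no `static` needed.
  namespace {
  llvm::Value* AddrCastToDefault(llvm::Value* arg, llvm::IRBuilder<>& b);
  }  // namespace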

jerryyin

comment created time in a month

Pull request review commenttensorflow/tensorflow

[ROCm][XLA] Adding address space cast in ir_emitter

 limitations under the License.
 #include "tensorflow/compiler/xla/window_util.h"
 #include "tensorflow/core/lib/core/errors.h"
 
+namespace {
+
+static llvm::Value* MayAddrSpaceCastArg(llvm::Value* arg,

Could you insert a comment with semantics for this function? What does it accept, what does it return?

jerryyin

comment created time in a month

Pull request review commenttensorflow/tensorflow

[ROCm][XLA] Adding address space cast in ir_emitter

 Status IrEmitter::EmitCallToNestedComputation(
     emitted_function = ir_emitter_nested.GetEmittedFunction();
   }
 
-  std::vector<llvm::Value*> arguments(operands.begin(), operands.end());
-  arguments.push_back(output);
+  // For AMDGPU target, may need to addrspacecast alloca variables from

We use full sentences with proper punctuation for comments.

jerryyin

comment created time in a month

Pull request review commenttensorflow/tensorflow

[ROCm][XLA] Updating amdgpu compiler and enabling llvm compiler test

 TEST_F(CpuCompilerTest, HooksTest) {
 }
 
 TEST_F(GpuCompilerTest, HooksTest) {
+#if TENSORFLOW_USE_ROCM
+  gpu::AMDGPUCompiler compiler;
+#elif GOOGLE_CUDA
   gpu::NVPTXCompiler compiler;
+#endif
   TestCompilerHooks(&compiler);

We don't really have a hard rule, since this is all fairly new. My consideration is this: internally, we do not currently have a pre-submit which runs ROCm tests. We have a pre-submit which runs a ROCm build, but it's not even triggered on every commit. So I can very well imagine a tiny well-intentioned change which accidentally breaks the ROCm build, since this is currently not triggered everywhere. This creates a cascade of problems: your team will discover that the bot is broken, or the next person editing the ROCm-specific files will notice that. Thus it is much better to always build both branches, avoiding such issues in the future.

But in this case I understand your point about not being able to compile NVPTXCompiler without CUDA. However, do you think it would be possible to share the entire code block without duplication in gpu_compiler?

jerryyin

comment created time in a month

pull request commenttensorflow/tensorflow

[ROCm] Fix for compile error in //tensorflow/compiler/xla/service:dynamic_padder_test

Looks good, but note that such changes usually require a corresponding change in the BUILD file as well.

deven-amd

comment created time in a month

pull request commenttensorflow/tensorflow

[ROCm] Minor updates to sync some of the ROCm fork contents

Let's break this up into multiple PRs and discuss one by one. It would be also good to understand why you are seeing a compilation failure but we don't. What toolchain do you use?

deven-amd

comment created time in a month

pull request commenttensorflow/tensorflow

[XLA] Small update on the vectorisation of elementwise operations

Also fix XLA profiler to correctly display some operations as TFLOPS instead of FLOPS.

I think this is not in this PR?

Also would it be really hard to write FileCheck tests?

nouiz

comment created time in a month

Pull request review commenttensorflow/tensorflow

[ROCm] Minor updates to sync some of the ROCm fork contents

 limitations under the License.
 
 #include "tensorflow/compiler/xla/service/dynamic_padder.h"
 
+#include "absl/strings/str_replace.h"

If required, this needs to be a separate PR.

deven-amd

comment created time in a month

Pull request review commenttensorflow/tensorflow

[ROCm] Minor updates to sync some of the ROCm fork contents

 limitations under the License.
 #include "absl/container/node_hash_map.h"
 #include "absl/memory/memory.h"
 #include "absl/synchronization/mutex.h"
+#include "tensorflow/core/platform/logging.h"

Is this required?

deven-amd

comment created time in a month

Pull request review commenttensorflow/tensorflow

[ROCm] Minor updates to sync some of the ROCm fork contents

 namespace xla {
 
 // Test that the xla_backend_extra_options flag is parsed correctly.
 TEST(DebugOptionsFlags, ParseXlaBackendExtraOptions) {
-  std::unordered_map<string, string> test_map;
-  string test_string = "aa=bb,cc,dd=,ee=ff=gg";
+  std::unordered_map<std::string, std::string> test_map;
+  std::string test_string = "aa=bb,cc,dd=,ee=ff=gg";

While this change is correct, if required, it should be done in a separate dedicated PR.

deven-amd

comment created time in a month

Pull request review commenttensorflow/tensorflow

[ROCm] Minor updates to sync some of the ROCm fork contents

 TEST(RefcountingHashMapTest, CustomFactory) {
 
 TEST(RefcountingHashMapTest, ForEachEmpty) {
   RefcountingHashMap<int, int> m;
-  int64 count = 0;
+  int64_t count = 0;

Why this change? int64 is used everywhere throughout the codebase.

deven-amd

comment created time in a month

Pull request review commenttensorflow/tensorflow

[ROCm] Minor updates to sync some of the ROCm fork contents

 TYPED_TEST(NcclManagerTest, MultiNodeBroadcast) {
   this->RunMultiNodeBroadcastTest(num_nodes, num_ranks_per_node,
                                   /*src_node=*/0, /*src_local_rank=*/0,
                                   /*in_place=*/true);
-#endif
 }
+#endif

How was this even compiling before?

deven-amd

comment created time in a month

Pull request review commenttensorflow/tensorflow

[ROCm] XLA unit-test updates for the ROCm platform

 TEST_F(GpuKernelTilingTest, ColumnReductionMOFUnrolled) {
   auto hlo_module =
       ParseAndReturnVerifiedModule(kHloString, ConfigWithoutLayoutAssignment())
           .ValueOrDie();
-  CompileAndVerifyIr(std::move(hlo_module),
-                     R"(
+  auto expected_ir = is_built_with_rocm_ ? R"(
+; CHECK-LABEL: define amdgpu_kernel void @fusion
+;
+; CHECK-LABEL: atomic_op_loop_body{{.*}}:
+; CHECK: %[[fadd:.*]] = fadd float %{{.*}}, %{{.*}}
+; CHECK: %[[bitcast:.*]] = bitcast float %[[fadd]] to i32
+; CHECK: %{{.*}} = cmpxchg i32* %{{.*}}, i32 %{{.*}}, i32 %[[bitcast]]

Optional: I think the important bit is that we have 4 cmpxchg instructions, I would even match only those.
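As a sketch of that suggestion (the pattern below is illustrative, not the PR's actual code), the branch could shrink to just the atomic-loop count:

  auto expected_ir = R"(
; CHECK: cmpxchg
; CHECK: cmpxchg
; CHECK: cmpxchg
; CHECK: cmpxchg
)";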

deven-amd

comment created time in a month

Pull request review commenttensorflow/tensorflow

[ROCm] XLA unit-test updates for the ROCm platform

 TEST_F(GpuSliceInputFusionTest, InputFusionWithOnlyOneSlice) {
   auto hlo_module =
       ParseAndReturnVerifiedModule(kHloString, ConfigWithoutLayoutAssignment())
           .ValueOrDie();
-  CompileAndVerifyIr(std::move(hlo_module),
-                     R"(
+  auto expected_ir = is_built_with_rocm_ ? R"(
+; CHECK-LABEL: define amdgpu_kernel void @fusion

OK sure, in general we have to weigh the risk of one of the branches getting stale (recall that currently we do not run ROCM tests internally, so it's possible that someone would update the nvidia branch, but not the ROCM branch) against the exact pattern you want to match. So here I would say that matching amdgpu_kernel is not important enough to have two branches, but since the pattern is small I don't care that much either.

deven-amd

comment created time in a month

Pull request review commenttensorflow/tensorflow

[ROCm] XLA unit-test updates for the ROCm platform

 TEST_F(GpuSliceInputFusionTest, InputFusionWithOnlyOneSlice) {
   auto hlo_module =
       ParseAndReturnVerifiedModule(kHloString, ConfigWithoutLayoutAssignment())
           .ValueOrDie();
-  CompileAndVerifyIr(std::move(hlo_module),
-                     R"(
+  auto expected_ir = is_built_with_rocm_ ? R"(
+; CHECK-LABEL: define amdgpu_kernel void @fusion

Could we just optionally match amdgpu_kernel using a regexp instead of duplication?
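A hedged sketch of the single-branch form, using FileCheck's {{...}} regex syntax to make the calling convention optional:

  auto expected_ir = R"(
; CHECK-LABEL: define {{(amdgpu_kernel )?}}void @fusion
)";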

deven-amd

comment created time in a month

Pull request review commenttensorflow/tensorflow

[ROCm] XLA unit-test updates for the ROCm platform

 void GpuCodegenTest::CompileAndVerifyPtx(
   std::unique_ptr<Executable> executable =
       std::move(CompileToExecutable(std::move(hlo_module)).ValueOrDie());
   string ptx_str(static_cast<GpuExecutable*>(executable.get())->text());
+
+  // On the ROCM platform the "ptx" string is not populated for the compiled
+  // executable, and hence the "ptx_str" will be empty. So disabling the
+  // pattern check on the ROCm platform
+#if !defined(TENSORFLOW_USE_ROCM)

Sorry, same comment as before, would it be possible to make it a runtime check?

deven-amd

comment created time in a month

Pull request review commenttensorflow/tensorflow

[ROCm] XLA unit-test updates for the ROCm platform

 TEST_F(GpuSliceInputFusionTest, InputFusionWithOnlyOneSlice) {
   auto hlo_module =
       ParseAndReturnVerifiedModule(kHloString, ConfigWithoutLayoutAssignment())
           .ValueOrDie();
-  CompileAndVerifyIr(std::move(hlo_module),
-                     R"(
+  auto expected_ir = is_built_with_rocm_ ? R"(
+; CHECK-LABEL: define amdgpu_kernel void @fusion

Same applies below.

deven-amd

comment created time in a month

Pull request review commenttensorflow/tensorflow

[ROCm] XLA unit-test updates for the ROCm platform

 class GpuFtzDisabledTest : public GpuFtzTest {
 };
 
 // Check that we emit mul.ftz.f32 when in ftz mode, and plain mul.f32 otherwise.
+//
+// On the ROCM platform the "ptx" string is not populated for the compiled
+// executable, and hence the call to CompileAdnVerifyPtx does not do the
+// "VerifyPtx" part, it merely compiles the executable

Typo: CompileAndVerifyPtx.

Are we testing anything in ROCM mode here at all? Would you like to just test the compilation?

Currently the API creates a false impression of testing on ROCM which ideally would be good to avoid.

You could also have a test-local function e.g. CompileAndOptionallyVerifyPtx, which would verify PTX only if running on CUDA?
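A possible shape for such a helper; the name and signature below are illustrative, not an existing API:

  void CompileAndOptionallyVerifyPtx(std::unique_ptr<HloModule> hlo_module,
                                     absl::string_view pattern) {
    if (is_built_with_rocm_) {
      // ROCm does not populate the "ptx" string, so only check that the
      // module compiles.
      EXPECT_TRUE(CompileToExecutable(std::move(hlo_module)).ok());
    } else {
      CompileAndVerifyPtx(std::move(hlo_module), pattern);
    }
  }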

deven-amd

comment created time in a month

Pull request review commenttensorflow/tensorflow

[ROCm] XLA unit-test updates for the ROCm platform

 TEST_F(GpuIndexTest, CompatibleUseLinearIndexWithReshapeAndBroadcast) {
                     .ValueOrDie();
 
   // Check the optimized IR reuses the linear index by calculating modulo 14.
-  CompileAndVerifyIr(std::move(module),
-                     R"(
+
+  // In the IR generated for AMDGPUs, we do not seem to have the
+  // the addrspace(1) attribute for the lines being checked by the following
+  // patterns still need to investigate why that is the case, and whether or not
+  // it is ok
+  auto expected_ir = is_built_with_rocm_ ? R"(
+; CHECK: %[[urem1:.*]] = urem i{{[0-9]*}} %[[linear_index:.*]], 14
+; CHECK: %[[bitcast:.*]] = bitcast i8* %[[alloc:.*]] to float*

It's probably not worth it to duplicate the pattern to catch the addrspace cast. Could you use a regexp to optionally match an addrspace cast instead?
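For instance, a sketch of a single pattern where the addrspace qualifier is matched optionally (the regex is illustrative):

  auto expected_ir = R"(
; CHECK: %[[urem1:.*]] = urem i{{[0-9]*}} %[[linear_index:.*]], 14
; CHECK: %[[bitcast:.*]] = bitcast i8{{( addrspace\(1\))?}}* %[[alloc:.*]] to float{{( addrspace\(1\))?}}*
)";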

deven-amd

comment created time in a month

Pull request review commenttensorflow/tensorflow

[ROCm] XLA unit-test updates for the ROCm platform

 void GpuCodegenTest::CompileAndVerifyPtx(
   std::unique_ptr<Executable> executable =
       std::move(CompileToExecutable(std::move(hlo_module)).ValueOrDie());
   string ptx_str(static_cast<GpuExecutable*>(executable.get())->text());
+
+  // On the ROCM platform the "ptx" string is not populated for the compiled
+  // executable, and hence the "ptx_str" will be empty. So disabling the
+  // pattern check on the ROCm platform
+#if !defined(TENSORFLOW_USE_ROCM)

You even have an is_built_with_rocm_ variable now.

deven-amd

comment created time in a month

PR closed tensorflow/tensorflow

[XLA] Update the path for the LLVM FileCheck executable cla: yes comp:gpu ready to pull size:XS

The path for the FileCheck executable needs to be updated as a consequence of the following commit.

https://github.com/tensorflow/tensorflow/commit/d6ba353dd974f3d883734e96653cdee7fe6abfbc

After that commit the following test (and many others) start failing with the following error

bazel test //tensorflow/compiler/xla/service/cpu/tests:cpu_intrinsic_test
...
...
2020-01-10 19:48:15.756796: W tensorflow/compiler/xla/tests/filecheck.cc:72]
Tried to execute FileCheck at /root/.cache/bazel/_bazel_root/efb88f6336d9c4a18216fb94287b8d97/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/compiler/xla/service/cpu/tests/cpu_intrinsic_test.runfiles/org_tensorflow/external/llvm/FileCheck

2020-01-10 19:48:15.756884: W tensorflow/compiler/xla/tests/filecheck.cc:74]
NOTE: FileCheck binary does not exist!
...
...

This fix updates the path of the LLVM FileCheck executable to the correct one.


/cc @cheshire @whchung

@cheshire , the bazel test command above was failing in the upstream repo as of a few minutes ago. Note that I was running without --config=rocm (i.e. with a CPU only TF build)

+1 -1

6 comments

1 changed file

deven-amd

pr closed time in a month

pull request commenttensorflow/tensorflow

[XLA] Update the path for the LLVM FileCheck executable

Seems you have found a sizeable hole in our OSS testing, thanks! This should be fixed in a different layer; I'll push out a fix shortly.

deven-amd

comment created time in a month

Pull request review commenttensorflow/tensorflow

[ROCm] Minor updates to sync some of the ROCm fork contents

 def testBasicNotActuallyTriangular(self):
       self._VerifyTriangularSolveCombo(a.astype(dtype), b.astype(dtype))
 
   def testBasicComplexDtypes(self):
+
+    if xla_test.test.is_built_with_rocm():
+      # TODO(rocm)

The TF convention is to have usernames of responsible people inside the TODO block, not a feature name. It might be better to remove this TODO comment entirely.

deven-amd

comment created time in a month

Pull request review commenttensorflow/tensorflow

[ROCm] XLA unit-test updates for the ROCm platform

 TEST_F(GpuKernelTilingTest, RowReductionWithSmallDimensionNotTiled) {
       ParseAndReturnVerifiedModule(kHloString, ConfigWithoutLayoutAssignment())
           .ValueOrDie();
   CompileAndVerifyIr(std::move(hlo_module),
+#if TENSORFLOW_USE_ROCM

I think I have replied using an email, but just in case it didn't make it:

There are global getters. You could either do PlatformUtil::GetDefaultPlatform() and try to get a stream executor from there (e.g. https://github.com/tensorflow/tensorflow/blob/master/tensorflow/compiler/xla/tests/while_test.cc#L1267), or check whether se::MultiPlatformManager::PlatformWithName returns a platform for a given name (cf. https://github.com/tensorflow/tensorflow/blob/master/tensorflow/compiler/xla/service/gpu/buffer_comparator_test.cc#L35)

Having a runtime switch has the additional advantage that both branches are always compiled, so editing one would not accidentally break the compilation of the other.

deven-amd

comment created time in a month

pull request commenttensorflow/tensorflow

[ROCm] XLA unit-test updates for the ROCm platform

There are global getters. You could either do PlatformUtil::GetDefaultPlatform() and try to get a stream executor from there (e.g. https://github.com/tensorflow/tensorflow/blob/master/tensorflow/compiler/xla/tests/while_test.cc#L1267), or check whether se::MultiPlatformManager::PlatformWithName returns a platform for a given name (cf. https://github.com/tensorflow/tensorflow/blob/master/tensorflow/compiler/xla/service/gpu/buffer_comparator_test.cc#L35 )
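As a concrete sketch of that runtime check (the header path and the "ROCM" platform name are my assumptions, so double-check them):

  #include "tensorflow/stream_executor/multi_platform_manager.h"

  // Returns true if a ROCm StreamExecutor platform is registered in this build.
  static bool IsRocmPlatformAvailable() {
    return stream_executor::MultiPlatformManager::PlatformWithName("ROCM").ok();
  }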


Thought about this some more, and I do not know whether this is feasible.

Currently we can use the stream executor to do a Runtime check to determine whether we are using ROCm or CUDA. (see example here : https://github.com/tensorflow/tensorflow/blob/master/tensorflow/compiler/xla/service/gpu/gpu_conv_algorithm_picker.cc#L322-L326 )

How do I get access to a stream executor object within this test object?


deven-amd

comment created time in a month

pull request commenttensorflow/tensorflow

[XLA] Update the path for the LLVM FileCheck executable

@jpienaar Any objections on submitting this? Or should we wait?

deven-amd

comment created time in a month

pull request commenttensorflow/tensorflow

[XLA] Update the path for the LLVM FileCheck executable

@jpienaar Do you think this commit is necessary? Apparently, the previous path has stopped working after https://github.com/tensorflow/tensorflow/commit/d6ba353dd974f3d883734e96653cdee7fe6abfbc . I wonder then how come our OSS bots are still passing.

deven-amd

comment created time in a month

Pull request review commenttensorflow/tensorflow

[ROCm] XLA unit-test updates for the ROCm platform

 StatusOr<bool> RunFileCheck(const std::string& input,
 
   // Invoke FileCheck to check whether input matches `pattern`.
   const char* file_check_path_suffix =
-      "org_tensorflow/external/llvm/FileCheck";
+      "org_tensorflow/external/llvm-project/llvm/FileCheck";

I do not have visibility into your end

You are right that tests are run in a different environment inside Google. However, presubmits which are being run by Github bot externally should be run inside the environment identical to yours, and you sort of do have visibility into it, as the bot should log all the actions it performs and should start from a reproducible state.

In any case, if the tests don't pass for you without this change, could you file a separate PR with this change only, and we could discuss/check it there?

deven-amd

comment created time in a month

Pull request review commenttensorflow/tensorflow

[ROCm] XLA unit-test updates for the ROCm platform

 TEST_F(GpuKernelTilingTest, RowReductionWithSmallDimensionNotTiled) {
       ParseAndReturnVerifiedModule(kHloString, ConfigWithoutLayoutAssignment())
           .ValueOrDie();
   CompileAndVerifyIr(std::move(hlo_module),
+#if TENSORFLOW_USE_ROCM

I'd still prefer a runtime check, if possible, as in the future we might want to switch to shipping both ROCM and CUDA in the same binary.

deven-amd

comment created time in a month

Pull request review commenttensorflow/tensorflow

[ROCm] XLA unit-test updates for the ROCm platform

 StatusOr<bool> RunFileCheck(const std::string& input,
 
   // Invoke FileCheck to check whether input matches `pattern`.
   const char* file_check_path_suffix =
-      "org_tensorflow/external/llvm/FileCheck";
+      "org_tensorflow/external/llvm-project/llvm/FileCheck";

Right, but why are the tests passing now without this change?

deven-amd

comment created time in a month

Pull request review commenttensorflow/tensorflow

[ROCm] XLA unit-test updates for the ROCm platform

 StatusOr<bool> RunFileCheck(const std::string& input,
 
   // Invoke FileCheck to check whether input matches `pattern`.
   const char* file_check_path_suffix =
-      "org_tensorflow/external/llvm/FileCheck";
+      "org_tensorflow/external/llvm-project/llvm/FileCheck";

Why this change?

deven-amd

comment created time in a month

Pull request review commenttensorflow/tensorflow

[ROCm] XLA unit-test updates for the ROCm platform

 def tf_xla_py_test(
                 "--types=DT_HALF,DT_FLOAT,DT_DOUBLE,DT_UINT8,DT_QUINT8,DT_INT8,DT_QINT8,DT_INT32,DT_QINT32,DT_INT64,DT_BOOL,DT_COMPLEX64,DT_COMPLEX128",
             ]
         elif backend == "gpu":
-            backend_args += [
-                "--test_device=" + gpu_xla_device,
-                "--types=DT_HALF,DT_FLOAT,DT_DOUBLE,DT_UINT8,DT_QUINT8,DT_INT8,DT_QINT8,DT_INT32,DT_QINT32,DT_INT64,DT_BOOL,DT_COMPLEX64,DT_COMPLEX128,DT_BFLOAT16",
-            ]
+            if rocm_is_configured():
+                backend_args += [
+                    "--test_device=" + gpu_xla_device,
+                    "--types=DT_HALF,DT_FLOAT,DT_DOUBLE,DT_UINT8,DT_QUINT8,DT_INT8,DT_QINT8,DT_INT32,DT_QINT32,DT_INT64,DT_BOOL,DT_COMPLEX64,DT_BFLOAT16",
+                ]
+            else:
+                backend_args += [

Is it possible to avoid the duplication in these cases? Currently it's hard to see what the difference between the cases is.

deven-amd

comment created time in a month

Pull request review commenttensorflow/tensorflow

[ROCm] XLA unit-test updates for the ROCm platform

 class GpuLdgTest : public GpuCodegenTest {};
 
 // Parameters are never overwritten, so parameter reads should get ld.global.nc
 // reads.
+//
+// On the ROCM platform the "ptx" string is not populated for the compiled

Should we just disable these tests for ROCM?

deven-amd

comment created time in a month

Pull request review commenttensorflow/tensorflow

[ROCm] XLA unit-test updates for the ROCm platform

 TEST_F(GpuKernelTilingTest, RowReductionWithSmallDimensionNotTiled) {
       ParseAndReturnVerifiedModule(kHloString, ConfigWithoutLayoutAssignment())
           .ValueOrDie();
   CompileAndVerifyIr(std::move(hlo_module),
+#if TENSORFLOW_USE_ROCM

Could we check at runtime instead of using an ifdef here?

deven-amd

comment created time in a month

Pull request review commenttensorflow/tensorflow

Add info about TF_DETERMINISTIC_OPS to version 2.1 release notes

 TensorFlow 2.1 will be the last TF release supporting Python 2. Python 2 support
   * Changes rebatching for `tf.data datasets` + distribution strategies for better performance. Note that the dataset also behaves slightly differently, in that the rebatched dataset cardinality will always be a multiple of the number of replicas.
 * `TensorRT`
   * [TensorRT 6.0](https://developer.nvidia.com/tensorrt#tensorrt-whats-new) is now supported and enabled by default. This adds support for more TensorFlow ops including Conv3D, Conv3DBackpropInputV2, AvgPool3D, MaxPool3D, ResizeBilinear, and ResizeNearestNeighbor. In addition, the TensorFlow-TensorRT python conversion API is exported as `tf.experimental.tensorrt.Converter`.
+  * Environment variable `TF_DETERMINISTIC_OPS` added. When set to "true" or "1", this environment variable makes `tf.nn.bias_add` operate deterministically (i.e. reproducibly) when XLA JIT compilation is *not* enabled. It also makes cuDNN convolution and max-pooling operate deterministically. This makes Keras Conv*D and MaxPool*D layers operate deterministically in both the forward and backward directions when running on a CUDA-enabled GPU.

I don't think we'll cherry-pick it to 2.1. I expect it to end up in 2.2. If we are lucky, it might become the default by then, but no promises.

duncanriach

comment created time in 2 months

Pull request review commenttensorflow/tensorflow

Add info about TF_DETERMINISTIC_OPS to version 2.1 release notes

 TensorFlow 2.1 will be the last TF release supporting Python 2. Python 2 support
   * Changes rebatching for `tf.data datasets` + distribution strategies for better performance. Note that the dataset also behaves slightly differently, in that the rebatched dataset cardinality will always be a multiple of the number of replicas.
 * `TensorRT`
   * [TensorRT 6.0](https://developer.nvidia.com/tensorrt#tensorrt-whats-new) is now supported and enabled by default. This adds support for more TensorFlow ops including Conv3D, Conv3DBackpropInputV2, AvgPool3D, MaxPool3D, ResizeBilinear, and ResizeNearestNeighbor. In addition, the TensorFlow-TensorRT python conversion API is exported as `tf.experimental.tensorrt.Converter`.
+  * Environment variable `TF_DETERMINISTIC_OPS` added. When set to "true" or "1", this environment variable makes `tf.nn.bias_add` operate deterministically (i.e. reproducibly) when XLA JIT compilation is *not* enabled. It also makes cuDNN convolution and max-pooling operate deterministically. This makes Keras Conv*D and MaxPool*D layers operate deterministically in both the forward and backward directions when running on a CUDA-enabled GPU.

@duncandean I have landed deterministic reductions in XLA behind the flag (XLA_FLAGS=--xla_gpu_deterministic_reductions), this commit will appear in the repository shortly. We are currently evaluating the performance to see whether we can switch it on by default.

duncanriach

comment created time in 2 months

issue commenttensorflow/tensorflow

TF 2.0 XLA JIT reporting error: "./bin/ptxas not found"

@thincal What are you actually trying to do and why are you using an XLA device? If you just would like your computation to run on GPU, simply removing the with line should be sufficient.

@netw0rkf10w What workflow are you using?

Same issue with TensorFlow 2.1.0-rc2

Sorry, I do not know which fix made it into which release; I can only answer for the behavior on nightly.

Instead of looking for ./bin/ptxas, shouldn't it check to see if ptxas is available first?

It shouldn't matter in which order we look for ptxas; if it's in your $PATH, it will be found.

thincal

comment created time in 2 months

issue commenttensorflow/tensorflow

TF 2.0 XLA JIT reporting error: "./bin/ptxas not found"

@thincal Also to get a more complete error message, could you re-run with environment variable TF_CPP_VMODULE=status=5?

thincal

comment created time in 2 months

issue comment tensorflow/tensorflow

TF 2.0 XLA JIT reporting error: "./bin/ptxas not found"

@thincal XLA:GPU devices are semi-deprecated; I do not recommend using them. As an alternative, could you use @tf.function(experimental_compile=True) (with a nightly TF build)? Out of curiosity, where did you find the information about XLA:GPU devices? I don't think they are mentioned in the documentation anymore. @Artem-B Any clues why we fail here? It should fall back to the driver compilation.
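The suggested alternative looks roughly like this (a minimal sketch; the function body is an arbitrary example):

import tensorflow as tf

@tf.function(experimental_compile=True)  # compile this function with XLA
def scaled_sum(x, y):
    return tf.reduce_sum(x * y, axis=-1)

x = tf.random.normal([8, 128])
y = tf.random.normal([8, 128])
print(scaled_sum(x, y))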

thincal

comment created time in 2 months

issue comment tensorflow/tensorflow

TF 2.0 XLA JIT reporting error: "./bin/ptxas not found"

@thincal Do you want to try the tf-nightly-gpu package? I'm not sure if the fix made it into 2.0.0.

@sanjoy I'll update the error message to suggest fixing the PATH.
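A small sanity check after switching packages (assuming pip install tf-nightly-gpu in a fresh environment; nightly builds typically report a ".dev" version suffix):

import tensorflow as tf

# Confirms the nightly build is the one actually being imported.
print(tf.__version__)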

thincal

comment created time in 2 months

pull request comment tensorflow/tensorflow

Better warning log on allocation failures.

OK, free indeed seems better, since that's what the sysinfo documentation says (http://man7.org/linux/man-pages/man2/sysinfo.2.html).

deepakm

comment created time in 2 months

pull request comment tensorflow/tensorflow

Better warning log on allocation failures.

Perhaps "available"?

deepakm

comment created time in 2 months

Pull request review comment tensorflow/tensorflow

Implement slice input fusion.

 Status IrEmitterUnnested::EmitConstantGlobals() {
   return Status::OK();
 }
 
+// Emits code for slices based on the below structure. An if statement with
+// a guarding condition is generated for each ROOT slice.
+//
+// Pseudo code:
+//
+// Compute values of slice input operands
+//
+// Compute guarding_cond0
+// if (guarding_cond0) {
+//   Write to output of slice0
+// }
+//
+// Compute guarding_cond1
+// if (guarding_cond1) {
+//   Write to output of slice1
+// }
+//
+void IrEmitterUnnested::EmitElementForInputFusibleSlices(
+    HloInstruction* unnested_hlo, const llvm_ir::IrArray::Index& index) {
+  VLOG(10) << "Emitting slice input fusion for " << unnested_hlo->ToString();
+
+  HloInstruction* slice_or_tuple = unnested_hlo->fused_expression_root();
+  auto slice_instructions = [&]() -> absl::Span<HloInstruction* const> {
+    if (slice_or_tuple->opcode() == HloOpcode::kSlice) {
+      return absl::Span<HloInstruction* const>(&slice_or_tuple, 1);
+    }
+    CHECK(slice_or_tuple->opcode() == HloOpcode::kTuple);
+    return slice_or_tuple->operands();
+  }();
+
+  // Emit input operand values of slices.
+  std::vector<llvm::Value*> input_ir_values;
+  GpuElementalIrEmitter elem_emitter(hlo_module_config_, module_, &b_,
+                                     GetNestedComputer());
+  FusedIrEmitter fused_emitter(GetGeneratorForOperandIrArrays(unnested_hlo),
+                               &elem_emitter);
+  TF_CHECK_OK(unnested_hlo->fused_expression_root()->Accept(&fused_emitter));
+  for (const HloInstruction* slice : slice_instructions) {
+    auto input_generator = fused_emitter.GetGenerator(slice->operand(0));
+    input_ir_values.push_back(input_generator(index).ValueOrDie());
+  }
+
+  // Emit for slice_instructions.
+  KernelSupportLibrary ksl(&b_, llvm_ir::UnrollMode::kDefaultUnroll);
+  for (int64 i = 0; i < slice_instructions.size(); ++i) {
+    HloInstruction* slice = slice_instructions[i];
+
+    // guarding_cond := index >= start && index < limit, for each dim.
+    std::vector<llvm::Value*> index_within_ranges;
+    for (size_t dim = 0; dim < slice->slice_starts().size(); ++dim) {
+      CHECK(slice->slice_strides(dim) == 1);
+      auto larger_or_equal_than_start = b_.CreateICmpSGE(
+          index.multidim()[dim],
+          index.GetConstantWithIndexType(slice->slice_starts(dim)));
+      llvm::Value* smaller_than_limit = b_.CreateICmpSLT(
+          index.multidim()[dim],
+          index.GetConstantWithIndexType(slice->slice_limits(dim)));
+      llvm::Value* within_range =
+          b_.CreateAnd(larger_or_equal_than_start, smaller_than_limit);
+      index_within_ranges.push_back(within_range);
+    }
+    llvm::Value* guarding_cond = b_.CreateAnd(index_within_ranges);
+
+    auto emit_slice_elem_func = [&]() -> void {

Similarly, the brackets after [&] can be dropped.

trentlo

comment created time in 2 months

Pull request review comment tensorflow/tensorflow

Implement slice input fusion.

 bool ImplementedAsLibraryCall(const HloInstruction& hlo);
 // kept are contiguous in the input of the reduce instruction.
 bool IsReductionFromOrToContiguousDimensions(const HloInstruction& reduce);
 
+// Returns whether unnested_hlo is an input fusion whose root is either a slice
+// or a tuple of slices. If verify_no_strides is true, return false unless all

"Returns" here as well.

trentlo

comment created time in 2 months

Pull request review comment tensorflow/tensorflow

Implement slice input fusion.

 Status IrEmitterUnnested::EmitConstantGlobals() {
   return Status::OK();
 }
 
+// Emits code for slices based on the below structure. An if statement with
+// a guarding condition is generated for each ROOT slice.
+//
+// Pseudo code:
+//
+// Compute values of slice input operands
+//
+// Compute guarding_cond0
+// if (guarding_cond0) {
+//   Write to output of slice0
+// }
+//
+// Compute guarding_cond1
+// if (guarding_cond1) {
+//   Write to output of slice1
+// }
+//
+void IrEmitterUnnested::EmitElementForInputFusibleSlices(
+    HloInstruction* unnested_hlo, const llvm_ir::IrArray::Index& index) {
+  VLOG(10) << "Emitting slice input fusion for " << unnested_hlo->ToString();
+
+  HloInstruction* slice_or_tuple = unnested_hlo->fused_expression_root();
+  auto slice_instructions = [&]() -> absl::Span<HloInstruction* const> {
+    if (slice_or_tuple->opcode() == HloOpcode::kSlice) {
+      return absl::Span<HloInstruction* const>(&slice_or_tuple, 1);
+    }
+    CHECK(slice_or_tuple->opcode() == HloOpcode::kTuple);
+    return slice_or_tuple->operands();
+  }();
+
+  // Emit input operand values of slices.
+  std::vector<llvm::Value*> input_ir_values;
+  GpuElementalIrEmitter elem_emitter(hlo_module_config_, module_, &b_,
+                                     GetNestedComputer());
+  FusedIrEmitter fused_emitter(GetGeneratorForOperandIrArrays(unnested_hlo),
+                               &elem_emitter);
+  TF_CHECK_OK(unnested_hlo->fused_expression_root()->Accept(&fused_emitter));
+  for (const HloInstruction* slice : slice_instructions) {
+    auto input_generator = fused_emitter.GetGenerator(slice->operand(0));
+    input_ir_values.push_back(input_generator(index).ValueOrDie());
+  }
+
+  // Emit for slice_instructions.
+  KernelSupportLibrary ksl(&b_, llvm_ir::UnrollMode::kDefaultUnroll);
+  for (int64 i = 0; i < slice_instructions.size(); ++i) {
+    HloInstruction* slice = slice_instructions[i];
+
+    // guarding_cond := index >= start && index < limit, for each dim.
+    std::vector<llvm::Value*> index_within_ranges;
+    for (size_t dim = 0; dim < slice->slice_starts().size(); ++dim) {
+      CHECK(slice->slice_strides(dim) == 1);
+      auto larger_or_equal_than_start = b_.CreateICmpSGE(
+          index.multidim()[dim],
+          index.GetConstantWithIndexType(slice->slice_starts(dim)));
+      llvm::Value* smaller_than_limit = b_.CreateICmpSLT(
+          index.multidim()[dim],
+          index.GetConstantWithIndexType(slice->slice_limits(dim)));
+      llvm::Value* within_range =
+          b_.CreateAnd(larger_or_equal_than_start, smaller_than_limit);
+      index_within_ranges.push_back(within_range);
+    }
+    llvm::Value* guarding_cond = b_.CreateAnd(index_within_ranges);
+
+    auto emit_slice_elem_func = [&]() -> void {
+      const std::vector<llvm::Value*>& src_multidim = index.multidim();
+      std::vector<llvm::Value*> dst_multidim(src_multidim.size());
+      for (size_t dim = 0; dim < src_multidim.size(); ++dim) {
+        dst_multidim[dim] =
+            Sub(src_multidim[dim],
+                index.GetConstantWithIndexType(slice->slice_starts(dim)));
+      }
+      ShapeIndex shape_index = (slice_or_tuple->opcode() == HloOpcode::kSlice)
+                                   ? ShapeIndex()
+                                   : ShapeIndex({i});
+      llvm_ir::IrArray src_ir_array =
+          GetIrArray(*unnested_hlo, *unnested_hlo, shape_index);
+      IrArray::Index slice_dst_index(dst_multidim, slice->shape(),
+                                     index.GetType());
+      llvm::Value* dst_addr = src_ir_array.EmitArrayElementAddress(
+          slice_dst_index, &b_, "slice.dest");
+      b_.CreateStore(input_ir_values[i], dst_addr);
+    };
+
+    ksl.If(StrCat("slice", i), guarding_cond, emit_slice_elem_func);
+  }
+}
+
+Status IrEmitterUnnested::EmitInputFusibleNonStridedSlices(
+    HloInstruction* unnested_hlo) {
+  constexpr int unroll_factor = 1;
+  std::unique_ptr<KernelThunk> kernel_thunk = BuildKernelThunk(
+      unnested_hlo, /*implements_whole_instruction=*/true, unroll_factor);
+
+  TF_ASSIGN_OR_RETURN(Shape element_shape,
+                      GetConsistentInputShapeForRootSlices(*unnested_hlo));
+  LaunchDimensions launch_dimensions = CalculateLaunchDimensions(
+      element_shape, ir_emitter_context_->device_description(), unroll_factor);
+  UpdateLaunchDimensions(launch_dimensions, kernel_thunk.get(),
+                         ir_emitter_context_->llvm_module());
+
+  auto loop_body_generator =

Nitpick: I would just declare the lambda inline inside the call, but up to you.

trentlo

comment created time in 2 months

Pull request review comment tensorflow/tensorflow

Implement slice input fusion.

 Status IrEmitterUnnested::EmitConstantGlobals() {
   return Status::OK();
 }
 
+// Emits code for slices based on the below structure. An if statement with
+// a guarding condition is generated for each ROOT slice.
+//
+// Pseudo code:
+//
+// Compute values of slice input operands
+//
+// Compute guarding_cond0
+// if (guarding_cond0) {
+//   Write to output of slice0
+// }
+//
+// Compute guarding_cond1
+// if (guarding_cond1) {
+//   Write to output of slice1
+// }
+//
+void IrEmitterUnnested::EmitElementForInputFusibleSlices(
+    HloInstruction* unnested_hlo, const llvm_ir::IrArray::Index& index) {
+  VLOG(10) << "Emitting slice input fusion for " << unnested_hlo->ToString();
+
+  HloInstruction* slice_or_tuple = unnested_hlo->fused_expression_root();
+  auto slice_instructions = [&]() -> absl::Span<HloInstruction* const> {

Actually you can just write [&] -> ...

trentlo

comment created time in 2 months

Pull request review comment tensorflow/tensorflow

Implement slice input fusion.

 llvm::Type* GetIndexTypeForKernel(const HloInstruction* hlo, int64 launch_size,
   return b->getInt32Ty();
 }
 
+// Get the input shape of the ROOT slices, which will be used as the kernel

Another nitpick: for function docstrings we use declarative rather than imperative, cf. https://google.github.io/styleguide/cppguide.html#Function_Comments

This is applicable here and below.

trentlo

comment created time in 2 months

Pull request review comment tensorflow/tensorflow

Implement slice input fusion.

 bool ImplementedAsLibraryCall(const HloInstruction& hlo);
 // kept are contiguous in the input of the reduce instruction.
 bool IsReductionFromOrToContiguousDimensions(const HloInstruction& reduce);
 
+// Whether it is an input fusion whose root is either a non-strided slice
+// or a tuple of non-strided slices.
+bool IsInputFusibleSlices(const HloInstruction& unnested_hlo,

Sorry for nitpicking, but now the first two sentences almost duplicate each other. Can we write "Returns ..." with all the relevant information after it?

trentlo

comment created time in 2 months

Pull request review comment tensorflow/tensorflow

Implement slice input fusion.

 Status IrEmitterUnnested::EmitConstantGlobals() {
   return Status::OK();
 }
 
+// Overall, emit code for slices based on the below structure. A `slice_guard_i`
+// and a `slice_i` are generated for each ROOT slice. `slice_guard_i` computes
+// the guarding condition to decide whether it should jump into `slice_i`
+// for writing to the slice output or continue to next `slice_guard_{i+1}`.
+//
+// init_block:
+//   Compute values of slice input operands
+//   Br slice_guard_0
+//
+// slice_guard_0:

I do not want to block your work on this, but it seems that structured control flow would be easier to read and maintain. Do you want to try it and see whether that is indeed the case? I'm having a hard time following what exactly the emitter is doing due to all the stateful blocks.

trentlo

comment created time in 2 months

pull request comment tensorflow/tensorflow

[XLA] Add a new XLA mode: XLALite

@thomasjoerg we can also force-run copybara by adding the kokoro:force-run tag; let me try to do it.

nouiz

comment created time in 2 months

Pull request review comment tensorflow/tensorflow

Implement slice input fusion.

 Status IrEmitterUnnested::EmitConstantGlobals() {
   return Status::OK();
 }
 
+// Overall, emit code for slices based on the below structure. A `slice_guard_i`
+// and a `slice_i` are generated for each ROOT slice. `slice_guard_i` computes
+// the guarding condition to decide whether it should jump into `slice_i`
+// for writing to the slice output or continue to next `slice_guard_{i+1}`.
+//
+// init_block:
+//   Compute values of slice input operands
+//   Br slice_guard_0
+//
+// slice_guard_0:

Sorry for nitpicking, I'm still trying to understand whether we actually need non-structured control flow here.

Is the code snippet below equivalent to

if (guarding_cond0) {
  // Write to output of slice0
}
if (guarding_cond1) {
  // Write to output of slice1
}
...

?
trentlo

comment created time in 2 months
