George Karpenkov (cheshire) · @google · Cupertino, CA · http://metaworld.me

cheshire/onumerical 8

Numerical Library for OCaml

cheshire/django 2

Official clone of the Subversion repository.

cheshire/antlr3-python-example 1

Simple calculator in python, using ANTLR V3

cheshire/django-css 1

django-css is a fork of django_compressor that makes it easy to use CSS compilers with your Django projects. CSS compilers extend CSS syntax to include more powerful features such as variables and nested blocks, and pretty much rock all around.

cheshire/django-whatever 1

Unobtrusive test models creation for django

cheshire/jstree 1

jquery tree plugin

cheshire/mongoengine 1

A Python Object-Document-Mapper for working with MongoDB

cheshire/opam-repository 1

Main public package repository for OPAM, the source package manager of OCaml.

issue commenttensorflow/tensorflow

XLA function crashed when called under GradientTape

Commit 0e4e0c593bf7957aefd29818e2d24caee00c841a fixes your original problem. Could you file another bug for the different problem you are seeing?

EgorLakomkin

comment created time in 9 hours

pull request commenttensorflow/tensorflow

[ROCm] Enforce host memory for DT_INT32 tensors

Sure, but you are changing the logic for everything. Do you know why it is crashing only on ROCm?

ekuznetsov139

comment created time in 10 hours

issue commenttensorflow/tensorflow

XLA function crashed when called under GradientTape

@EgorLakomkin I have a fix for the issue with both inner and outer function compiled, which should be released shortly.

Could you file a separate bug on "XLA compilation requires that operator arguments that represent shapes or dimensions be evaluated to concrete values at compile time" and assign it to me? Also, your Colab link requires permissions I don't have.
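For reference, the class of error being discussed looks roughly like the hypothetical snippet below (this is not the code from the Colab; make_mask and its argument are made up for illustration): the shape argument of an op depends on runtime data, so XLA cannot evaluate it to a concrete value at compile time.

```python
import tensorflow as tf

@tf.function(experimental_compile=True)
def make_mask(lengths):
    # `n` depends on runtime data, so the shape passed to tf.zeros is not a
    # compile-time constant; XLA requires shape/dimension arguments to be
    # evaluated to concrete values at compile time.
    n = tf.reduce_max(lengths)
    return tf.zeros([n, n])

try:
    make_mask(tf.constant([2, 5, 3]))
except Exception as e:  # Exact error type and message vary by TF version.
    print(type(e).__name__, str(e)[:120])
```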

EgorLakomkin

comment created time in 13 hours

issue commenttensorflow/tensorflow

Support for TensorList crossing the XLA/TF boundary is not implemented

@EgorLakomkin This is expected behavior: this feature is not implemented. We do plan to implement it, but it is non-trivial.

TensorArray is supported inside XLA, but not when crossing the TF/XLA boundary. The problem is that taking a derivative from the outside exposes the TensorArray at the boundary, and that conversion is not implemented.

The workaround is to compile the outer function (your simple_train) and potentially remove the inner compiled function (this might not be necessary, but it avoids stumbling into your other issue at github.com/tensorflow/tensorflow/issues/39060).
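A minimal sketch of that workaround (the model and all names except simple_train are hypothetical; this is not your code):

```python
import tensorflow as tf

# Stand-in for the model from the report. Note that its layers do NOT carry
# their own tf.function(experimental_compile=True) wrappers.
model = tf.keras.Sequential(
    [tf.keras.layers.Dense(4, activation="relu"), tf.keras.layers.Dense(1)])
optimizer = tf.keras.optimizers.Adam()
loss_fn = tf.keras.losses.MeanSquaredError()

# Compile the whole training step: any TensorArray created while taking the
# gradient then stays inside the compiled computation instead of crossing
# the TF/XLA boundary.
@tf.function(experimental_compile=True)
def simple_train(x, y):
    with tf.GradientTape() as tape:
        loss = loss_fn(y, model(x, training=True))
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

x = tf.random.uniform([8, 4])
y = tf.random.uniform([8, 1])
print(simple_train(x, y).numpy())
```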

EgorLakomkin

comment created time in 13 hours

pull request commenttensorflow/tensorflow

[XLA] Change the number of threads per block for row reduction

@nouiz I'm generally fine with merging this. Could you add some micro-benchmark results to the commit description demonstrating a speed-up on small artificial test cases?
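To be clear about what I mean by micro-benchmarks, a rough TF-level sketch is below (illustrative only; the shapes are made up, and the numbers in the commit description can come from whatever HLO-level harness you already use):

```python
import time
import tensorflow as tf

@tf.function(experimental_compile=True)
def row_reduce(x):
    return tf.reduce_sum(x, axis=-1)

x = tf.random.uniform([4096, 4096])
row_reduce(x)  # Warm-up call, triggers compilation.

result = None
start = time.perf_counter()
for _ in range(100):
    result = row_reduce(x)
_ = result.numpy()  # Block until the queued work has finished.
print("mean per call: %.3f ms" % ((time.perf_counter() - start) / 100 * 1e3))
```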

nouiz

comment created time in 14 hours

pull request commenttensorflow/tensorflow

[ROCm] Enforce host memory for DT_INT32 tensors

That's a pretty huge non-ROCm-specific change. Could you show the test cases where it is crashing otherwise? I also think the [ROCm] tag is somewhat misleading here.

ekuznetsov139

comment created time in 14 hours

pull request commenttensorflow/tensorflow

[XLA] Remove cross device nodes from clusters

For example, this is the timeline of a transformer model

BTW, for BERT specifically we have found that we get better performance with tf.function(experimental_compile=True): https://github.com/tensorflow/models/blob/master/official/nlp/modeling/layers/transformer.py#L228

Should XLA be able to cluster in this way? I'm actually not sure; @sanjoy, any comments? But in general, if you are changing behavior, it needs to be tested.

zhuzilin

comment created time in 5 days

pull request commenttensorflow/tensorflow

[ROCm] XLA enhancements & bugfixes

Hi,

For commits we generally try to have a self-descriptive message; is it possible to update the title of this PR? (or if there are multiple bugfixes/features, to split it up?)

ekuznetsov139

comment created time in 7 days

Pull request review commenttensorflow/tensorflow

[ROCm] hip-clang / ROCm 3.5 build fixes

 def make_copy_files_rule(repository_ctx, name, srcs, outs):
     cmd = \"""%s \""", )""" % (name, "\n".join(outs), " && \\\n".join(cmds))
-def make_copy_dir_rule(repository_ctx, name, src_dir, out_dir):
+def make_copy_dir_rule(repository_ctx, name, src_dir, out_dir, exceptions=None):

Could you extend the docstring to clarify the semantics of exceptions?

ekuznetsov139

comment created time in 11 days

Pull request review commenttensorflow/tensorflow

[ROCm] hip-clang / ROCm 3.5 build fixes

 bool GpuExecutor::UnloadGpuBinary(const void* gpu_binary) {
     VLOG(3) << "Unloading  HSACO module " << module;
     GpuDriver::UnloadModule(context_, module);
     gpu_binary_to_module_.erase(module_it);
+    const char* mem_it = nullptr;
+    for (auto x : in_memory_modules_)

XLA code style is curly braces in this case

ekuznetsov139

comment created time in 11 days

pull request commenttensorflow/tensorflow

Implementation of relational operator for complex numbers in XLA

Matching NumPy semantics makes sense to me; could we add tests?

joschkabraun

comment created time in 11 days

issue commenttensorflow/tensorflow

XLA function crashed when called under GradientTape

@EgorLakomkin There is one bug and one missing feature here:

  1. Nested tf.function(experimental_compile=True) should not affect the result at all, but here it does (the error you are seeing)
  2. If you remove the outer tf.function you'll get the missing feature: TensorArray transition across TF/XLA boundary is not implemented.

If you are fine with keeping the outer tf.function(experimental_compile=True) you should be able to do it as you are doing now and keep good performance, right? Or do you want to ship it as a library where you have no control over the user code which might take a derivative?

EgorLakomkin

comment created time in 13 days

issue commenttensorflow/tensorflow

XLA function crashed when called under GradientTape

@EgorLakomkin As a workaround, you can remove @tf.function(experimental_compile=True) around the call method inside the layer. Here it does not do anything (since the whole computation is compiled by the outer function), and is causing the crash.
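A sketch of what I mean, with a hypothetical layer (your layer is different; this only shows where the decorator is dropped):

```python
import tensorflow as tf

class MyLayer(tf.keras.layers.Layer):
    # No @tf.function(experimental_compile=True) on `call`: the outer
    # compiled function below already covers this computation.
    def call(self, x):
        return tf.nn.relu(x) + 1.0

layer = MyLayer()

@tf.function(experimental_compile=True)
def train_step(x):
    with tf.GradientTape() as tape:
        tape.watch(x)
        y = tf.reduce_sum(layer(x))
    return tape.gradient(y, x)

print(train_step(tf.random.uniform([4, 4])).numpy())
```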

In general, TensorArray handling in XLA does have some sharp edges...

EgorLakomkin

comment created time in 14 days

Pull request review commenttensorflow/tensorflow

[XLA] Fix a bug in the copy operations

 class KernelMappingScheme {
   KernelMappingScheme(absl::Span<const int64> dims_in_elems,
                       absl::Span<const int64> tile_sizes, int64 num_threads_y,
                       int64 num_threads_x, IndexingOrder indexing_order,
-                      int vector_size)
+                      int vector_size, bool row_contiguous = false)

is_row_contiguous

nouiz

comment created time in 18 days

Pull request review commenttensorflow/tensorflow

[XLA] Fix a bug in the copy operations

+/* Copyright 2020 The TensorFlow Authors. All Rights Reserved.++Licensed under the Apache License, Version 2.0 (the "License");+you may not use this file except in compliance with the License.+You may obtain a copy of the License at++    http://www.apache.org/licenses/LICENSE-2.0++Unless required by applicable law or agreed to in writing, software+distributed under the License is distributed on an "AS IS" BASIS,+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.+See the License for the specific language governing permissions and+limitations under the License.+==============================================================================*/++#include <utility>++#include "tensorflow/compiler/xla/service/gpu/tests/gpu_codegen_test.h"+#include "tensorflow/compiler/xla/service/hlo_instruction.h"+#include "tensorflow/compiler/xla/service/hlo_module_config.h"++namespace xla {+namespace gpu {++namespace {++// WARNING: This tests must be alone in its file!  Otherwise, the+// error isn't caught. We expect and CUDA_ERROR_ILLEGAL_ADDRESS to be

"a CUDA...".

Perhaps the error is always reproducible with cuda-memcheck? We never finished the project to set up a CI bot with it.

nouiz

comment created time in 18 days

issue commenttensorflow/tensorflow

[XLA] bazel tests broken

Adding linkstatic=1 to the nvptx_compiler target seems to fix the issue, but I'm trying to figure out what the unintended consequences would be.

nouiz

comment created time in 20 days

issue commenttensorflow/tensorflow

[XLA] bazel tests broken

I can reproduce the issue.

nouiz

comment created time in 20 days

issue commenttensorflow/tensorflow

experimental_compile regression in 2.2.rc4

The contract says that experimental_compile=True should fail explicitly, and not provide a fallback (cf. tensorflow.org/xla)

ngc92

comment created time in 22 days

issue commenttensorflow/tensorflow

experimental_compile regression in 2.2.rc4

Hi, the error message should be improved, but it basically says that the code in tf.strings.split is not compilable by XLA. If this is a bug, what is the expected behavior here?
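For illustration, a minimal standalone sketch of the situation (the exact error type and message depend on the TF version): wrapping tf.strings.split in tf.function(experimental_compile=True) is expected to raise rather than silently fall back to the regular executor.

```python
import tensorflow as tf

@tf.function(experimental_compile=True)
def split(s):
    # tf.strings.split has no XLA kernel, so compilation is expected to fail
    # explicitly rather than fall back to the non-compiled path.
    return tf.strings.split(s)

try:
    split(tf.constant(["a b c"]))
except Exception as e:  # Exact error type/message vary by TF version.
    print("failed explicitly, as expected:", type(e).__name__)
```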

ngc92

comment created time in 22 days

Pull request review commenttensorflow/tensorflow

[ROCm] Updates for XLA unit-tests on the ROCm platform

 Status GpuCompiler::OptimizeHloPostLayoutAssignment(
     pipeline.AddPass<HloPassFix<GpuTreeReductionRewriter>>();
   }
 
+  // GemmRewriter assumes that all transposes are folded into gemms, but,
+  // since commit 7d529df, this is not always true at this point.
+  // Therefore, rerun transpose folding.
+  pipeline.AddPass<TransposeFolding>(

This is not ROCm-specific, right? Could you open a separate PR for this, ideally with a test case demonstrating the issue? Thanks!!

deven-amd

comment created time in a month

issue commenttensorflow/tensorflow

XLA without autoclustering?

Yup, that's why I asked whether it is the latest TF.

ekuznetsov139

comment created time in a month

issue commenttensorflow/tensorflow

XLA without autoclustering?

@ekuznetsov139 Is there a simple reproducer I could try to see this problem? If you are willing to dive deep, could you see why mark_for_compilation_pass is doing anything at all? There is a filter which should only consider nodes marked with kXlaMustCompileAttr (those are created by tf.function(experimental_compile=True)) when the global JIT setting is on.

ekuznetsov139

comment created time in a month

issue commenttensorflow/tensorflow

[2.2] XLA requires 2x GPU memory with sparse_categorical_crossentropy

@lgeiger Thanks for looking into this! XLA in general makes the memory fragmentation worse, but you are right it should not increase the memory consumption by that much.

We'll track this bug, but in general we have found that it is very difficult to deal with such problems when using autoclustering, so we try to use explicit compilation scopes with tf.function(experimental_compile=True) instead. If you could change the test case to use that, it would be very helpful (but I understand if it's not possible; e.g., here the code inside the lqz model would probably need to be annotated). Investing time into annotation could also make the possible performance impact more apparent (by identifying a chunk in the profiler that is too slow and should be optimized, and adding an explicit annotation around it).

lgeiger

comment created time in a month

issue commenttensorflow/tensorflow

XLA without autoclustering?

Are you using the nightly build?

https://www.tensorflow.org/xla/#explicit_compilation_with_tffunction seems to imply that it is possible to enable XLA for parts of the graph without involving autoclustering.

Yes!

I'm adding the tf.function tag to a function 'gelu' that contains tf.tanh and some cwise arithmetics, but not touching global jit options, etc

Yes, that makes sense; it should be enough to add the @tf.function(experimental_compile=True) annotation (see the sketch at the end of this comment).

My expectation is to produce a number of identical clusters implementing this 'gelu', and possibly a number of clusters implementing its gradients

Basically yes, though calling them "clusters" can cause confusion because it's a separate mechanism from autoclustering. It will not produce a set of identical clusters; the compiled executable is reused for the same input shapes, but it has to recompile for each new shape.

What I'm actually seeing instead, is a number of clusters that seem to be greedily constructed around 'gelu's, but are also sweeping everything in the neighborhood, up to and including GEMMs:

mark_for_compilation_pass is for autoclustering, and actually should not be doing anything at all for tf.function. What is the exact command you are using? Is autoclustering definitely off?

Is this the intended behavior, and can this be avoided?

It honestly sounds like the autoclustering mode is on.

I should note that I see this even if I set autoclustering policy in compiler/jit/xla_gpu_device.cc to kIfExplicitlyRequested.

Are you using the XLA_GPU devices? They are deprecated and cause a lot of confusion and we are trying to remove them, but it takes time. A simple answer here is not to use XLA_GPU device and not to change any settings there.

MarkForCompilationPassImpl::RunEdgeContractionLoop() is executed regardless of the global JIT setting

I think it's always executed, but it should not do anything if nothing is marked for compilation?
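To make this concrete, here is roughly the setup I would expect (a sketch with a hypothetical gelu; no global JIT flags, no XLA_GPU device placement, only the explicit annotation):

```python
import math
import tensorflow as tf

# Only this function is compiled with XLA; autoclustering stays off because
# no global JIT setting is touched and no XLA_GPU device is used.
@tf.function(experimental_compile=True)
def gelu(x):
    return 0.5 * x * (1.0 + tf.tanh(
        math.sqrt(2.0 / math.pi) * (x + 0.044715 * tf.pow(x, 3))))

y1 = gelu(tf.random.uniform([128, 512]))  # Compiles once for this shape.
y2 = gelu(tf.random.uniform([128, 512]))  # Reuses the compiled executable.
y3 = gelu(tf.random.uniform([64, 512]))   # New shape: recompiles.
```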

ekuznetsov139

comment created time in a month

pull request commenttensorflow/tensorflow

[XLA] vectorize row reduction for even row size (new version of PR #37260)

@nouiz I see a small but noticeable regression on this input HLO: https://gist.github.com/cheshire/a7e5e546d5feafdc3db97ef1d89baa91; could you take a look?

nouiz

comment created time in 2 months

issue commenttensorflow/tensorflow

Failed to find bogomips warning

@ymodak Why is this assigned to me?

LeeJH

comment created time in 2 months

Pull request review commenttensorflow/tensorflow

[XLA] vectorize row reduction for even row size

 static IrArray::Index GetUnnormalizedIndex(
   // If the normalization only add a new dimensions of size 1,
   // generate simpler indexing. LLVM doesn't always simplify the more
   // complicated indexing and this prevents it from vectorizing some
-  // cases.
-  if (unnormalized_shape.rank() == 2) {
+  // cases. We do this only for major_to_minor memory layout.
+  if (unnormalized_shape.rank() == 2 && unnormalized_shape.has_layout() &&
+      unnormalized_shape.dimensions()[0] == normalized_shape_index.dims()[1] &&
+      unnormalized_shape.dimensions()[1] == normalized_shape_index.dims()[2] &&
+      unnormalized_shape.layout().minor_to_major(1) == 0) {
     DCHECK_EQ(normalized_shape_index.dims()[0], 0);

Going further, can we make it a CHECK_EQ instead?

nouiz

comment created time in 2 months

Pull request review commenttensorflow/tensorflow

[XLA] vectorize row reduction for even row size

 ReductionCodegenInfo IrEmitterUnnested::ComputeReductionCodegenInfo(     return kWarpSize;   }(); +  bool tile_fit = reduction_dimensions.dimensions[kDimX] %+                      (reduction_tiling[2] * num_threads_x) ==+                  0;++  int cc_major = 0, cc_minor = 0;+  ir_emitter_context_->device_description().cuda_compute_capability(&cc_major,+                                                                    &cc_minor);++  int num_partial_results = 1;+  KernelMappingScheme::IndexingOrder indexing_order = [&]() {+    if (reduction_dimensions.is_row_reduction &&+        // P100, only ttry to vectorize+coales memory access when the+        // tile size fit exactly and dtypes <= 32 bits

"fits"

nouiz

comment created time in 2 months

Pull request review commenttensorflow/tensorflow

[XLA] vectorize row reduction for even row size

 ReductionCodegenInfo IrEmitterUnnested::ComputeReductionCodegenInfo(     return kWarpSize;   }(); +  bool tile_fit = reduction_dimensions.dimensions[kDimX] %+                      (reduction_tiling[2] * num_threads_x) ==+                  0;++  int cc_major = 0, cc_minor = 0;+  ir_emitter_context_->device_description().cuda_compute_capability(&cc_major,+                                                                    &cc_minor);++  int num_partial_results = 1;+  KernelMappingScheme::IndexingOrder indexing_order = [&]() {+    if (reduction_dimensions.is_row_reduction &&+        // P100, only ttry to vectorize+coales memory access when the

"try"

nouiz

comment created time in 2 months

Pull request review commenttensorflow/tensorflow

[XLA] vectorize row reduction for even row size

 static IrArray::Index GetUnnormalizedIndex(
     const Shape& unnormalized_shape, llvm::IRBuilder<>* b_,
     const KernelMappingScheme& kernel_mapping_scheme) {
   DCHECK_EQ(normalized_shape_index.size(), 3);
+  // If the normalization only add a new dimensions of size 1,
+  // generate simpler indexing. LLVM doesn't always simplify the more
+  // complicated indexing and this prevent him from vectorizing some

"prevents it"

nouiz

comment created time in 2 months

Pull request review commenttensorflow/tensorflow

[XLA] vectorize row reduction for even row size

 static IrArray::Index GetUnnormalizedIndex(
     const Shape& unnormalized_shape, llvm::IRBuilder<>* b_,
     const KernelMappingScheme& kernel_mapping_scheme) {
   DCHECK_EQ(normalized_shape_index.size(), 3);
+  // If the normalization only add a new dimensions of size 1,
+  // generate simpler indexing. LLVM doesn't always simplify the more
+  // complicated indexing and this prevent him from vectorizing some
+  // cases.
+  if (unnormalized_shape.rank() == 2) {
+    DCHECK_EQ(normalized_shape_index.dims()[0], 0);
+    auto multidim = normalized_shape_index.multidim();
+    return IrArray::Index({multidim[1], multidim[2]}, unnormalized_shape,

Sorry, I am a bit confused about what is going on here; this code is new. Why rank 2? Isn't [2] out of bounds in this case?

nouiz

comment created time in 2 months

Pull request review commenttensorflow/tensorflow

[ROCm] Create a wrapper header for rocprim and cub

 cc_library(
     deps = ["//third_party/eigen3"],
 )
 
+cc_library(

Any change (however minor) seems to kick the PR back to the end of the queue

I'm not sure that's how it works; usually once the reviewer is looking into it, the turnaround should be quite reasonable.

The system (kinda sorta) works when a correct reviewer is assigned (either manually, or by the triaging team). At this point, for AMD-specific changes it's probably either @chsigg or me (though I normally only review XLA-specific stuff, but I can also help with general reviews).

ekuznetsov139

comment created time in 2 months

Pull request review commenttensorflow/tensorflow

[ROCm] Create a wrapper header for rocprim and cub

 cc_library(
     deps = ["//third_party/eigen3"],
 )
 
+cc_library(

Point taken, but I wasn't really able to work for the last couple of weeks, and I think the PR got stuck in limbo.

ekuznetsov139

comment created time in 2 months

pull request commenttensorflow/tensorflow

[XLA] vectorize row reduction for even row size

@nouiz I'm a bit confused by your benchmark; do you mean the runtime reduction in fusion_1 from 39us to 33us?

nouiz

comment created time in 2 months

Pull request review commenttensorflow/tensorflow

[ROCm] re-enable the test //tensorflow/python:auto_mixed_precision_test_gpu on ROCm

 def test_conv_bn(self):
       self.assertEqual(num_to_fp16,
                        3)  # Before Conv2D:0, Conv2D:1, Conv2D_1:1
       self.assertEqual(num_to_fp32, 1)  # After FusedBatchNormV3:0
-      self.assertAllClose(output_val_ref, output_val, atol=1e-3, rtol=1e-3)
+      # Bump up the tolerance for the ROCm platform
+      # The default tolerance (1e-3) results in a tiny fraction (<1%) of
+      # miscompares on ROCm platform, and hence the tolerance bump
+      tol = 2e-3 if test.is_built_with_rocm else 1e-3

Can we just bump it up for all platforms?

deven-amd

comment created time in 2 months

Pull request review commenttensorflow/tensorflow

[ROCm] CUDA/ROCm shared interface

 limitations under the License.
 #if GOOGLE_CUDA
 #include "third_party/gpus/cuda/include/cuComplex.h"
 #include "third_party/gpus/cuda/include/cuda.h"
+#else
+#include "rocm/include/hip/hip_complex.h"
 #endif
+
 #include "tensorflow/core/platform/types.h"
 #include "tensorflow/core/util/gpu_cuda_alias.h"
 
-namespace tensorflow {
+#if GOOGLE_CUDA
 
+using gpuFloatComplex = cuFloatComplex;
+using gpuDoubleComplex = cuDoubleComplex;
+#define gpuEventRecord cudaEventRecord
+#define gpuEventSynchronize cudaEventSynchronize
+#define gpuEventDestroy cudaEventDestroy
+#define gpuEventCreate cudaEventCreate
+#define gpuEventCreateWithFlags cudaEventCreateWithFlags
+#define gpuEventDisableTiming cudaEventDisableTiming
+typedef cudaStream_t gpuStream_t;
+typedef cudaEvent_t gpuEvent_t;

Could those be done with "using" declarations for consistency?

ekuznetsov139

comment created time in 2 months

Pull request review commenttensorflow/tensorflow

[ROCm] Create a wrapper header for rocprim and cub

 cc_library(
     deps = ["//third_party/eigen3"],
 )
 
+cc_library(

Would cub_or_rocprim be a better name?

ekuznetsov139

comment created time in 2 months

Pull request review commenttensorflow/tensorflow

[ROCm] Reverting ROCm to use MIOpen Find Mode APIs (be default) for convolution

 std::vector<AlgorithmDesc> GetAlgorithms(CudnnConvKind kind,
   return algorithms;
 }
 
-StatusOr<std::vector<se::dnn::ProfileResult>> GetAlgorithms(
+StatusOr<std::vector<se::dnn::ProfileResult>> GetMIOpenAlgorithms(

Makes sense!

deven-amd

comment created time in 2 months

Pull request review commenttensorflow/tensorflow

[ROCm] Reverting ROCm to use MIOpen Find Mode APIs (be default) for convolution

 std::vector<AlgorithmDesc> GetAlgorithms(CudnnConvKind kind,
   return algorithms;
 }
 
-StatusOr<std::vector<se::dnn::ProfileResult>> GetAlgorithms(
+StatusOr<std::vector<se::dnn::ProfileResult>> GetMIOpenAlgorithms(

I'm not sure this change makes sense; this function is used for both ROCm and CUDA, isn't it? And isn't MIOpen ROCm-specific?

deven-amd

comment created time in 2 months

Pull request review commenttensorflow/tensorflow

[ROCm] Optimized training apply ops

 struct ApplyGradientDescent<GPUDevice, T> {   } }; +#if TENSORFLOW_USE_ROCM++#include "rocm/include/hip/hip_complex.h"++// if any kernels involving complex sqrt/rsqrt are compiled with ROCm, build+// process completes without errors,but the resulting executable ends up+// unusable (throwing errors "no device code available for function" for+/// completely unrelated kernels.)+// We also can't cast to hipFloatComplex etc. because (as of 2020-01) HIP does+// not provide sqrt for complex.+// We have no choice but to implement sqrt and rsqrt by hand+template <typename T>+__device__ T impl_sqrt(T x) {+  return sqrt(x);+}+template <typename T>+__device__ T impl_rsqrt(T x) {+  return rsqrt(x);+}+template <>+__device__ Eigen::half impl_sqrt(Eigen::half x) {+  return __float2half(sqrt(__half2float(x)));+}+template <>+__device__ Eigen::half impl_rsqrt(Eigen::half x) {+  return __float2half(rsqrt(__half2float(x)));+}++template <class T>+__device__ std::complex<T> impl_sqrt(std::complex<T> x) {+  T re = x.real(), im = x.imag();+  T mod_x = sqrt(re * re + im * im);+  const T root2 = 0.7071067811865475;+  // we pick the root with the same sign of the imaginary component as the input+  T root[2] = {T(sqrt(mod_x + re) * root2),+               T(sqrt(mod_x - re) * root2 * (im >= 0 ? 1. : -1.))};+  // hcc/clang is really weird with its support of complex in device code;+  // for some reason it does not permit a 2-argument constructor+  return *(reinterpret_cast<std::complex<T>*>(&root));+}++template <class T>+__device__ T rsqrt_helper(T x) {+  return 0.5 * x + 0.125 * x * x + 0.0625 * x * x * x;+}++template <class T>+__device__ std::complex<T> impl_rsqrt(std::complex<T> x) {+  T re = x.real(), im = x.imag();+  T r = rsqrt(re * re + im * im);+  T ar2 = re * r * r;+  const T root2 = 0.7071067811865475;+  T root[2];+  // With float, calculating 1+re*r and 1-re*r may result in excessive errors+  // due to subtraction of two close values. We have to get fancy+  root[0] = sqrt(r * ((std::is_same<T, float>::value && re * r < -0.98)+                          ? rsqrt_helper(im * im * r * r)+                          : 1 + re * r)) *+            root2;+  root[1] = sqrt(r * ((std::is_same<T, float>::value && re * r > 0.98)+                          ? rsqrt_helper(im * im * r * r)+                          : 1 - re * r)) *+            root2 * (im >= 0 ? -1. 
: 1.);+  return *(reinterpret_cast<std::complex<T>*>(&root));+}++template <typename T>+__global__ void ApplyAdagradKernel(GpuLaunchConfig cfg, T* var, T* accum,+                                   const T* lr, const T* grad,+                                   bool update_slots) {+  GPU_1D_KERNEL_LOOP(i, cfg.virtual_thread_count) {+    if (update_slots) accum[i] += grad[i] * grad[i];+    var[i] -= lr[0] * grad[i] * impl_rsqrt(accum[i]);+  }+}++template <typename T>+__global__ void ApplyAdagradV2Kernel(GpuLaunchConfig cfg, T* var, T* accum,+                                     const T* lr, const T* epsilon,+                                     const T* grad, bool update_slots) {+  GPU_1D_KERNEL_LOOP(i, cfg.virtual_thread_count) {+    if (update_slots) accum[i] += grad[i] * grad[i];+    T update = grad[i] / (impl_sqrt(accum[i]) + epsilon[0]);+    var[i] -= lr[0] * update;+  }+}++template <typename T>+__global__ void ApplyAdadeltaKernel(GpuLaunchConfig cfg, T* var, T* accum,+                                    T* accum_update, const T* plr,+                                    const T* prho, const T* peps,+                                    const T* grad) {+  T rho = prho[0];+  T eps = peps[0];+  T lr = plr[0];+  GPU_1D_KERNEL_LOOP(i, cfg.virtual_thread_count) {+    accum[i] = accum[i] * rho + grad[i] * grad[i] * (T(1.0) - rho);+    T update =+        impl_sqrt(accum_update[i] + eps) * grad[i] * impl_rsqrt(accum[i] + eps);+    var[i] -= update * lr;+    accum_update[i] = accum_update[i] * rho + update * update * (T(1.0) - rho);+  }+}++template <typename T>+__global__ void ApplyRMSPropKernel(GpuLaunchConfig cfg, T* var, T* ms, T* mom,+                                   const T* plr, const T* prho,+                                   const T* pmomentum, const T* peps,+                                   const T* grad) {+  T rho = prho[0];+  T eps = peps[0];+  T lr = plr[0];+  T momentum = pmomentum[0];+  GPU_1D_KERNEL_LOOP(i, cfg.virtual_thread_count) {+    ms[i] += (T(1.0) - rho) * (grad[i] * grad[i] - ms[i]);+    mom[i] = mom[i] * momentum + lr * grad[i] * impl_rsqrt(eps + ms[i]);+    var[i] -= mom[i];+  }+}++template <typename T>+__global__ void ApplyCenteredRMSPropKernel(GpuLaunchConfig cfg, T* var, T* mg,+                                           T* ms, T* mom, const T* plr,+                                           const T* prho, const T* pmomentum,+                                           const T* peps, const T* grad) {+  T rho = prho[0];+  T eps = peps[0];+  T lr = plr[0];+  T momentum = pmomentum[0];+  T one_minus_rho = T(1.0) - rho;+  GPU_1D_KERNEL_LOOP(i, cfg.virtual_thread_count) {+    ms[i] += one_minus_rho * (grad[i] * grad[i] - ms[i]);+    mg[i] += one_minus_rho * (grad[i] - mg[i]);+    T denom = (ms[i] - mg[i] * mg[i]) + eps;+    mom[i] = mom[i] * momentum + lr * grad[i] * impl_rsqrt(denom);+    var[i] -= mom[i];+  }+}+#endif++#if TENSORFLOW_USE_ROCM

Nitpick: why not remove this #if and the previous #endif?
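Side note on the math in the diff above, since it is easy to misread: impl_sqrt uses the standard half-angle identities sqrt(z) = sqrt((|z|+re)/2) + i*sign(im)*sqrt((|z|-re)/2). A quick standalone Python check of those identities (not part of the PR, and it ignores the float-precision special-casing done for impl_rsqrt):

```python
import cmath
import math

def half_angle_sqrt(z: complex) -> complex:
    # Same identities as the hand-rolled device code; root2 = 1/sqrt(2).
    re, im = z.real, z.imag
    mod = math.hypot(re, im)
    root2 = 0.7071067811865475
    return complex(math.sqrt(mod + re) * root2,
                   math.copysign(math.sqrt(mod - re) * root2, im))

for z in [3 + 4j, -1 + 1e-3j, 2 - 5j]:
    assert cmath.isclose(half_angle_sqrt(z), cmath.sqrt(z), rel_tol=1e-12)
```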

ekuznetsov139

comment created time in 2 months

Pull request review commenttensorflow/tensorflow

[ROCm] Optimized training apply ops

 struct ApplyGradientDescent<GPUDevice, T> {   } }; +#if TENSORFLOW_USE_ROCM++#include "rocm/include/hip/hip_complex.h"++// if any kernels involving complex sqrt/rsqrt are compiled with ROCm, build+// process completes without errors,but the resulting executable ends up+// unusable (throwing errors "no device code available for function" for+/// completely unrelated kernels.)+// We also can't cast to hipFloatComplex etc. because (as of 2020-01) HIP does+// not provide sqrt for complex.+// We have no choice but to implement sqrt and rsqrt by hand+template <typename T>+__device__ T impl_sqrt(T x) {+  return sqrt(x);+}+template <typename T>+__device__ T impl_rsqrt(T x) {+  return rsqrt(x);+}+template <>+__device__ Eigen::half impl_sqrt(Eigen::half x) {+  return __float2half(sqrt(__half2float(x)));+}+template <>+__device__ Eigen::half impl_rsqrt(Eigen::half x) {+  return __float2half(rsqrt(__half2float(x)));+}++template <class T>+__device__ std::complex<T> impl_sqrt(std::complex<T> x) {+  T re = x.real(), im = x.imag();+  T mod_x = sqrt(re * re + im * im);+  const T root2 = 0.7071067811865475;+  // we pick the root with the same sign of the imaginary component as the input

Nitpick: comments have to be full sentences, start with a capital letter, and end with a full stop.

ekuznetsov139

comment created time in 2 months

pull request commenttensorflow/tensorflow

[ROCm] Fix for a test regression on the ROCm platform - 200207 - 1

Sure, seems like a weird compiler bug? Accepting.

deven-amd

comment created time in 3 months

Pull request review commenttensorflow/tensorflow

[XLA] vectorize row reduction for even row size

 bool MayPreventVectorization(const HloInstruction& hlo) {
       default:
         return false;
     }
+  } else if (hlo.opcode() == HloOpcode::kReduce) {
+    // TODO: check if the to_apply() attribute contain instruction

"contains"

nouiz

comment created time in 3 months

Pull request review commenttensorflow/tensorflow

[XLA] vectorize row reduction for even row size

 static llvm::Value* GetStartOffsetX(const KernelMappingScheme& mapping_scheme,                                     llvm::Value* thread_id_x,                                     llvm::Type* index_ty,                                     llvm::IRBuilder<>* b) {-  if (mapping_scheme.DilatedX()) {+  auto constant = [&](int64 val) {+    return llvm::ConstantInt::get(index_ty, val);+  };+  if (mapping_scheme.GetIndexingOrder() == kStridedIndexingX) {     return thread_id_x;+  } else if (mapping_scheme.GetIndexingOrder() == kLinearStridedIndexingX) {+    return b->CreateMul(thread_id_x, constant(mapping_scheme.GetVectorSize()));   }+  CHECK_EQ(mapping_scheme.GetIndexingOrder(), kLinearIndexingX);   int64 x_num_steps =       mapping_scheme.GetTileSizeX() / mapping_scheme.GetNumThreadsX();-  return b->CreateMul(thread_id_x,-                      llvm::ConstantInt::get(index_ty, x_num_steps));+  return b->CreateMul(thread_id_x, constant(x_num_steps));+}++// This function calls emit_elem_function() x_num_steps times.  If

Just "calls"; also, using backticks for function and variable names would make this more readable.

nouiz

comment created time in 3 months

Pull request review commenttensorflow/tensorflow

[XLA] vectorize row reduction for even row size

 static llvm::Value* GetStartOffsetX(const KernelMappingScheme& mapping_scheme,                                     llvm::Value* thread_id_x,                                     llvm::Type* index_ty,                                     llvm::IRBuilder<>* b) {-  if (mapping_scheme.DilatedX()) {+  auto constant = [&](int64 val) {+    return llvm::ConstantInt::get(index_ty, val);+  };+  if (mapping_scheme.GetIndexingOrder() == kStridedIndexingX) {     return thread_id_x;+  } else if (mapping_scheme.GetIndexingOrder() == kLinearStridedIndexingX) {+    return b->CreateMul(thread_id_x, constant(mapping_scheme.GetVectorSize()));   }+  CHECK_EQ(mapping_scheme.GetIndexingOrder(), kLinearIndexingX);   int64 x_num_steps =       mapping_scheme.GetTileSizeX() / mapping_scheme.GetNumThreadsX();-  return b->CreateMul(thread_id_x,-                      llvm::ConstantInt::get(index_ty, x_num_steps));+  return b->CreateMul(thread_id_x, constant(x_num_steps));+}++// This function calls emit_elem_function() x_num_steps times.  If+// vector_size==1, then each element index passed to+// emit_elem_function() will be separated by step_x. If vector_size>1,+// then it must be a multiple of x_num_steps.  In that case, it+// triggers a different indexing order that is vectorizable by+// LLVM. It generates many groups of calls to emit_elem_function. Each+// group is separated by step_x elements.  Inside a group, elements+// are consecutive. If check_x_tile_bounds is true, then it will check+// if the element index is in bound compared to tile_width before+// calling emit_elem_function.+static void Unroll(

We need a more self-descriptive name than Unroll, perhaps UnrollInnerTileLoopForVectorization?

nouiz

comment created time in 3 months

Pull request review commenttensorflow/tensorflow

[XLA] vectorize row reduction for even row size

 void IrEmitterUnnested::EmitTile(              llvm::Value* y_loc =                  b_.CreateAdd(thread_id_info.thread_id_y,                               b_.CreateMul(y_indvar, num_threads_y));-             for (int64 j = 0; j < x_num_steps; j++) {-               llvm::Value* x_loc =-                   b_.CreateAdd(constant(j * step_x), start_offset_x, "x_loc");-               IrArray::Index source_idx_x =-                   source_idx.AddOffsetToDim(y_loc, kDimY, &b_)-                       .AddOffsetToDim(constant(j * step_x), kDimX, &b_);-               auto emit_element = [&] {-                 return emit_elem_function(source_idx_x, y_loc, x_loc, j);-               };-               if (!x_tile_fits) {-                 ksl->If(loop_name + "_x_in_tile",-                         b_.CreateICmpULT(x_loc, tile_width), emit_element);-               } else {-                 emit_element();-               }+             auto unroll = [&](bool add_index_boundary_condition) {+               return Unroll(add_index_boundary_condition, x_num_steps, step_x,+                             vector_size, loop_name, ksl, start_offset_x, y_loc,+                             tile_width, source_idx, b_, &emit_elem_function);+             };++             // Only try this path when we try to vectorize the loads.+             // Special case when the tile doesn't fit completly for even row

"completely"

nouiz

comment created time in 3 months

Pull request review commenttensorflow/tensorflow

[XLA] vectorize row reduction for even row size

 bool MayPreventVectorization(const HloInstruction& hlo) {
       default:
         return false;
     }
+  } else if (hlo.opcode() == HloOpcode::kReduce) {
+    return false;

Reduction doesn't prevent vectorization. This is the point

I understand; I mean: could you explicitly specify in the comment what needs to be checked and why, e.g. "return false only for row reductions which are ..."?

nouiz

comment created time in 3 months

Pull request review commenttensorflow/tensorflow

[XLA] vectorize row reduction for even row size

+/* Copyright 2020 The TensorFlow Authors. All Rights Reserved.++Licensed under the Apache License, Version 2.0 (the "License");+you may not use this file except in compliance with the License.+You may obtain a copy of the License at++    http://www.apache.org/licenses/LICENSE-2.0++Unless required by applicable law or agreed to in writing, software+distributed under the License is distributed on an "AS IS" BASIS,+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.+See the License for the specific language governing permissions and+limitations under the License.+==============================================================================*/++#include <utility>++#include "tensorflow/compiler/xla/service/gpu/gpu_executable.h"+#include "tensorflow/compiler/xla/service/gpu/tests/gpu_codegen_test.h"+#include "tensorflow/compiler/xla/service/hlo_instruction.h"+#include "tensorflow/compiler/xla/service/hlo_module_config.h"+#include "tensorflow/compiler/xla/service/hlo_parser.h"+#include "tensorflow/compiler/xla/statusor.h"+#include "tensorflow/compiler/xla/tests/filecheck.h"+#include "tensorflow/compiler/xla/tests/hlo_test_base.h"+#include "tensorflow/core/lib/core/status_test_util.h"+#include "tensorflow/core/platform/test.h"+#include "tensorflow/stream_executor/lib/statusor.h"++namespace xla {+namespace gpu {++namespace {++class ReductionVectorizationTest : public GpuCodegenTest {+ protected:+  void EnsureDeterminism(absl::string_view hlo_text) {

Why do you need this method?

nouiz

comment created time in 3 months

Pull request review commenttensorflow/tensorflow

[XLA] vectorize row reduction for even row size

 class KernelMappingScheme {
   // Number of threads used to process elements in the Y direction of a tile.
   const int64 num_threads_y_;
 
-  // When num_threads_x threads process a total of tile_size_x elements in the
-  // X dimension of a tile, each threads process n=tile_size_x/num_threads_x
-  // elements. When dilated_x=false, the n elements processed by a thread are
-  // contiguous. On the other hand, when dilated_x=true the n elements are
-  // dilated by a factor of num_threads_x.
-  const bool dilated_x_;
+  // When num_threads_x threads process a total of tile_size_x
+  // elements in the X dimension of a tile, each threads process
+  // n=tile_size_x/num_threads_x elements.
+  // indexing_order_ define which tile's elements each thread reads.

"indexing_order", "defines", no separating newline.

nouiz

comment created time in 3 months

Pull request review commenttensorflow/tensorflow

[XLA] vectorize row reduction for even row size

 void IrEmitterUnnested::EmitTile(              llvm::Value* y_loc =                  b_.CreateAdd(thread_id_info.thread_id_y,                               b_.CreateMul(y_indvar, num_threads_y));-             for (int64 j = 0; j < x_num_steps; j++) {-               llvm::Value* x_loc =-                   b_.CreateAdd(constant(j * step_x), start_offset_x, "x_loc");-               IrArray::Index source_idx_x =-                   source_idx.AddOffsetToDim(y_loc, kDimY, &b_)-                       .AddOffsetToDim(constant(j * step_x), kDimX, &b_);-               auto emit_element = [&] {-                 return emit_elem_function(source_idx_x, y_loc, x_loc, j);-               };-               if (!x_tile_fits) {-                 ksl->If(loop_name + "_x_in_tile",-                         b_.CreateICmpULT(x_loc, tile_width), emit_element);-               } else {-                 emit_element();-               }+             auto unroll = [&](bool add_index_boundary_condition) {+               return Unroll(add_index_boundary_condition, x_num_steps, step_x,+                             vector_size, loop_name, ksl, start_offset_x, y_loc,+                             tile_width, source_idx, b_, &emit_elem_function);+             };++             // Only try this path when we try to vectorize the loads.+             // Special case when the tile doesn't fit completly for even row+             // size. For odd row size every other row isn't aligned, so can't+             // be vectorized this way by LLVM.+             if (!x_tile_fits &&+                 mapping_scheme.GetIndexingOrder() == kLinearStridedIndexingX) {+               ksl->If(loop_name + "_is_full_tile",+                       // if (block fully fit) {fast path} else {slow path}+                       // tile_width is always exact. For the last block,

I don't understand: what does it mean to be always "exact"?

nouiz

comment created time in 3 months

Pull request review commenttensorflow/tensorflow

[XLA] vectorize row reduction for even row size

 ReductionCodegenInfo IrEmitterUnnested::ComputeReductionCodegenInfo(
       !IsUnrollingColumnReductionBeneficial(unnested_hlo, input_shape,
                                             reduction_dimensions.dimensions[2]);
 
-  if (!dilated_x && !reduction_dimensions.is_row_reduction) {
+  KernelMappingScheme::IndexingOrder indexing_order = [&]() {
+    if (reduction_dimensions.is_row_reduction &&
+        // Only try to vectorize+coales memory access for row of even size.

"for rows"

nouiz

comment created time in 3 months

Pull request review commenttensorflow/tensorflow

[XLA] vectorize row reduction for even row size

 ReductionCodegenInfo IrEmitterUnnested::ComputeReductionCodegenInfo(
     return kWarpSize;
   }();
 
+  int tile_size_x = reduction_tiling[2] * num_threads_x;
+
+  int vector_size = 1;
+  if (indexing_order == kLinearStridedIndexingX) {
+    if (reduction_dimensions.dimensions[2] % 2 == 0 &&
+        // Assuming XLA will perform the unrolling and LLVM will vectorize,
+        // disable the unroll for case that LLVM doesn't vectorize.

"for the case"

nouiz

comment created time in 3 months

Pull request review commenttensorflow/tensorflow

[XLA] vectorize row reduction for even row size

 void IrEmitterUnnested::EmitTile(              llvm::Value* y_loc =                  b_.CreateAdd(thread_id_info.thread_id_y,                               b_.CreateMul(y_indvar, num_threads_y));-             for (int64 j = 0; j < x_num_steps; j++) {-               llvm::Value* x_loc =-                   b_.CreateAdd(constant(j * step_x), start_offset_x, "x_loc");-               IrArray::Index source_idx_x =-                   source_idx.AddOffsetToDim(y_loc, kDimY, &b_)-                       .AddOffsetToDim(constant(j * step_x), kDimX, &b_);-               auto emit_element = [&] {-                 return emit_elem_function(source_idx_x, y_loc, x_loc, j);-               };-               if (!x_tile_fits) {-                 ksl->If(loop_name + "_x_in_tile",-                         b_.CreateICmpULT(x_loc, tile_width), emit_element);-               } else {-                 emit_element();-               }+             auto unroll = [&](bool add_index_boundary_condition) {+               return Unroll(add_index_boundary_condition, x_num_steps, step_x,+                             vector_size, loop_name, ksl, start_offset_x, y_loc,+                             tile_width, source_idx, b_, &emit_elem_function);+             };++             // Only try this path when we try to vectorize the loads.+             // Special case when the tile doesn't fit completly for even row+             // size. For odd row size every other row isn't aligned, so can't

"so it can't be vectorized by LLVM"

nouiz

comment created time in 3 months

Pull request review commenttensorflow/tensorflow

[XLA] vectorize row reduction for even row size

 static llvm::Value* GetStartOffsetX(const KernelMappingScheme& mapping_scheme,                                     llvm::Value* thread_id_x,                                     llvm::Type* index_ty,                                     llvm::IRBuilder<>* b) {-  if (mapping_scheme.DilatedX()) {+  auto constant = [&](int64 val) {+    return llvm::ConstantInt::get(index_ty, val);+  };+  if (mapping_scheme.GetIndexingOrder() == kStridedIndexingX) {     return thread_id_x;+  } else if (mapping_scheme.GetIndexingOrder() == kLinearStridedIndexingX) {+    return b->CreateMul(thread_id_x, constant(mapping_scheme.GetVectorSize()));   }+  CHECK_EQ(mapping_scheme.GetIndexingOrder(), kLinearIndexingX);   int64 x_num_steps =       mapping_scheme.GetTileSizeX() / mapping_scheme.GetNumThreadsX();-  return b->CreateMul(thread_id_x,-                      llvm::ConstantInt::get(index_ty, x_num_steps));+  return b->CreateMul(thread_id_x, constant(x_num_steps));+}++auto Unroll(bool add_index_boundary_condition, int64 x_num_steps, int64 step_x,
  • a short docstring on what this function is doing
nouiz

comment created time in 3 months

Pull request review commenttensorflow/tensorflow

[XLA] vectorize row reduction for even row size

 void IrEmitterUnnested::EmitTile(              llvm::Value* y_loc =                  b_.CreateAdd(thread_id_info.thread_id_y,                               b_.CreateMul(y_indvar, num_threads_y));-             for (int64 j = 0; j < x_num_steps; j++) {-               llvm::Value* x_loc =-                   b_.CreateAdd(constant(j * step_x), start_offset_x, "x_loc");-               IrArray::Index source_idx_x =-                   source_idx.AddOffsetToDim(y_loc, kDimY, &b_)-                       .AddOffsetToDim(constant(j * step_x), kDimX, &b_);-               auto emit_element = [&] {-                 return emit_elem_function(source_idx_x, y_loc, x_loc, j);-               };-               if (!x_tile_fits) {-                 ksl->If(loop_name + "_x_in_tile",-                         b_.CreateICmpULT(x_loc, tile_width), emit_element);-               } else {-                 emit_element();-               }+             auto unroll = [&](bool add_index_boundary_condition) {+               return Unroll(add_index_boundary_condition, x_num_steps, step_x,+                             vector_size, loop_name, ksl, start_offset_x, y_loc,+                             tile_width, source_idx, b_, &emit_elem_function);+             };++             // Only try this path when we try to vectorize the loads.+             // Special case when the tile doesn't fit completly for even row+             // size. For odd row size every other row isn't aligned, so can't+             // be vectorized this way by LLVM.+             if (!x_tile_fits &&+                 mapping_scheme.GetIndexingOrder() == kLinearStridedIndexingX) {+               ksl->If(loop_name + "_is_full_tile",+                       // if (block fully fit) {fast path} else {slow path}

I would remove this line entirely; I think "fast path"/"slow path" is somewhat misleading, as we don't get vectorization at all on the slow path.

nouiz

comment created time in 3 months

Pull request review commenttensorflow/tensorflow

[XLA] vectorize row reduction for even row size

 static llvm::Value* GetStartOffsetX(const KernelMappingScheme& mapping_scheme,                                     llvm::Value* thread_id_x,                                     llvm::Type* index_ty,                                     llvm::IRBuilder<>* b) {-  if (mapping_scheme.DilatedX()) {+  auto constant = [&](int64 val) {+    return llvm::ConstantInt::get(index_ty, val);+  };+  if (mapping_scheme.GetIndexingOrder() == kStridedIndexingX) {     return thread_id_x;+  } else if (mapping_scheme.GetIndexingOrder() == kLinearStridedIndexingX) {+    return b->CreateMul(thread_id_x, constant(mapping_scheme.GetVectorSize()));   }+  CHECK_EQ(mapping_scheme.GetIndexingOrder(), kLinearIndexingX);   int64 x_num_steps =       mapping_scheme.GetTileSizeX() / mapping_scheme.GetNumThreadsX();-  return b->CreateMul(thread_id_x,-                      llvm::ConstantInt::get(index_ty, x_num_steps));+  return b->CreateMul(thread_id_x, constant(x_num_steps));+}++auto Unroll(bool add_index_boundary_condition, int64 x_num_steps, int64 step_x,
  • Also perhaps a better name would be check_x_tile_bounds?
nouiz

comment created time in 3 months

Pull request review commenttensorflow/tensorflow

[XLA] vectorize row reduction for even row size

 ReductionCodegenInfo IrEmitterUnnested::ComputeReductionCodegenInfo(
       !IsUnrollingColumnReductionBeneficial(unnested_hlo, input_shape,
                                             reduction_dimensions.dimensions[2]);
 
-  if (!dilated_x && !reduction_dimensions.is_row_reduction) {
+  KernelMappingScheme::IndexingOrder indexing_order;
+  if (reduction_dimensions.is_row_reduction &&
+      // Only try to vectorize+coales memory access for row of even size.
+      // For odd row size, every other row isn't aligned, so can't be

"For odd row sizes", "so it can't"

nouiz

comment created time in 3 months

Pull request review commenttensorflow/tensorflow

[XLA] vectorize row reduction for even row size

 static llvm::Value* GetStartOffsetX(const KernelMappingScheme& mapping_scheme,                                     llvm::Value* thread_id_x,                                     llvm::Type* index_ty,                                     llvm::IRBuilder<>* b) {-  if (mapping_scheme.DilatedX()) {+  auto constant = [&](int64 val) {+    return llvm::ConstantInt::get(index_ty, val);+  };+  if (mapping_scheme.GetIndexingOrder() == kStridedIndexingX) {     return thread_id_x;+  } else if (mapping_scheme.GetIndexingOrder() == kLinearStridedIndexingX) {+    return b->CreateMul(thread_id_x, constant(mapping_scheme.GetVectorSize()));   }+  CHECK_EQ(mapping_scheme.GetIndexingOrder(), kLinearIndexingX);   int64 x_num_steps =       mapping_scheme.GetTileSizeX() / mapping_scheme.GetNumThreadsX();-  return b->CreateMul(thread_id_x,-                      llvm::ConstantInt::get(index_ty, x_num_steps));+  return b->CreateMul(thread_id_x, constant(x_num_steps));+}++auto Unroll(bool add_index_boundary_condition, int64 x_num_steps, int64 step_x,

static void?

nouiz

comment created time in 3 months

Pull request review commenttensorflow/tensorflow

[XLA] vectorize row reduction for even row size

 void IrEmitterUnnested::EmitTile(              llvm::Value* y_loc =                  b_.CreateAdd(thread_id_info.thread_id_y,                               b_.CreateMul(y_indvar, num_threads_y));-             for (int64 j = 0; j < x_num_steps; j++) {-               llvm::Value* x_loc =-                   b_.CreateAdd(constant(j * step_x), start_offset_x, "x_loc");-               IrArray::Index source_idx_x =-                   source_idx.AddOffsetToDim(y_loc, kDimY, &b_)-                       .AddOffsetToDim(constant(j * step_x), kDimX, &b_);-               auto emit_element = [&] {-                 return emit_elem_function(source_idx_x, y_loc, x_loc, j);-               };-               if (!x_tile_fits) {-                 ksl->If(loop_name + "_x_in_tile",-                         b_.CreateICmpULT(x_loc, tile_width), emit_element);-               } else {-                 emit_element();-               }+             auto unroll = [&](bool add_index_boundary_condition) {+               return Unroll(add_index_boundary_condition, x_num_steps, step_x,+                             vector_size, loop_name, ksl, start_offset_x, y_loc,+                             tile_width, source_idx, b_, &emit_elem_function);+             };++             // Only try this path when we try to vectorize the loads.

"when we are vectorizing the loads" I think it more clear, we are not merely trying.

nouiz

comment created time in 3 months

Pull request review commenttensorflow/tensorflow

[XLA] vectorize row reduction for even row size

 static llvm::Value* GetStartOffsetX(const KernelMappingScheme& mapping_scheme,                                     llvm::Value* thread_id_x,                                     llvm::Type* index_ty,                                     llvm::IRBuilder<>* b) {-  if (mapping_scheme.DilatedX()) {+  auto constant = [&](int64 val) {+    return llvm::ConstantInt::get(index_ty, val);+  };+  if (mapping_scheme.GetIndexingOrder() == kStridedIndexingX) {     return thread_id_x;+  } else if (mapping_scheme.GetIndexingOrder() == kLinearStridedIndexingX) {+    return b->CreateMul(thread_id_x, constant(mapping_scheme.GetVectorSize()));   }+  CHECK_EQ(mapping_scheme.GetIndexingOrder(), kLinearIndexingX);   int64 x_num_steps =       mapping_scheme.GetTileSizeX() / mapping_scheme.GetNumThreadsX();-  return b->CreateMul(thread_id_x,-                      llvm::ConstantInt::get(index_ty, x_num_steps));+  return b->CreateMul(thread_id_x, constant(x_num_steps));+}++auto Unroll(bool add_index_boundary_condition, int64 x_num_steps, int64 step_x,+            int64 vector_size, const string& loop_name,+            KernelSupportLibrary* ksl, llvm::Value* start_offset_x,+            llvm::Value* y_loc, llvm::Value* tile_width,+            IrArray::Index& source_idx, llvm::IRBuilder<>& b_,+            const IrEmitterUnnested::EmitElementFunction* emit_elem_function) {+  llvm::Type* index_ty = tile_width->getType();+  auto constant = [&](int64 val) {+    return llvm::ConstantInt::get(index_ty, val);+  };+  for (int j = 0; j < x_num_steps / vector_size; j++) {+    for (int i = 0; i < vector_size; i++) {+      int linear_index = j * vector_size + i;+      llvm::Value* x_loc = b_.CreateAdd(constant(j * step_x * vector_size + i),+                                        start_offset_x, "x_loc");+      IrArray::Index source_idx_x =+          source_idx.AddOffsetToDim(y_loc, kDimY, &b_)+              .AddOffsetToDim(constant(j * step_x * vector_size + i), kDimX,+                              &b_);+      auto emit_element = [&] {+        return (*emit_elem_function)(source_idx_x, y_loc, x_loc, linear_index);

You don't need to explicitly dereference the function pointer, emit_elem_function(...) will do.

nouiz

comment created time in 3 months

pull request commenttensorflow/tensorflow

[XLA] Compare only the filename, not the full path.

Could you give more context for this change? Is it possible to write a test?

nouiz

comment created time in 3 months

Pull request review commenttensorflow/tensorflow

[XLA] follow-up on GPU-deterministic reductions

 Status GpuCompiler::PrepareHloModuleForIrEmitting(HloModule* hlo_module) { // TODO(cheshire): Duplication with gpu_conv_algorithm picker, figure out a // right way to share this. static bool RequireDeterminism() {-  bool deterministic_ops = false;-  TF_CHECK_OK(tensorflow::ReadBoolFromEnvVar("TF_DETERMINISTIC_OPS",-                                             /*default_val=*/false,-                                             &deterministic_ops));-  return deterministic_ops;+  static bool require_determinism = [] {

Is it too much to ask to extract a common function with gpu_conv_algorithm_picker while we're at it?

duncanriach

comment created time in 3 months

Pull request review commenttensorflow/tensorflow

[XLA] vectorize row reduction for even row size

 ReductionCodegenInfo IrEmitterUnnested::ComputeReductionCodegenInfo(
     return kWarpSize;
   }();
 
+  int tile_size_x = reduction_tiling[2] * num_threads_x;
+
+  int vector_size = 1;
+  if (indexing_order == kLinearStridedIndexingX) {
+    if (reduction_dimensions.dimensions[2] % 2 == 0 &&
+        // Assuming XLA will perform the unrolling and LLVM will vectorize,
+        // disable the unroll for case that LLVM doesn't vectorize.
+        !MayPreventVectorization(*unnested_hlo)) {

Is this why the change to MayPreventVectorization was necessary? Or is that for fused computations as well?

nouiz

comment created time in 3 months

Pull request review commenttensorflow/tensorflow

[XLA] vectorize row reduction for even row size

 bool MayPreventVectorization(const HloInstruction& hlo) {
                               case HloOpcode::kPower:
                               case HloOpcode::kAtan2:
                                 return true;
+                              case HloOpcode::kReduce:

I would still just remove these two lines

nouiz

comment created time in 3 months

Pull request review commenttensorflow/tensorflow

[XLA] vectorize row reduction for even row size

 bool MayPreventVectorization(const HloInstruction& hlo) {
       default:
         return false;
     }
+  } else if (hlo.opcode() == HloOpcode::kReduce) {
+    return false;

Maybe let's make it more specific: state what kind of reductions still prevent vectorization.

nouiz

comment created time in 3 months

Pull request review commenttensorflow/tensorflow

[XLA] vectorize row reduction for even row size

 void IrEmitterUnnested::EmitTile(   //   // TODO(cheshire): Once ptxas is fixed and TF switches to it, remove the   // workaround.-  ksl->For(loop_name + "_y_in_tile",-           /*start=*/constant(0),-           /*end=*/-           ceil_of_ratio(b_.CreateSub(tile_height, thread_id_info.thread_id_y),-                         num_threads_y),-           /*step=*/constant(1), [&](llvm::Value* y_indvar) {-             llvm::Value* y_loc =-                 b_.CreateAdd(thread_id_info.thread_id_y,-                              b_.CreateMul(y_indvar, num_threads_y));-             for (int64 j = 0; j < x_num_steps; j++) {-               llvm::Value* x_loc =-                   b_.CreateAdd(constant(j * step_x), start_offset_x, "x_loc");-               IrArray::Index source_idx_x =-                   source_idx.AddOffsetToDim(y_loc, kDimY, &b_)-                       .AddOffsetToDim(constant(j * step_x), kDimX, &b_);-               auto emit_element = [&] {-                 return emit_elem_function(source_idx_x, y_loc, x_loc, j);-               };-               if (!x_tile_fits) {-                 ksl->If(loop_name + "_x_in_tile",-                         b_.CreateICmpULT(x_loc, tile_width), emit_element);-               } else {-                 emit_element();-               }-             }-           });+  ksl->For(+      loop_name + "_y_in_tile",+      /*start=*/constant(0),+      /*end=*/+      ceil_of_ratio(b_.CreateSub(tile_height, thread_id_info.thread_id_y),+                    num_threads_y),+      /*step=*/constant(1), [&](llvm::Value* y_indvar) {+        llvm::Value* y_loc = b_.CreateAdd(+            thread_id_info.thread_id_y, b_.CreateMul(y_indvar, num_threads_y));+        auto unroll = [&](bool add_index_boundary_condition) {+          for (int j = 0; j < x_num_steps / vector_size; j++) {+            for (int i = 0; i < vector_size; i++) {+              int linear_index = j * vector_size + i;+              llvm::Value* x_loc =+                  b_.CreateAdd(constant(j * step_x * vector_size + i),+                               start_offset_x, "x_loc");+              IrArray::Index source_idx_x =+                  source_idx.AddOffsetToDim(y_loc, kDimY, &b_)+                      .AddOffsetToDim(constant(j * step_x * vector_size + i),+                                      kDimX, &b_);+              auto emit_element = [&] {+                return emit_elem_function(source_idx_x, y_loc, x_loc,+                                          linear_index);+              };+              if (add_index_boundary_condition) {+                ksl->If(loop_name + "_x_in_tile",+                        b_.CreateICmpULT(x_loc, tile_width), emit_element);+              } else {+                emit_element();+              }+            }+          }+        };++        if (!x_tile_fits &&+            mapping_scheme.GetIndexingOrder() == kLinearStridedIndexingX) {+          // Only try this path when we try to vectorize the loads.

The comment reads better above the if

nouiz

comment created time in 3 months

Pull request review commenttensorflow/tensorflow

[XLA] vectorize row reduction for even row size

 static llvm::Value* GetStartOffsetX(const KernelMappingScheme& mapping_scheme,                                     llvm::Value* thread_id_x,                                     llvm::Type* index_ty,                                     llvm::IRBuilder<>* b) {-  if (mapping_scheme.DilatedX()) {+  if (mapping_scheme.GetIndexingOrder() == kStridedIndexingX) {     return thread_id_x;+  } else if (mapping_scheme.GetIndexingOrder() == kLinearStridedIndexingX) {+    int vector_size = mapping_scheme.GetVectorSize();+    return b->CreateMul(thread_id_x,+                        llvm::ConstantInt::get(index_ty, vector_size));   }   int64 x_num_steps =

I mean, you are assuming that the indexing order has to be linear, so it's useful to check for that.

nouiz

comment created time in 3 months

Pull request review commenttensorflow/tensorflow

[XLA] vectorize row reduction for even row size

 void IrEmitterUnnested::EmitTile(   //   // TODO(cheshire): Once ptxas is fixed and TF switches to it, remove the   // workaround.-  ksl->For(loop_name + "_y_in_tile",-           /*start=*/constant(0),-           /*end=*/-           ceil_of_ratio(b_.CreateSub(tile_height, thread_id_info.thread_id_y),-                         num_threads_y),-           /*step=*/constant(1), [&](llvm::Value* y_indvar) {-             llvm::Value* y_loc =-                 b_.CreateAdd(thread_id_info.thread_id_y,-                              b_.CreateMul(y_indvar, num_threads_y));-             for (int64 j = 0; j < x_num_steps; j++) {-               llvm::Value* x_loc =-                   b_.CreateAdd(constant(j * step_x), start_offset_x, "x_loc");-               IrArray::Index source_idx_x =-                   source_idx.AddOffsetToDim(y_loc, kDimY, &b_)-                       .AddOffsetToDim(constant(j * step_x), kDimX, &b_);-               auto emit_element = [&] {-                 return emit_elem_function(source_idx_x, y_loc, x_loc, j);-               };-               if (!x_tile_fits) {-                 ksl->If(loop_name + "_x_in_tile",-                         b_.CreateICmpULT(x_loc, tile_width), emit_element);-               } else {-                 emit_element();-               }-             }-           });+  ksl->For(+      loop_name + "_y_in_tile",+      /*start=*/constant(0),+      /*end=*/+      ceil_of_ratio(b_.CreateSub(tile_height, thread_id_info.thread_id_y),+                    num_threads_y),+      /*step=*/constant(1), [&](llvm::Value* y_indvar) {+        llvm::Value* y_loc = b_.CreateAdd(+            thread_id_info.thread_id_y, b_.CreateMul(y_indvar, num_threads_y));+        auto unroll = [&](bool add_index_boundary_condition) {

Please refactor into a separate function.

nouiz

comment created time in 3 months

Pull request review commenttensorflow/tensorflow

[XLA] vectorize row reduction for even row size

 ReductionCodegenInfo IrEmitterUnnested::ComputeReductionCodegenInfo(       !IsUnrollingColumnReductionBeneficial(unnested_hlo, input_shape,                                             reduction_dimensions.dimensions[2]); -  if (!dilated_x && !reduction_dimensions.is_row_reduction) {+  KernelMappingScheme::IndexingOrder indexing_order;+  if (reduction_dimensions.is_row_reduction &&+      // Only try to vectorize+coales memory access for row of even size.+      // For odd row size, every other row isn't aligned, so can't be+      // vectorized.+      reduction_dimensions.dimensions[2] % 2 == 0) {+    indexing_order = kLinearStridedIndexingX;+  } else if (IsUnrollingColumnReductionBeneficial(+                 unnested_hlo, input_shape,+                 reduction_dimensions.dimensions[2])) {+    indexing_order = kLinearIndexingX;+  } else {+    indexing_order = kStridedIndexingX;+  }++  if (indexing_order == kLinearIndexingX &&+      !reduction_dimensions.is_row_reduction) {     // Vectorized loads: a single thread reduces two adjacent columns.

But not all row reductions are vectorized?

nouiz

comment created time in 3 months

Pull request review commenttensorflow/tensorflow

[XLA] vectorize row reduction for even row size

 class KernelMappingScheme {   // Number of threads used to process elements in the Y direction of a tile.   const int64 num_threads_y_; -  // When num_threads_x threads process a total of tile_size_x elements in the-  // X dimension of a tile, each threads process n=tile_size_x/num_threads_x-  // elements. When dilated_x=false, the n elements processed by a thread are-  // contiguous. On the other hand, when dilated_x=true the n elements are-  // dilated by a factor of num_threads_x.-  const bool dilated_x_;+  // When num_threads_x threads process a total of tile_size_x+  // elements in the X dimension of a tile, each threads process+  // n=tile_size_x/num_threads_x elements.+  // indexing_order_ define which tile's elements each thread reads.+  const IndexingOrder indexing_order_;+  // vector_size_ only supported for row reduction.

"Only supported for row reduction, ..."

nouiz

comment created time in 3 months

Pull request review commenttensorflow/tensorflow

[XLA] vectorize row reduction for even row size

 namespace gpu { class KernelMappingScheme {  public:   enum { DimZ = 0, DimY, DimX, DimTot };+  enum IndexingOrder {+  // Threads read consecutive elements.+    LinearIndexingX,+  // Threads read strided elements while keeping memory coalescing.+    StridedIndexingX,+  // Threads read a few consecutive elements then take a strided+  // step. This can trigger vectorized reads and keep memory+  // coalescing.+    LinearStridedIndexingX

Maybe UnrolledStridedIndexingX?

nouiz

comment created time in 3 months

Pull request review commenttensorflow/tensorflow

[XLA] vectorize row reduction for even row size

+/* Copyright 2020 The TensorFlow Authors. All Rights Reserved.++Licensed under the Apache License, Version 2.0 (the "License");+you may not use this file except in compliance with the License.+You may obtain a copy of the License at++    http://www.apache.org/licenses/LICENSE-2.0++Unless required by applicable law or agreed to in writing, software+distributed under the License is distributed on an "AS IS" BASIS,+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.+See the License for the specific language governing permissions and+limitations under the License.+==============================================================================*/++#include <utility>++#include "tensorflow/compiler/xla/service/gpu/gpu_executable.h"+#include "tensorflow/compiler/xla/service/gpu/tests/gpu_codegen_test.h"+#include "tensorflow/compiler/xla/service/hlo_instruction.h"+#include "tensorflow/compiler/xla/service/hlo_module_config.h"+#include "tensorflow/compiler/xla/service/hlo_parser.h"+#include "tensorflow/compiler/xla/statusor.h"+#include "tensorflow/compiler/xla/tests/filecheck.h"+#include "tensorflow/compiler/xla/tests/hlo_test_base.h"+#include "tensorflow/core/lib/core/status_test_util.h"+#include "tensorflow/core/platform/test.h"+#include "tensorflow/stream_executor/lib/statusor.h"++namespace xla {+namespace gpu {++namespace {++class ReductionVectorizationTest : public GpuCodegenTest {+ protected:+  void EnsureDeterminism(absl::string_view hlo_text) {+    std::vector<ExecutionProfile> profiles;+    profiles.emplace_back();+    profiles.emplace_back();+    EXPECT_TRUE(RunMultipleTimes(hlo_text,+                                 /*run_hlo_passes=*/true,+                                 /*profiles=*/&profiles,+                                 /*backend_config=*/"",+                                 /*assert_determinism=*/true));+  }+};++TEST_F(ReductionVectorizationTest, Power2) {+  const char* hlo_text = R"(+HloModule ReducePower2++%max_ {+  %x = f32[] parameter(0)+  %y = f32[] parameter(1)+  ROOT %maximum.7 = f32[] maximum(f32[] %x, f32[] %y)+}++ENTRY %main {+  %param_0 = f32[5,131072] parameter(0)+  %constant.3 = f32[] constant(0)+  ROOT %reduce.8 = f32[5] reduce(f32[5,131072] %param_0, f32[] %constant.3), dimensions={1}, to_apply=%max_+}+)";+  TF_ASSERT_OK_AND_ASSIGN(std::unique_ptr<VerifiedHloModule> optimized_module,+                          ParseAndReturnVerifiedModule(hlo_text));+  CompileAndOptionallyVerifyPtx(std::move(optimized_module),+                                R"(+CHECK: ld.global.nc.v2.f32+CHECK: ld.global.nc.v2.f32+CHECK: ld.global.nc.v2.f32+CHECK: ld.global.nc.v2.f32+)");++  EXPECT_TRUE(RunAndCompare(hlo_text, ErrorSpec{1e-5, 1e-5}));+}++TEST_F(ReductionVectorizationTest, TileFit) {+  const char* hlo_text = R"(+HloModule ReduceTileFit++%max_ {+  %x = f32[] parameter(0)+  %y = f32[] parameter(1)+  ROOT %maximum.7 = f32[] maximum(f32[] %x, f32[] %y)+}++ENTRY %main {+  %param_0 = f32[5,122880] parameter(0)+  %constant.3 = f32[] constant(0)+  ROOT %reduce.8 = f32[5] reduce(f32[5,122880] %param_0, f32[] %constant.3), dimensions={1}, to_apply=%max_+}+)";+  TF_ASSERT_OK_AND_ASSIGN(std::unique_ptr<VerifiedHloModule> optimized_module,+                          ParseAndReturnVerifiedModule(hlo_text));+  CompileAndOptionallyVerifyPtx(std::move(optimized_module),+                                R"(+// TODO: Make this a vectorized load

OK, I understand. Usually we have a tracking bug or a Googler owning each TODO, but neither makes much sense here yet.

nouiz

comment created time in 3 months

Pull request review commenttensorflow/tensorflow

[XLA] vectorize row reduction for even row size

 void IrEmitterUnnested::EmitTile(   //   // TODO(cheshire): Once ptxas is fixed and TF switches to it, remove the   // workaround.-  ksl->For(loop_name + "_y_in_tile",-           /*start=*/constant(0),-           /*end=*/-           ceil_of_ratio(b_.CreateSub(tile_height, thread_id_info.thread_id_y),-                         num_threads_y),-           /*step=*/constant(1), [&](llvm::Value* y_indvar) {-             llvm::Value* y_loc =-                 b_.CreateAdd(thread_id_info.thread_id_y,-                              b_.CreateMul(y_indvar, num_threads_y));-             for (int64 j = 0; j < x_num_steps; j++) {-               llvm::Value* x_loc =-                   b_.CreateAdd(constant(j * step_x), start_offset_x, "x_loc");-               IrArray::Index source_idx_x =-                   source_idx.AddOffsetToDim(y_loc, kDimY, &b_)-                       .AddOffsetToDim(constant(j * step_x), kDimX, &b_);-               auto emit_element = [&] {-                 return emit_elem_function(source_idx_x, y_loc, x_loc, j);-               };-               if (!x_tile_fits) {-                 ksl->If(loop_name + "_x_in_tile",-                         b_.CreateICmpULT(x_loc, tile_width), emit_element);-               } else {-                 emit_element();-               }-             }-           });+  int vector_size = mapping_scheme.GetVectorSize();

Maybe int64 as well for consistency, and group the declaration together with step_x

nouiz

comment created time in 3 months

Pull request review commenttensorflow/tensorflow

[XLA] vectorize row reduction for even row size

+/* Copyright 2020 The TensorFlow Authors. All Rights Reserved.++Licensed under the Apache License, Version 2.0 (the "License");+you may not use this file except in compliance with the License.+You may obtain a copy of the License at++    http://www.apache.org/licenses/LICENSE-2.0++Unless required by applicable law or agreed to in writing, software+distributed under the License is distributed on an "AS IS" BASIS,+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.+See the License for the specific language governing permissions and+limitations under the License.+==============================================================================*/++#include <utility>++#include "tensorflow/compiler/xla/service/gpu/gpu_executable.h"+#include "tensorflow/compiler/xla/service/gpu/tests/gpu_codegen_test.h"+#include "tensorflow/compiler/xla/service/hlo_instruction.h"+#include "tensorflow/compiler/xla/service/hlo_module_config.h"+#include "tensorflow/compiler/xla/service/hlo_parser.h"+#include "tensorflow/compiler/xla/statusor.h"+#include "tensorflow/compiler/xla/tests/filecheck.h"+#include "tensorflow/compiler/xla/tests/hlo_test_base.h"+#include "tensorflow/core/lib/core/status_test_util.h"+#include "tensorflow/core/platform/test.h"+#include "tensorflow/stream_executor/lib/statusor.h"++namespace xla {+namespace gpu {++namespace {++class ReductionVectorizationTest : public GpuCodegenTest {+ protected:+  void EnsureDeterminism(absl::string_view hlo_text) {+    std::vector<ExecutionProfile> profiles;+    profiles.emplace_back();+    profiles.emplace_back();+    EXPECT_TRUE(RunMultipleTimes(hlo_text,+                                 /*run_hlo_passes=*/true,+                                 /*profiles=*/&profiles,+                                 /*backend_config=*/"",+                                 /*assert_determinism=*/true));+  }+};++TEST_F(ReductionVectorizationTest, Power2) {+  const char* hlo_text = R"(+HloModule ReducePower2++%max_ {+  %x = f32[] parameter(0)+  %y = f32[] parameter(1)+  ROOT %maximum.7 = f32[] maximum(f32[] %x, f32[] %y)+}++ENTRY %main {+  %param_0 = f32[5,131072] parameter(0)+  %constant.3 = f32[] constant(0)+  ROOT %reduce.8 = f32[5] reduce(f32[5,131072] %param_0, f32[] %constant.3), dimensions={1}, to_apply=%max_+}+)";+  TF_ASSERT_OK_AND_ASSIGN(std::unique_ptr<VerifiedHloModule> optimized_module,+                          ParseAndReturnVerifiedModule(hlo_text));+  CompileAndOptionallyVerifyPtx(std::move(optimized_module),+                                R"(+CHECK: ld.global.nc.v2.f32+CHECK: ld.global.nc.v2.f32+CHECK: ld.global.nc.v2.f32+CHECK: ld.global.nc.v2.f32+)");++  EXPECT_TRUE(RunAndCompare(hlo_text, ErrorSpec{1e-5, 1e-5}));+}++TEST_F(ReductionVectorizationTest, TileFit) {+  const char* hlo_text = R"(+HloModule ReduceTileFit++%max_ {+  %x = f32[] parameter(0)+  %y = f32[] parameter(1)+  ROOT %maximum.7 = f32[] maximum(f32[] %x, f32[] %y)+}++ENTRY %main {+  %param_0 = f32[5,122880] parameter(0)+  %constant.3 = f32[] constant(0)+  ROOT %reduce.8 = f32[5] reduce(f32[5,122880] %param_0, f32[] %constant.3), dimensions={1}, to_apply=%max_+}+)";+  TF_ASSERT_OK_AND_ASSIGN(std::unique_ptr<VerifiedHloModule> optimized_module,+                          ParseAndReturnVerifiedModule(hlo_text));+  CompileAndOptionallyVerifyPtx(std::move(optimized_module),+                                R"(+// TODO: Make this a vectorized load+CHECK: ld.global.nc.f32+CHECK: ld.global.nc.f32+CHECK: ld.global.nc.v2.f32+CHECK: ld.global.nc.v2.f32+CHECK: ld.global.nc.v2.f32+)");++  
EXPECT_TRUE(RunAndCompare(hlo_text, ErrorSpec{1e-5, 1e-5}));+}++TEST_F(ReductionVectorizationTest, EvenColums) {

Columns

nouiz

comment created time in 3 months

Pull request review commenttensorflow/tensorflow

[XLA] vectorize row reduction for even row size

+/* Copyright 2020 The TensorFlow Authors. All Rights Reserved.++Licensed under the Apache License, Version 2.0 (the "License");+you may not use this file except in compliance with the License.+You may obtain a copy of the License at++    http://www.apache.org/licenses/LICENSE-2.0++Unless required by applicable law or agreed to in writing, software+distributed under the License is distributed on an "AS IS" BASIS,+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.+See the License for the specific language governing permissions and+limitations under the License.+==============================================================================*/++#include <utility>++#include "tensorflow/compiler/xla/service/gpu/gpu_executable.h"+#include "tensorflow/compiler/xla/service/gpu/tests/gpu_codegen_test.h"+#include "tensorflow/compiler/xla/service/hlo_instruction.h"+#include "tensorflow/compiler/xla/service/hlo_module_config.h"+#include "tensorflow/compiler/xla/service/hlo_parser.h"+#include "tensorflow/compiler/xla/statusor.h"+#include "tensorflow/compiler/xla/tests/filecheck.h"+#include "tensorflow/compiler/xla/tests/hlo_test_base.h"+#include "tensorflow/core/lib/core/status_test_util.h"+#include "tensorflow/core/platform/test.h"+#include "tensorflow/stream_executor/lib/statusor.h"++namespace xla {+namespace gpu {++namespace {++class ReductionVectorizationTest : public GpuCodegenTest {+ protected:+  void EnsureDeterminism(absl::string_view hlo_text) {+    std::vector<ExecutionProfile> profiles;+    profiles.emplace_back();+    profiles.emplace_back();+    EXPECT_TRUE(RunMultipleTimes(hlo_text,+                                 /*run_hlo_passes=*/true,+                                 /*profiles=*/&profiles,+                                 /*backend_config=*/"",+                                 /*assert_determinism=*/true));+  }+};++TEST_F(ReductionVectorizationTest, Power2) {+  const char* hlo_text = R"(+HloModule ReducePower2++%max_ {+  %x = f32[] parameter(0)+  %y = f32[] parameter(1)+  ROOT %maximum.7 = f32[] maximum(f32[] %x, f32[] %y)+}++ENTRY %main {+  %param_0 = f32[5,131072] parameter(0)+  %constant.3 = f32[] constant(0)+  ROOT %reduce.8 = f32[5] reduce(f32[5,131072] %param_0, f32[] %constant.3), dimensions={1}, to_apply=%max_+}+)";+  TF_ASSERT_OK_AND_ASSIGN(std::unique_ptr<VerifiedHloModule> optimized_module,+                          ParseAndReturnVerifiedModule(hlo_text));+  CompileAndOptionallyVerifyPtx(std::move(optimized_module),+                                R"(+CHECK: ld.global.nc.v2.f32+CHECK: ld.global.nc.v2.f32+CHECK: ld.global.nc.v2.f32+CHECK: ld.global.nc.v2.f32+)");++  EXPECT_TRUE(RunAndCompare(hlo_text, ErrorSpec{1e-5, 1e-5}));+}++TEST_F(ReductionVectorizationTest, TileFit) {+  const char* hlo_text = R"(+HloModule ReduceTileFit++%max_ {+  %x = f32[] parameter(0)+  %y = f32[] parameter(1)+  ROOT %maximum.7 = f32[] maximum(f32[] %x, f32[] %y)+}++ENTRY %main {+  %param_0 = f32[5,122880] parameter(0)+  %constant.3 = f32[] constant(0)+  ROOT %reduce.8 = f32[5] reduce(f32[5,122880] %param_0, f32[] %constant.3), dimensions={1}, to_apply=%max_+}+)";+  TF_ASSERT_OK_AND_ASSIGN(std::unique_ptr<VerifiedHloModule> optimized_module,+                          ParseAndReturnVerifiedModule(hlo_text));+  CompileAndOptionallyVerifyPtx(std::move(optimized_module),+                                R"(+// TODO: Make this a vectorized load+CHECK: ld.global.nc.f32+CHECK: ld.global.nc.f32+CHECK: ld.global.nc.v2.f32+CHECK: ld.global.nc.v2.f32+CHECK: ld.global.nc.v2.f32+)");++  
EXPECT_TRUE(RunAndCompare(hlo_text, ErrorSpec{1e-5, 1e-5}));+}++TEST_F(ReductionVectorizationTest, EvenColums) {+  const char* hlo_text = R"(+HloModule ReducePower2++%max_ {+  %x = f32[] parameter(0)+  %y = f32[] parameter(1)+  ROOT %maximum.7 = f32[] maximum(f32[] %x, f32[] %y)+}++ENTRY %main {+  %param_0 = f32[5,131070] parameter(0)+  %constant.3 = f32[] constant(0)+  ROOT %reduce.8 = f32[5] reduce(f32[5,131070] %param_0, f32[] %constant.3), dimensions={1}, to_apply=%max_+}+)";+  TF_ASSERT_OK_AND_ASSIGN(std::unique_ptr<VerifiedHloModule> optimized_module,+                          ParseAndReturnVerifiedModule(hlo_text));+  CompileAndOptionallyVerifyPtx(std::move(optimized_module),+                                R"(+CHECK: ld.global.nc.f32+CHECK: ld.global.nc.f32+// TODO: Make this a vectorized load+CHECK-NOT: ld.global.nc.v2.f32+CHECK: ld.global.nc.v2.f32+CHECK: ld.global.nc.v2.f32+CHECK-NOT: ld.global.nc.v2.f32+// TODO: find how to modify LLVM/opt to get this merged? vectorize before some loop opt?

Not replied yet.

nouiz

comment created time in 3 months

Pull request review commenttensorflow/tensorflow

[XLA] vectorize row reduction for even row size

 ReductionCodegenInfo IrEmitterUnnested::ComputeReductionCodegenInfo(       !IsUnrollingColumnReductionBeneficial(unnested_hlo, input_shape,                                             reduction_dimensions.dimensions[2]); -  if (!dilated_x && !reduction_dimensions.is_row_reduction) {+  KernelMappingScheme::IndexingOrder indexing_order;

Use the lambda trick:

IndexingOrder indexing_order = [&] {
  ...
  return kLinearStridedIndexingX;
}();
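For reference, a self-contained version of the pattern; the boolean conditions below are placeholders, not the exact predicates used in ComputeReductionCodegenInfo:

#include <cstdio>

enum IndexingOrder { kLinearIndexingX, kStridedIndexingX, kLinearStridedIndexingX };

int main() {
  // Placeholder inputs standing in for the reduction properties.
  bool is_row_reduction = true;
  bool row_size_is_even = true;
  bool column_unrolling_beneficial = false;

  // Immediately-invoked lambda: the variable can be const and every branch
  // must return a value, so no case is silently left unassigned.
  const IndexingOrder indexing_order = [&] {
    if (is_row_reduction && row_size_is_even) {
      return kLinearStridedIndexingX;
    }
    if (column_unrolling_beneficial) {
      return kLinearIndexingX;
    }
    return kStridedIndexingX;
  }();

  std::printf("indexing_order=%d\n", static_cast<int>(indexing_order));
  return 0;
}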
nouiz

comment created time in 3 months

Pull request review commenttensorflow/tensorflow

[XLA] vectorize row reduction for even row size

 namespace gpu { class KernelMappingScheme {  public:   enum { DimZ = 0, DimY, DimX, DimTot };+  enum IndexingOrder {+  // Threads read consecutive elements.

Comment should be aligned with the variable name.

nouiz

comment created time in 3 months

Pull request review commenttensorflow/tensorflow

[XLA] vectorize row reduction for even row size

 void IrEmitterUnnested::EmitTile(   //   // TODO(cheshire): Once ptxas is fixed and TF switches to it, remove the   // workaround.-  ksl->For(loop_name + "_y_in_tile",-           /*start=*/constant(0),-           /*end=*/-           ceil_of_ratio(b_.CreateSub(tile_height, thread_id_info.thread_id_y),-                         num_threads_y),-           /*step=*/constant(1), [&](llvm::Value* y_indvar) {-             llvm::Value* y_loc =-                 b_.CreateAdd(thread_id_info.thread_id_y,-                              b_.CreateMul(y_indvar, num_threads_y));-             for (int64 j = 0; j < x_num_steps; j++) {-               llvm::Value* x_loc =-                   b_.CreateAdd(constant(j * step_x), start_offset_x, "x_loc");-               IrArray::Index source_idx_x =-                   source_idx.AddOffsetToDim(y_loc, kDimY, &b_)-                       .AddOffsetToDim(constant(j * step_x), kDimX, &b_);-               auto emit_element = [&] {-                 return emit_elem_function(source_idx_x, y_loc, x_loc, j);-               };-               if (!x_tile_fits) {-                 ksl->If(loop_name + "_x_in_tile",-                         b_.CreateICmpULT(x_loc, tile_width), emit_element);-               } else {-                 emit_element();-               }-             }-           });+  int vector_size = mapping_scheme.GetVectorSize();+  ksl->For(+      loop_name + "_y_in_tile",+      /*start=*/constant(0),+      /*end=*/+      ceil_of_ratio(b_.CreateSub(tile_height, thread_id_info.thread_id_y),+                    num_threads_y),+      /*step=*/constant(1), [&](llvm::Value* y_indvar) {+        llvm::Value* y_loc = b_.CreateAdd(+            thread_id_info.thread_id_y, b_.CreateMul(y_indvar, num_threads_y));+        auto unroll = [&](bool add_index_boundary_condition,+                          int64 vector_size) {

Why not just take vector_size from the captured environment?
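In other words, roughly this shape (toy sketch, not the emitter code; since the lambda already captures by reference, vector_size is visible inside it without being passed as a parameter):

#include <cstdio>

int main() {
  int vector_size = 2;  // already computed in the enclosing scope
  int x_num_steps = 8;

  // [&] captures vector_size and x_num_steps by reference, so the lambda
  // does not need its own vector_size parameter.
  auto unroll = [&](bool add_index_boundary_condition) {
    for (int j = 0; j < x_num_steps / vector_size; ++j) {
      std::printf("j=%d bounded=%d\n", j, add_index_boundary_condition ? 1 : 0);
    }
  };

  unroll(/*add_index_boundary_condition=*/false);
  return 0;
}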

nouiz

comment created time in 3 months

Pull request review commenttensorflow/tensorflow

[XLA] vectorize row reduction for even row size

 namespace gpu { class KernelMappingScheme {  public:   enum { DimZ = 0, DimY, DimX, DimTot };+  enum IndexingOrder {+  // Threads read consecutive elements.

Also I think it's more readable to write "Thread reads consecutive elements", here and below, even though there are multiple threads.

nouiz

comment created time in 3 months

Pull request review commenttensorflow/tensorflow

[XLA] vectorize row reduction for even row size

 static llvm::Value* GetStartOffsetX(const KernelMappingScheme& mapping_scheme,                                     llvm::Value* thread_id_x,                                     llvm::Type* index_ty,                                     llvm::IRBuilder<>* b) {-  if (mapping_scheme.DilatedX()) {+  if (mapping_scheme.GetIndexingOrder() == kStridedIndexingX) {     return thread_id_x;+  } else if (mapping_scheme.GetIndexingOrder() == kLinearStridedIndexingX) {+    int vector_size = mapping_scheme.GetVectorSize();

Nit: I would extract a lambda for creating constants, since this is used twice here already, and then inline the vector_size in the call.
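Roughly like this (sketch of just this branch, not the full function; llvm::ConstantInt::get and CreateMul are used the same way as in the diff above):

#include <cstdint>

#include "llvm/IR/Constants.h"
#include "llvm/IR/IRBuilder.h"

// Sketch: a local helper so constant creation is spelled once; the same
// helper would also serve the x_num_steps constant in the last branch.
llvm::Value* LinearStridedStartOffsetX(llvm::Value* thread_id_x,
                                       llvm::Type* index_ty,
                                       int64_t vector_size,
                                       llvm::IRBuilder<>* b) {
  auto constant = [&](int64_t val) -> llvm::Constant* {
    return llvm::ConstantInt::get(index_ty, val);
  };
  return b->CreateMul(thread_id_x, constant(vector_size));
}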

nouiz

comment created time in 3 months

Pull request review commenttensorflow/tensorflow

[XLA] vectorize row reduction for even row size

 ReductionCodegenInfo IrEmitterUnnested::ComputeReductionCodegenInfo(     return kWarpSize;   }(); +  int tile_size_x = reduction_tiling[2] * num_threads_x;++  int vector_size = 1;+  if (indexing_order == kLinearStridedIndexingX) {+    if (reduction_dimensions.dimensions[2] % 2 == 0 &&+        // As XLA unroll and suppose LLVM will vectorize,

"Assuming XLA will perform the unrolling and LLVM will vectorize" ?

nouiz

comment created time in 3 months

Pull request review commenttensorflow/tensorflow

[XLA] vectorize row reduction for even row size

 namespace { bool MayPreventVectorization(const HloInstruction& hlo) {   if (hlo.opcode() == HloOpcode::kFusion) {     return absl::c_any_of(hlo.fused_instructions_computation()->instructions(),-                          [](const HloInstruction* instr) {+                          [&](const HloInstruction* instr) {                             switch (instr->opcode()) {                               case HloOpcode::kReduce:

Since false is default, just removing it from the list would suffice?
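A toy version of the point, with hypothetical opcodes just to show the shape:

#include <cstdio>

enum Opcode { kReduce, kSort, kConcatenate };

// With `default: return false`, simply not listing kReduce has the same
// effect as giving it an explicit `return false` case.
bool MayPreventVectorizationToy(Opcode opcode) {
  switch (opcode) {
    case kSort:
    case kConcatenate:
      return true;  // opcodes that still prevent vectorization in this toy
    default:
      return false;  // everything else, including kReduce, does not
  }
}

int main() {
  std::printf("kReduce prevents vectorization: %d\n",
              MayPreventVectorizationToy(kReduce) ? 1 : 0);
  return 0;
}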

nouiz

comment created time in 3 months

Pull request review commenttensorflow/tensorflow

[XLA] vectorize row reduction for even row size

 void IrEmitterUnnested::EmitTile(   //   // TODO(cheshire): Once ptxas is fixed and TF switches to it, remove the   // workaround.-  ksl->For(loop_name + "_y_in_tile",-           /*start=*/constant(0),-           /*end=*/-           ceil_of_ratio(b_.CreateSub(tile_height, thread_id_info.thread_id_y),-                         num_threads_y),-           /*step=*/constant(1), [&](llvm::Value* y_indvar) {-             llvm::Value* y_loc =-                 b_.CreateAdd(thread_id_info.thread_id_y,-                              b_.CreateMul(y_indvar, num_threads_y));-             for (int64 j = 0; j < x_num_steps; j++) {-               llvm::Value* x_loc =-                   b_.CreateAdd(constant(j * step_x), start_offset_x, "x_loc");-               IrArray::Index source_idx_x =-                   source_idx.AddOffsetToDim(y_loc, kDimY, &b_)-                       .AddOffsetToDim(constant(j * step_x), kDimX, &b_);-               auto emit_element = [&] {-                 return emit_elem_function(source_idx_x, y_loc, x_loc, j);-               };-               if (!x_tile_fits) {-                 ksl->If(loop_name + "_x_in_tile",-                         b_.CreateICmpULT(x_loc, tile_width), emit_element);-               } else {-                 emit_element();-               }-             }-           });+  int vector_size = mapping_scheme.GetVectorSize();+  ksl->For(+      loop_name + "_y_in_tile",+      /*start=*/constant(0),+      /*end=*/+      ceil_of_ratio(b_.CreateSub(tile_height, thread_id_info.thread_id_y),+                    num_threads_y),+      /*step=*/constant(1), [&](llvm::Value* y_indvar) {+        llvm::Value* y_loc = b_.CreateAdd(+            thread_id_info.thread_id_y, b_.CreateMul(y_indvar, num_threads_y));+        auto unroll = [&](bool add_index_boundary_condition,+                          int64 vector_size) {+          for (int64 j = 0; j < x_num_steps / vector_size; j++) {

Also is it possible to have more descriptive names? In particular, for old_j?

nouiz

comment created time in 3 months

Pull request review commenttensorflow/tensorflow

[XLA] vectorize row reduction for even row size

 void IrEmitterUnnested::EmitTile(   //   // TODO(cheshire): Once ptxas is fixed and TF switches to it, remove the   // workaround.-  ksl->For(loop_name + "_y_in_tile",-           /*start=*/constant(0),-           /*end=*/-           ceil_of_ratio(b_.CreateSub(tile_height, thread_id_info.thread_id_y),-                         num_threads_y),-           /*step=*/constant(1), [&](llvm::Value* y_indvar) {-             llvm::Value* y_loc =-                 b_.CreateAdd(thread_id_info.thread_id_y,-                              b_.CreateMul(y_indvar, num_threads_y));-             for (int64 j = 0; j < x_num_steps; j++) {-               llvm::Value* x_loc =-                   b_.CreateAdd(constant(j * step_x), start_offset_x, "x_loc");-               IrArray::Index source_idx_x =-                   source_idx.AddOffsetToDim(y_loc, kDimY, &b_)-                       .AddOffsetToDim(constant(j * step_x), kDimX, &b_);-               auto emit_element = [&] {-                 return emit_elem_function(source_idx_x, y_loc, x_loc, j);-               };-               if (!x_tile_fits) {-                 ksl->If(loop_name + "_x_in_tile",-                         b_.CreateICmpULT(x_loc, tile_width), emit_element);-               } else {-                 emit_element();-               }-             }-           });+  int vector_size = mapping_scheme.GetVectorSize();+  ksl->For(+      loop_name + "_y_in_tile",+      /*start=*/constant(0),+      /*end=*/+      ceil_of_ratio(b_.CreateSub(tile_height, thread_id_info.thread_id_y),+                    num_threads_y),+      /*step=*/constant(1), [&](llvm::Value* y_indvar) {+        llvm::Value* y_loc = b_.CreateAdd(+            thread_id_info.thread_id_y, b_.CreateMul(y_indvar, num_threads_y));+        auto unroll = [&](bool add_if, int64 max_element, int64 vector_size) {+          for (int64 j = 0; j < x_num_steps / vector_size; j++) {+            // Prep some values. 
If we do not do this, LLVM doesn't vectorize.+            llvm::Value* x_loc_base =+                b_.CreateAdd(constant(j * step_x * vector_size), start_offset_x,+                             "x_loc_base");+            IrArray::Index source_idx_x_base =+                source_idx.AddOffsetToDim(y_loc, kDimY, &b_)+                    .AddOffsetToDim(constant(j * step_x * vector_size), kDimX,+                                    &b_);++            for (int i = 0; i < vector_size; i++) {+              int old_j = j * vector_size + i;+              llvm::Value* x_loc = x_loc_base;+              IrArray::Index source_idx_x = source_idx_x_base;+              if (i > 0) {+                x_loc = b_.CreateAdd(constant(i), x_loc_base, "x_loc");+                source_idx_x =+                    source_idx_x_base.AddOffsetToDim(constant(i), kDimX, &b_);+              }+              auto emit_element = [&] {+                return emit_elem_function(source_idx_x, y_loc, x_loc, old_j);+              };+              if (add_if) {+                ksl->If(loop_name + "_x_in_tile",+                        b_.CreateICmpULT(x_loc, tile_width), emit_element);+              } else {+                emit_element();+              }+            }+          }+        };++        if (!x_tile_fits && mapping_scheme.GetIndexingOrder() ==+                                KernelMappingScheme::LinearDilatedIndexingX) {+          // Only try this path when we try to vectorize the loads.++          // Special case when the tile doesn't fit completly for even row size.+          // For odd row size every other row isn't aligned, so can't be+          // vectorized.+          ksl->If(loop_name + "_is_full_tile",+                  // if (block fully fit) {fast path} else {slow path}

Sorry, I'm still a bit confused here: are you saying that when add_index_boundary_condition is false, vectorization does not happen? Do we even need unrolling in that case then?

nouiz

comment created time in 3 months

Pull request review commenttensorflow/tensorflow

[XLA] vectorize row reduction for even row size

 void IrEmitterUnnested::EmitTile(   //   // TODO(cheshire): Once ptxas is fixed and TF switches to it, remove the   // workaround.-  ksl->For(loop_name + "_y_in_tile",-           /*start=*/constant(0),-           /*end=*/-           ceil_of_ratio(b_.CreateSub(tile_height, thread_id_info.thread_id_y),-                         num_threads_y),-           /*step=*/constant(1), [&](llvm::Value* y_indvar) {-             llvm::Value* y_loc =-                 b_.CreateAdd(thread_id_info.thread_id_y,-                              b_.CreateMul(y_indvar, num_threads_y));-             for (int64 j = 0; j < x_num_steps; j++) {-               llvm::Value* x_loc =-                   b_.CreateAdd(constant(j * step_x), start_offset_x, "x_loc");-               IrArray::Index source_idx_x =-                   source_idx.AddOffsetToDim(y_loc, kDimY, &b_)-                       .AddOffsetToDim(constant(j * step_x), kDimX, &b_);-               auto emit_element = [&] {-                 return emit_elem_function(source_idx_x, y_loc, x_loc, j);-               };-               if (!x_tile_fits) {-                 ksl->If(loop_name + "_x_in_tile",-                         b_.CreateICmpULT(x_loc, tile_width), emit_element);-               } else {-                 emit_element();-               }-             }-           });+  int vector_size = mapping_scheme.GetVectorSize();+  ksl->For(+      loop_name + "_y_in_tile",+      /*start=*/constant(0),+      /*end=*/+      ceil_of_ratio(b_.CreateSub(tile_height, thread_id_info.thread_id_y),+                    num_threads_y),+      /*step=*/constant(1), [&](llvm::Value* y_indvar) {+        llvm::Value* y_loc = b_.CreateAdd(+            thread_id_info.thread_id_y, b_.CreateMul(y_indvar, num_threads_y));+        auto unroll = [&](bool add_index_boundary_condition,+                          int64 vector_size) {+          for (int64 j = 0; j < x_num_steps / vector_size; j++) {

For consistency, let's have both i and j as int?

nouiz

comment created time in 3 months

Pull request review commenttensorflow/tensorflow

[XLA] vectorize row reduction for even row size

 static llvm::Value* GetStartOffsetX(const KernelMappingScheme& mapping_scheme,                                     llvm::Value* thread_id_x,                                     llvm::Type* index_ty,                                     llvm::IRBuilder<>* b) {-  if (mapping_scheme.DilatedX()) {+  if (mapping_scheme.GetIndexingOrder() == kStridedIndexingX) {     return thread_id_x;+  } else if (mapping_scheme.GetIndexingOrder() == kLinearStridedIndexingX) {+    int vector_size = mapping_scheme.GetVectorSize();+    return b->CreateMul(thread_id_x,+                        llvm::ConstantInt::get(index_ty, vector_size));   }   int64 x_num_steps =

Perhaps insert a CHECK on what the indexing order has to be in the third case?

nouiz

comment created time in 3 months

Pull request review commenttensorflow/tensorflow

[XLA] vectorize row reduction for even row size

 namespace { bool MayPreventVectorization(const HloInstruction& hlo) {   if (hlo.opcode() == HloOpcode::kFusion) {     return absl::c_any_of(hlo.fused_instructions_computation()->instructions(),-                          [](const HloInstruction* instr) {+                          [&](const HloInstruction* instr) {                             switch (instr->opcode()) {                               case HloOpcode::kReduce:

Also, could you add a test case affected by the change in this particular function? I'm not sure what would be a good testing boundary; maybe LLVM?

nouiz

comment created time in 3 months

Pull request review commenttensorflow/tensorflow

[XLA] vectorize row reduction for even row size

 bool MayPreventVectorization(const HloInstruction& hlo) {       default:         return false;     }+  } else if (hlo.opcode() == HloOpcode::kReduce) {+    return false;

Do we need to be specific about which reductions are vectorized?

nouiz

comment created time in 3 months

Pull request review commenttensorflow/tensorflow

[XLA] vectorize row reduction for even row size

+/* Copyright 2020 The TensorFlow Authors. All Rights Reserved.++Licensed under the Apache License, Version 2.0 (the "License");+you may not use this file except in compliance with the License.+You may obtain a copy of the License at++    http://www.apache.org/licenses/LICENSE-2.0++Unless required by applicable law or agreed to in writing, software+distributed under the License is distributed on an "AS IS" BASIS,+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.+See the License for the specific language governing permissions and+limitations under the License.+==============================================================================*/++#include <utility>++#include "tensorflow/compiler/xla/service/gpu/gpu_executable.h"+#include "tensorflow/compiler/xla/service/gpu/tests/gpu_codegen_test.h"+#include "tensorflow/compiler/xla/service/hlo_instruction.h"+#include "tensorflow/compiler/xla/service/hlo_module_config.h"+#include "tensorflow/compiler/xla/service/hlo_parser.h"+#include "tensorflow/compiler/xla/statusor.h"+#include "tensorflow/compiler/xla/tests/filecheck.h"+#include "tensorflow/compiler/xla/tests/hlo_test_base.h"+#include "tensorflow/core/lib/core/status_test_util.h"+#include "tensorflow/core/platform/test.h"+#include "tensorflow/stream_executor/lib/statusor.h"++namespace xla {+namespace gpu {++namespace {++class ReductionVectorizationTest : public GpuCodegenTest {+ protected:+  void EnsureDeterminism(absl::string_view hlo_text) {+    std::vector<ExecutionProfile> profiles;+    profiles.emplace_back();+    profiles.emplace_back();+    EXPECT_TRUE(RunMultipleTimes(hlo_text,+                                 /*run_hlo_passes=*/true,+                                 /*profiles=*/&profiles,+                                 /*backend_config=*/"",+                                 /*assert_determinism=*/true));+  }+};++TEST_F(ReductionVectorizationTest, Power2) {+  const char* hlo_text = R"(+HloModule ReducePower2++%max_ {+  %x = f32[] parameter(0)+  %y = f32[] parameter(1)+  ROOT %maximum.7 = f32[] maximum(f32[] %x, f32[] %y)+}++ENTRY %main {+  %param_0 = f32[5,131072] parameter(0)+  %constant.3 = f32[] constant(0)+  ROOT %reduce.8 = f32[5] reduce(f32[5,131072] %param_0, f32[] %constant.3), dimensions={1}, to_apply=%max_+}+)";+  TF_ASSERT_OK_AND_ASSIGN(std::unique_ptr<VerifiedHloModule> optimized_module,+                          ParseAndReturnVerifiedModule(hlo_text));+  CompileAndOptionallyVerifyPtx(std::move(optimized_module),+                                R"(+CHECK: ld.global.nc.v2.f32

OK makes sense.

nouiz

comment created time in 3 months

Pull request review commenttensorflow/tensorflow

[XLA] vectorize row reduction for even row size

+/* Copyright 2020 The TensorFlow Authors. All Rights Reserved.++Licensed under the Apache License, Version 2.0 (the "License");+you may not use this file except in compliance with the License.+You may obtain a copy of the License at++    http://www.apache.org/licenses/LICENSE-2.0++Unless required by applicable law or agreed to in writing, software+distributed under the License is distributed on an "AS IS" BASIS,+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.+See the License for the specific language governing permissions and+limitations under the License.+==============================================================================*/++#include <utility>++#include "tensorflow/compiler/xla/service/gpu/gpu_executable.h"+#include "tensorflow/compiler/xla/service/gpu/tests/gpu_codegen_test.h"+#include "tensorflow/compiler/xla/service/hlo_instruction.h"+#include "tensorflow/compiler/xla/service/hlo_module_config.h"+#include "tensorflow/compiler/xla/service/hlo_parser.h"+#include "tensorflow/compiler/xla/statusor.h"+#include "tensorflow/compiler/xla/tests/filecheck.h"+#include "tensorflow/compiler/xla/tests/hlo_test_base.h"+#include "tensorflow/core/lib/core/status_test_util.h"+#include "tensorflow/core/platform/test.h"+#include "tensorflow/stream_executor/lib/statusor.h"++namespace xla {+namespace gpu {++namespace {++class ReductionVectorizationTest : public GpuCodegenTest {+ protected:+  void EnsureDeterminism(absl::string_view hlo_text) {+    std::vector<ExecutionProfile> profiles;+    profiles.emplace_back();+    profiles.emplace_back();+    EXPECT_TRUE(RunMultipleTimes(hlo_text,+                                 /*run_hlo_passes=*/true,+                                 /*profiles=*/&profiles,+                                 /*backend_config=*/"",+                                 /*assert_determinism=*/true));+  }+};++TEST_F(ReductionVectorizationTest, Power2) {+  const char* hlo_text = R"(+HloModule ReducePower2++%max_ {+  %x = f32[] parameter(0)+  %y = f32[] parameter(1)+  ROOT %maximum.7 = f32[] maximum(f32[] %x, f32[] %y)+}++ENTRY %main {+  %param_0 = f32[5,131072] parameter(0)+  %constant.3 = f32[] constant(0)+  ROOT %reduce.8 = f32[5] reduce(f32[5,131072] %param_0, f32[] %constant.3), dimensions={1}, to_apply=%max_+}+)";+  TF_ASSERT_OK_AND_ASSIGN(std::unique_ptr<VerifiedHloModule> optimized_module,+                          ParseAndReturnVerifiedModule(hlo_text));+  CompileAndOptionallyVerifyPtx(std::move(optimized_module),+                                R"(+CHECK: ld.global.nc.v2.f32+CHECK: ld.global.nc.v2.f32+CHECK: ld.global.nc.v2.f32+CHECK: ld.global.nc.v2.f32+)");++  EXPECT_TRUE(RunAndCompare(hlo_text, ErrorSpec{1e-5, 1e-5}));+}++TEST_F(ReductionVectorizationTest, TileFit) {+  const char* hlo_text = R"(+HloModule ReduceTileFit++%max_ {+  %x = f32[] parameter(0)+  %y = f32[] parameter(1)+  ROOT %maximum.7 = f32[] maximum(f32[] %x, f32[] %y)+}++ENTRY %main {+  %param_0 = f32[5,122880] parameter(0)+  %constant.3 = f32[] constant(0)+  ROOT %reduce.8 = f32[5] reduce(f32[5,122880] %param_0, f32[] %constant.3), dimensions={1}, to_apply=%max_+}+)";+  TF_ASSERT_OK_AND_ASSIGN(std::unique_ptr<VerifiedHloModule> optimized_module,+                          ParseAndReturnVerifiedModule(hlo_text));+  CompileAndOptionallyVerifyPtx(std::move(optimized_module),+                                R"(+// TODO: Make this a vectorized load

What's the TODO item here? Could you expand on what is not working and why?

nouiz

comment created time in 3 months

Pull request review commenttensorflow/tensorflow

[XLA] vectorize row reduction for even row size

 namespace gpu { class KernelMappingScheme {  public:   enum { DimZ = 0, DimY, DimX, DimTot };+  // TODO: rename Dilated to Strided?+  // LinearIndexing mean each thread reads consecutive elements.

"means", "reads", "takes", "conserves", probably more readable if the comment for each enum value is on top of it.

nouiz

comment created time in 3 months

Pull request review commenttensorflow/tensorflow

[XLA] vectorize row reduction for even row size

 void IrEmitterUnnested::EmitTile(   //   // TODO(cheshire): Once ptxas is fixed and TF switches to it, remove the   // workaround.-  ksl->For(loop_name + "_y_in_tile",-           /*start=*/constant(0),-           /*end=*/-           ceil_of_ratio(b_.CreateSub(tile_height, thread_id_info.thread_id_y),-                         num_threads_y),-           /*step=*/constant(1), [&](llvm::Value* y_indvar) {-             llvm::Value* y_loc =-                 b_.CreateAdd(thread_id_info.thread_id_y,-                              b_.CreateMul(y_indvar, num_threads_y));-             for (int64 j = 0; j < x_num_steps; j++) {-               llvm::Value* x_loc =-                   b_.CreateAdd(constant(j * step_x), start_offset_x, "x_loc");-               IrArray::Index source_idx_x =-                   source_idx.AddOffsetToDim(y_loc, kDimY, &b_)-                       .AddOffsetToDim(constant(j * step_x), kDimX, &b_);-               auto emit_element = [&] {-                 return emit_elem_function(source_idx_x, y_loc, x_loc, j);-               };-               if (!x_tile_fits) {-                 ksl->If(loop_name + "_x_in_tile",-                         b_.CreateICmpULT(x_loc, tile_width), emit_element);-               } else {-                 emit_element();-               }-             }-           });+  int vector_size = mapping_scheme.GetVectorSize();+  ksl->For(+      loop_name + "_y_in_tile",+      /*start=*/constant(0),+      /*end=*/+      ceil_of_ratio(b_.CreateSub(tile_height, thread_id_info.thread_id_y),+                    num_threads_y),+      /*step=*/constant(1), [&](llvm::Value* y_indvar) {+        llvm::Value* y_loc = b_.CreateAdd(+            thread_id_info.thread_id_y, b_.CreateMul(y_indvar, num_threads_y));+        auto unroll = [&](bool add_if, int64 max_element, int64 vector_size) {+          for (int64 j = 0; j < x_num_steps / vector_size; j++) {+            // Prep some values. If we do not do this, LLVM doesn't vectorize.+            llvm::Value* x_loc_base =+                b_.CreateAdd(constant(j * step_x * vector_size), start_offset_x,+                             "x_loc_base");+            IrArray::Index source_idx_x_base =+                source_idx.AddOffsetToDim(y_loc, kDimY, &b_)+                    .AddOffsetToDim(constant(j * step_x * vector_size), kDimX,+                                    &b_);++            for (int i = 0; i < vector_size; i++) {+              int old_j = j * vector_size + i;+              llvm::Value* x_loc = x_loc_base;+              IrArray::Index source_idx_x = source_idx_x_base;+              if (i > 0) {

Is the branch worth it here? LLVM will optimize away the addition with zero.

nouiz

comment created time in 3 months

Pull request review commenttensorflow/tensorflow

[XLA] vectorize row reduction for even row size

+/* Copyright 2020 The TensorFlow Authors. All Rights Reserved.++Licensed under the Apache License, Version 2.0 (the "License");+you may not use this file except in compliance with the License.+You may obtain a copy of the License at++    http://www.apache.org/licenses/LICENSE-2.0++Unless required by applicable law or agreed to in writing, software+distributed under the License is distributed on an "AS IS" BASIS,+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.+See the License for the specific language governing permissions and+limitations under the License.+==============================================================================*/++#include <utility>++#include "tensorflow/compiler/xla/service/gpu/gpu_executable.h"+#include "tensorflow/compiler/xla/service/gpu/tests/gpu_codegen_test.h"+#include "tensorflow/compiler/xla/service/hlo_instruction.h"+#include "tensorflow/compiler/xla/service/hlo_module_config.h"+#include "tensorflow/compiler/xla/service/hlo_parser.h"+#include "tensorflow/compiler/xla/statusor.h"+#include "tensorflow/compiler/xla/tests/filecheck.h"+#include "tensorflow/compiler/xla/tests/hlo_test_base.h"+#include "tensorflow/core/lib/core/status_test_util.h"+#include "tensorflow/core/platform/test.h"+#include "tensorflow/stream_executor/lib/statusor.h"++namespace xla {+namespace gpu {++namespace {++class ReductionVectorizationTest : public GpuCodegenTest {

I'm curious what happens in the input fusion case.

nouiz

comment created time in 3 months

Pull request review commenttensorflow/tensorflow

[XLA] vectorize row reduction for even row size

 void IrEmitterUnnested::EmitTile(   //   // TODO(cheshire): Once ptxas is fixed and TF switches to it, remove the   // workaround.-  ksl->For(loop_name + "_y_in_tile",-           /*start=*/constant(0),-           /*end=*/-           ceil_of_ratio(b_.CreateSub(tile_height, thread_id_info.thread_id_y),-                         num_threads_y),-           /*step=*/constant(1), [&](llvm::Value* y_indvar) {-             llvm::Value* y_loc =-                 b_.CreateAdd(thread_id_info.thread_id_y,-                              b_.CreateMul(y_indvar, num_threads_y));-             for (int64 j = 0; j < x_num_steps; j++) {-               llvm::Value* x_loc =-                   b_.CreateAdd(constant(j * step_x), start_offset_x, "x_loc");-               IrArray::Index source_idx_x =-                   source_idx.AddOffsetToDim(y_loc, kDimY, &b_)-                       .AddOffsetToDim(constant(j * step_x), kDimX, &b_);-               auto emit_element = [&] {-                 return emit_elem_function(source_idx_x, y_loc, x_loc, j);-               };-               if (!x_tile_fits) {-                 ksl->If(loop_name + "_x_in_tile",-                         b_.CreateICmpULT(x_loc, tile_width), emit_element);-               } else {-                 emit_element();-               }-             }-           });+  int vector_size = mapping_scheme.GetVectorSize();+  ksl->For(+      loop_name + "_y_in_tile",+      /*start=*/constant(0),+      /*end=*/+      ceil_of_ratio(b_.CreateSub(tile_height, thread_id_info.thread_id_y),+                    num_threads_y),+      /*step=*/constant(1), [&](llvm::Value* y_indvar) {+        llvm::Value* y_loc = b_.CreateAdd(+            thread_id_info.thread_id_y, b_.CreateMul(y_indvar, num_threads_y));+        auto unroll = [&](bool add_if, int64 max_element, int64 vector_size) {+          for (int64 j = 0; j < x_num_steps / vector_size; j++) {+            // Prep some values. 
If we do not do this, LLVM doesn't vectorize.+            llvm::Value* x_loc_base =+                b_.CreateAdd(constant(j * step_x * vector_size), start_offset_x,+                             "x_loc_base");+            IrArray::Index source_idx_x_base =+                source_idx.AddOffsetToDim(y_loc, kDimY, &b_)+                    .AddOffsetToDim(constant(j * step_x * vector_size), kDimX,+                                    &b_);++            for (int i = 0; i < vector_size; i++) {+              int old_j = j * vector_size + i;+              llvm::Value* x_loc = x_loc_base;+              IrArray::Index source_idx_x = source_idx_x_base;+              if (i > 0) {+                x_loc = b_.CreateAdd(constant(i), x_loc_base, "x_loc");+                source_idx_x =+                    source_idx_x_base.AddOffsetToDim(constant(i), kDimX, &b_);+              }+              auto emit_element = [&] {+                return emit_elem_function(source_idx_x, y_loc, x_loc, old_j);+              };+              if (add_if) {+                ksl->If(loop_name + "_x_in_tile",+                        b_.CreateICmpULT(x_loc, tile_width), emit_element);+              } else {+                emit_element();+              }+            }+          }+        };++        if (!x_tile_fits && mapping_scheme.GetIndexingOrder() ==+                                KernelMappingScheme::LinearDilatedIndexingX) {+          // Only try this path when we try to vectorize the loads.++          // Special case when the tile doesn't fit completly for even row size.+          // For odd row size every other row isn't aligned, so can't be+          // vectorized.+          ksl->If(loop_name + "_is_full_tile",+                  // if (block fully fit) {fast path} else {slow path}+                  // tile_width is always exact. For the last block,+                  // it will be the exact number of elements left.+                  b_.CreateICmpEQ(constant(mapping_scheme.GetTileSizeFor(2)),

What's 2 here?

nouiz

comment created time in 3 months

Pull request review commenttensorflow/tensorflow

[XLA] vectorize row reduction for even row size

 bool MaybeLoadPtxFromFile(const HloModule* module, std::string* ptx) {   // and warn when a file is not used to ease catching typo in filename.   std::string prefix = xla::FilenameFor(*module, "", *ptx);   std::string matched_filename;-  for (const string filename :+  for (const string full_filename :

If this change is needed, it should be a separate PR.

nouiz

comment created time in 3 months

Pull request review commenttensorflow/tensorflow

[XLA] vectorize row reduction for even row size

 class KernelMappingScheme {   // Number of threads used to process elements in the Y direction of a tile.   const int64 num_threads_y_; -  // When num_threads_x threads process a total of tile_size_x elements in the-  // X dimension of a tile, each threads process n=tile_size_x/num_threads_x-  // elements. When dilated_x=false, the n elements processed by a thread are-  // contiguous. On the other hand, when dilated_x=true the n elements are-  // dilated by a factor of num_threads_x.-  const bool dilated_x_;+  // When num_threads_x threads process a total of tile_size_x+  // elements in the X dimension of a tile, each threads process+  // n=tile_size_x/num_threads_x elements.+  // indexing_order_ define which tile's elements each thread reads.+  const IndexingOrder indexing_order_;+  // vector_size_ only supported for row reduction.+  // Must be a divisor of tile_sizes_[2]/num_threads_x.+  // Interesting values are 2 and 4 to trigger vectorized loads on GPUs+  // While keeping memory coalescing.

"while"

nouiz

comment created time in 3 months

Pull request review commenttensorflow/tensorflow

[XLA] vectorize row reduction for even row size

+/* Copyright 2020 The TensorFlow Authors. All Rights Reserved.++Licensed under the Apache License, Version 2.0 (the "License");+you may not use this file except in compliance with the License.+You may obtain a copy of the License at++    http://www.apache.org/licenses/LICENSE-2.0++Unless required by applicable law or agreed to in writing, software+distributed under the License is distributed on an "AS IS" BASIS,+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.+See the License for the specific language governing permissions and+limitations under the License.+==============================================================================*/++#include <utility>++#include "tensorflow/compiler/xla/service/gpu/gpu_executable.h"+#include "tensorflow/compiler/xla/service/gpu/tests/gpu_codegen_test.h"+#include "tensorflow/compiler/xla/service/hlo_instruction.h"+#include "tensorflow/compiler/xla/service/hlo_module_config.h"+#include "tensorflow/compiler/xla/service/hlo_parser.h"+#include "tensorflow/compiler/xla/statusor.h"+#include "tensorflow/compiler/xla/tests/filecheck.h"+#include "tensorflow/compiler/xla/tests/hlo_test_base.h"+#include "tensorflow/core/lib/core/status_test_util.h"+#include "tensorflow/core/platform/test.h"+#include "tensorflow/stream_executor/lib/statusor.h"++namespace xla {+namespace gpu {++namespace {++class ReductionVectorizationTest : public GpuCodegenTest {+ protected:+  void EnsureDeterminism(absl::string_view hlo_text) {+    std::vector<ExecutionProfile> profiles;+    profiles.emplace_back();+    profiles.emplace_back();+    EXPECT_TRUE(RunMultipleTimes(hlo_text,+                                 /*run_hlo_passes=*/true,+                                 /*profiles=*/&profiles,+                                 /*backend_config=*/"",+                                 /*assert_determinism=*/true));+  }+};++TEST_F(ReductionVectorizationTest, Power2) {+  const char* hlo_text = R"(+HloModule ReducePower2++%max_ {+  %x = f32[] parameter(0)+  %y = f32[] parameter(1)+  ROOT %maximum.7 = f32[] maximum(f32[] %x, f32[] %y)+}++ENTRY %main {+  %param_0 = f32[5,131072] parameter(0)+  %constant.3 = f32[] constant(0)+  ROOT %reduce.8 = f32[5] reduce(f32[5,131072] %param_0, f32[] %constant.3), dimensions={1}, to_apply=%max_+}+)";+  TF_ASSERT_OK_AND_ASSIGN(std::unique_ptr<VerifiedHloModule> optimized_module,+                          ParseAndReturnVerifiedModule(hlo_text));+  CompileAndOptionallyVerifyPtx(std::move(optimized_module),+                                R"(+CHECK: ld.global.nc.v2.f32+CHECK: ld.global.nc.v2.f32+CHECK: ld.global.nc.v2.f32+CHECK: ld.global.nc.v2.f32+)");++  EXPECT_TRUE(RunAndCompare(hlo_text, ErrorSpec{1e-5, 1e-5}));+}++TEST_F(ReductionVectorizationTest, TileFit) {+  const char* hlo_text = R"(+HloModule ReduceTileFit++%max_ {+  %x = f32[] parameter(0)+  %y = f32[] parameter(1)+  ROOT %maximum.7 = f32[] maximum(f32[] %x, f32[] %y)+}++ENTRY %main {+  %param_0 = f32[5,122880] parameter(0)+  %constant.3 = f32[] constant(0)+  ROOT %reduce.8 = f32[5] reduce(f32[5,122880] %param_0, f32[] %constant.3), dimensions={1}, to_apply=%max_+}+)";+  TF_ASSERT_OK_AND_ASSIGN(std::unique_ptr<VerifiedHloModule> optimized_module,+                          ParseAndReturnVerifiedModule(hlo_text));+  CompileAndOptionallyVerifyPtx(std::move(optimized_module),+                                R"(+// TODO: Make this a vectorized load+CHECK: ld.global.nc.f32+CHECK: ld.global.nc.f32+CHECK: ld.global.nc.v2.f32+CHECK: ld.global.nc.v2.f32+CHECK: ld.global.nc.v2.f32+)");++  
EXPECT_TRUE(RunAndCompare(hlo_text, ErrorSpec{1e-5, 1e-5}));+}++TEST_F(ReductionVectorizationTest, EvenColums) {+  const char* hlo_text = R"(+HloModule ReducePower2++%max_ {+  %x = f32[] parameter(0)+  %y = f32[] parameter(1)+  ROOT %maximum.7 = f32[] maximum(f32[] %x, f32[] %y)+}++ENTRY %main {+  %param_0 = f32[5,131070] parameter(0)+  %constant.3 = f32[] constant(0)+  ROOT %reduce.8 = f32[5] reduce(f32[5,131070] %param_0, f32[] %constant.3), dimensions={1}, to_apply=%max_+}+)";+  TF_ASSERT_OK_AND_ASSIGN(std::unique_ptr<VerifiedHloModule> optimized_module,+                          ParseAndReturnVerifiedModule(hlo_text));+  CompileAndOptionallyVerifyPtx(std::move(optimized_module),+                                R"(+CHECK: ld.global.nc.f32+CHECK: ld.global.nc.f32+// TODO: Make this a vectorized load+CHECK-NOT: ld.global.nc.v2.f32+CHECK: ld.global.nc.v2.f32+CHECK: ld.global.nc.v2.f32+CHECK-NOT: ld.global.nc.v2.f32+// TODO: find how to modify LLVM/opt to get this merged? vectorize before some loop opt?

Again, to me (and to anyone unfamiliar with the code), this comment is cryptic.

nouiz

comment created time in 3 months

Pull request review comment tensorflow/tensorflow

[XLA] vectorize row reduction for even row size

 void IrEmitterUnnested::EmitTile(
   //
   // TODO(cheshire): Once ptxas is fixed and TF switches to it, remove the
   // workaround.
-  ksl->For(loop_name + "_y_in_tile",
-           /*start=*/constant(0),
-           /*end=*/
-           ceil_of_ratio(b_.CreateSub(tile_height, thread_id_info.thread_id_y),
-                         num_threads_y),
-           /*step=*/constant(1), [&](llvm::Value* y_indvar) {
-             llvm::Value* y_loc =
-                 b_.CreateAdd(thread_id_info.thread_id_y,
-                              b_.CreateMul(y_indvar, num_threads_y));
-             for (int64 j = 0; j < x_num_steps; j++) {
-               llvm::Value* x_loc =
-                   b_.CreateAdd(constant(j * step_x), start_offset_x, "x_loc");
-               IrArray::Index source_idx_x =
-                   source_idx.AddOffsetToDim(y_loc, kDimY, &b_)
-                       .AddOffsetToDim(constant(j * step_x), kDimX, &b_);
-               auto emit_element = [&] {
-                 return emit_elem_function(source_idx_x, y_loc, x_loc, j);
-               };
-               if (!x_tile_fits) {
-                 ksl->If(loop_name + "_x_in_tile",
-                         b_.CreateICmpULT(x_loc, tile_width), emit_element);
-               } else {
-                 emit_element();
-               }
-             }
-           });
+  int vector_size = mapping_scheme.GetVectorSize();
+  ksl->For(
+      loop_name + "_y_in_tile",
+      /*start=*/constant(0),
+      /*end=*/
+      ceil_of_ratio(b_.CreateSub(tile_height, thread_id_info.thread_id_y),
+                    num_threads_y),
+      /*step=*/constant(1), [&](llvm::Value* y_indvar) {
+        llvm::Value* y_loc = b_.CreateAdd(
+            thread_id_info.thread_id_y, b_.CreateMul(y_indvar, num_threads_y));
+        auto unroll = [&](bool add_if, int64 max_element, int64 vector_size) {
+          for (int64 j = 0; j < x_num_steps / vector_size; j++) {
+            // Prep some values. If we do not do this, LLVM doesn't vectorize.
+            llvm::Value* x_loc_base =
+                b_.CreateAdd(constant(j * step_x * vector_size), start_offset_x,
+                             "x_loc_base");
+            IrArray::Index source_idx_x_base =
+                source_idx.AddOffsetToDim(y_loc, kDimY, &b_)
+                    .AddOffsetToDim(constant(j * step_x * vector_size), kDimX,
+                                    &b_);
+
+            for (int i = 0; i < vector_size; i++) {
+              int old_j = j * vector_size + i;
+              llvm::Value* x_loc = x_loc_base;
+              IrArray::Index source_idx_x = source_idx_x_base;
+              if (i > 0) {
+                x_loc = b_.CreateAdd(constant(i), x_loc_base, "x_loc");
+                source_idx_x =
+                    source_idx_x_base.AddOffsetToDim(constant(i), kDimX, &b_);
+              }
+              auto emit_element = [&] {
+                return emit_elem_function(source_idx_x, y_loc, x_loc, old_j);
+              };
+              if (add_if) {
+                ksl->If(loop_name + "_x_in_tile",
+                        b_.CreateICmpULT(x_loc, tile_width), emit_element);
+              } else {
+                emit_element();
+              }
+            }
+          }
+        };
+
+        if (!x_tile_fits && mapping_scheme.GetIndexingOrder() ==
+                                KernelMappingScheme::LinearDilatedIndexingX) {
+          // Only try this path when we try to vectorize the loads.
+
+          // Special case when the tile doesn't fit completly for even row size.
+          // For odd row size every other row isn't aligned, so can't be
+          // vectorized.
+          ksl->If(loop_name + "_is_full_tile",
+                  // if (block fully fit) {fast path} else {slow path}
+                  // tile_width is always exact. For the last block,
+                  // it will be the exact number of elements left.
+                  b_.CreateICmpEQ(constant(mapping_scheme.GetTileSizeFor(2)),
+                                  tile_width),
+                  [&] { unroll(false, x_num_steps, vector_size); },

/*add_if=*/
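(For readers skimming the feed: the one-line comment above asks for the bare boolean literal at the call site to be annotated with the argument name, in the same /*arg=*/ style already used for /*start=*/ and /*step=*/ in this diff. A minimal illustrative sketch, using a hypothetical free function Unroll standing in for the unroll lambda from the patch:)

// Illustration only (hypothetical names): annotating a bool literal at the
// call site makes the call self-documenting.
void Unroll(bool add_if, int max_element, int vector_size) {
  // ... emit elements, guarding each one with a bounds check when add_if ...
}

void Caller() {
  // Harder to read: what does "false" mean here without the signature?
  Unroll(false, 8, 2);
  // Clearer, in the style the review comment suggests:
  Unroll(/*add_if=*/false, /*max_element=*/8, /*vector_size=*/2);
}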

nouiz

comment created time in 3 months

Pull request review comment tensorflow/tensorflow

[XLA] vectorize row reduction for even row size

 namespace {
 
 // Returns true if the fusion contains any instruction that is likely
 // translated to complex LLVM IR, such as loops, and prevent vectorization.
-bool MayPreventVectorization(const HloInstruction& hlo) {
+bool MayPreventVectorization(const HloInstruction& hlo,
+                             const bool tolerate_reduce = false) {

Also, could you expand on why we need this parameter? In general, I'd like to avoid boolean parameters as much as possible, as they greatly increase code complexity by doubling the number of possible execution paths.
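(A sketch of one common way to address this kind of concern — not something the comment or the PR prescribes. The names and the signature below are toy simplifications, not the real MayPreventVectorization; the point is only that an enum makes each call site spell out which behavior it wants instead of passing a bare, defaulted bool.)

#include <iostream>

// Hypothetical simplification: an enum instead of a defaulted bool parameter.
enum class ReduceHandling { kDisallow, kTolerate };

// Toy stand-in for the real check; it only illustrates the call-site shape.
bool MayPreventVectorization(bool contains_reduce,
                             ReduceHandling reduce_handling) {
  if (contains_reduce) {
    return reduce_handling == ReduceHandling::kDisallow;
  }
  return false;
}

int main() {
  // The intent is visible without looking up the signature or a default value.
  std::cout << MayPreventVectorization(true, ReduceHandling::kTolerate)
            << std::endl;
  return 0;
}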

nouiz

comment created time in 3 months
