Carlos Eduardo Seo ceseo @IBM São Paulo, Brazil https://www.linkedin.com/in/carlosseo/ Senior Software Engineer @IBM — ML/DNN Open Ecosystem performance optimization and CPU enablement for POWER architecture • @golang compiler contributor

ceseo/glibc 1

My personal glibc repository

ceseo/go 1

The Go programming language

ceseo/libauxv 1

The Auxiliary Vector Library

ceseo/bazel 0

a fast, scalable, multi-language and extensible build system

ceseo/bazelisk 0

A user-friendly launcher for Bazel.

ceseo/binutils-gdb 0

My personal binutils-gdb repository

ceseo/dep 0

Go dependency management tool

ceseo/libxsmm 0

Library targeting Intel Architecture for specialized dense and sparse matrix operations, and deep learning primitives.

ceseo/numpy 0

The fundamental package for scientific computing with Python.

ceseo/pytorch 0

Tensors and Dynamic neural networks in Python with strong GPU acceleration

issue comment golang/go

cmd/compile: ppc64le: Invalid n or b for CLRLSLDI: 30 28

Looking into it.

ianlancetaylor

comment created 2 hours ago

issue comment golang/go

proposal: testing: add CPU name to standard benchmark labels

@martisch please bear in mind that if you want to add processor capabilities in the future (VSX, etc.), you will have to read both AT_HWCAP and AT_HWCAP2 on ppc64x.

martisch

comment created 12 days ago

issue comment golang/go

proposal: testing: add CPU name to standard benchmark labels

For ppc64x, you have to rely on the OS; there is currently no hardware CPU ID exposed to userspace. On Linux, you can get it from /proc/cpuinfo or directly from the auxiliary vector (via the AT_PLATFORM entry; see LD_SHOW_AUXV=1).

martisch

comment created 12 days ago

issue comment xianyi/OpenBLAS

Adding POWER9 to travis CI

No, Travis would suffice for me. And if they are in PowerVS, they will eventually get POWER10.

RajalakshmiSR

comment created 20 days ago

issue comment xianyi/OpenBLAS

Adding POWER9 to travis CI

I think you have to configure a GitHub webhook in the repository settings pointing to the Jenkins server.

RajalakshmiSR

comment created 21 days ago

Pull request review comment pytorch/pytorch

Vsx initial support issue27678

[Diff context: vec256 VSX dispatch header with clamp, fmadd, reinterpret-cast, and int/float convert specializations]

Or conditionally define it in the header.

quickwritereader

comment created a month ago


Pull request review comment pytorch/pytorch

Vsx initial support issue27678


You also need to check for gcc 8.0, because of some of the types you used (vec_float).

quickwritereader

comment created a month ago

Pull request review comment pytorch/pytorch

Vsx initial support issue27678

[Diff context: Vec256<c10::complex<float>> VSX implementation (blend/blendv, loadu/store, map, horizontal_add_permD8)]

Typo: 'horizontal'

quickwritereader

comment created a month ago

Pull request review comment pytorch/pytorch

Vsx initial support issue27678

[Diff context: Vec256<c10::complex<double>> VSX implementation (blend/blendv, loadu/store, complex math ops, horizontal_add/horizontal_sub)]

'horizontal' - typo

quickwritereader

comment created a month ago

Pull request review comment pytorch/pytorch

Vsx initial support issue27678

IF(CMAKE_SYSTEM_NAME MATCHES "Linux")
   message("<FindVSX>")

   EXEC_PROGRAM(LD_SHOW_AUXV=1 ARGS "/bin/true" OUTPUT_VARIABLE bintrue)
   if(bintrue MATCHES "AT_PLATFORM:[ \\t\\n\\r]*([a-zA-Z0-9_]+)[ \\t\\n\\r]*")
    if(CMAKE_MATCH_COUNT GREATER 0)
     string(TOLOWER ${CMAKE_MATCH_1} platform)
     if(${platform} MATCHES "^power")
     message("POWER Platform: ${platform}")
         SET(POWER_COMP TRUE CACHE BOOL "power ")
         SET(CXX_VSX_FLAGS  "${CXX_VSX_FLAGS} -mcpu=${platform} -mtune=${platform}" )
     endif()
   endif()
   endif()
   if(POWER_COMP AND bintrue MATCHES "AT_HWCAP:.*(vsx).*")

What's the minimum CPU version for PyTorch on Power? If it's POWER7 and above, you can assume VSX is present.

quickwritereader

comment created a month ago

Pull request review comment pytorch/pytorch

Vsx initial support issue27678

[Diff context: same Vec256<c10::complex<double>> VSX hunk as in the earlier review comment]
}++  void dump() const {+    std::cout << _vec0[0] << "," << _vec0[1] << ",";+    std::cout << _vec1[0] << "," << _vec1[1] << std::endl;+  }++  Vec256<ComplexDbl> sqrt() const {+    //   sqrt(a + bi)+    // = sqrt(2)/2 * [sqrt(sqrt(a**2 + b**2) + a) + sgn(b)*sqrt(sqrt(a**2 ++    // b**2) - a)i] = sqrt(2)/2 * [sqrt(abs() + a) + sgn(b)*sqrt(abs() - a)i]++    auto sign = *this & vd_isign_mask;+    auto factor = sign | vd_sqrt2_2;+    auto a_a = el_mergee();+    // a_a.dump();+    a_a = a_a ^ vd_isign_mask; // a -a+    auto res_re_im = (abs_() + a_a).elwise_sqrt(); // sqrt(abs + a) sqrt(abs - a)+    return factor.elwise_mult(res_re_im);+  }++  Vec256<ComplexDbl> reciprocal() const {+    // re + im*i = (a + bi)  / (c + di)+    // re = (ac + bd)/abs_2() = c/abs_2()+    // im = (bc - ad)/abs_2() = d/abs_2()+    auto c_d = *this ^ vd_isign_mask; // c       -d+    auto abs = abs_2_();+    return c_d.elwise_div(abs);+  }++  Vec256<ComplexDbl> rsqrt() const {+    return sqrt().reciprocal();+  }++  static Vec256<ComplexDbl> horizontal_add(+      Vec256<ComplexDbl>& first,+      Vec256<ComplexDbl>& second) {+    auto first_perm = first.el_swapped(); // 2perm+    auto second_perm = second.el_swapped(); // 2perm+    // summ+    auto first_ret = first + first_perm; // 2add+    auto second_ret = second + second_perm; // 2 add+    // now lets choose evens+    return el_mergee(first_ret, second_ret); // 2 mergee's+  }++  static Vec256<ComplexDbl> horizontal_sub(+      Vec256<ComplexDbl>& first,+      Vec256<ComplexDbl>& second) {+    // we will simulate it differently with 6 instructions total+    // lets permute second so that we can add it getting horizontall sums+    auto first_perm = first.el_swapped(); // 2perm+    auto second_perm = second.el_swapped(); // 2perm+    // summ+    auto first_ret = first - first_perm; // 2sub+    auto second_ret = second - second_perm; // 2 sub+    // now lets choose evens+    return el_mergee(first_ret, second_ret); // 2 mergee's+  }++  
Vec256<ComplexDbl> inline operator*(const Vec256<ComplexDbl>& b) const {+    //(a + bi)  * (c + di) = (ac - bd) + (ad + bc)i+#if 1+    // this is more vsx friendly than simulating horizontall from x86

Same here.

quickwritereader

comment created time in a month
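As an aside, the product identity quoted in that diff, (a + bi) * (c + di) = (ac - bd) + (ad + bc)i, is easy to sanity-check with a plain scalar sketch — std::complex, no VSX intrinsics, and the `mul` helper name is mine, not PyTorch's:

```cpp
#include <complex>

// Scalar version of the identity the VSX kernel vectorizes:
// (a + bi) * (c + di) = (ac - bd) + (ad + bc)i
std::complex<double> mul(std::complex<double> x, std::complex<double> y) {
    double a = x.real(), b = x.imag();
    double c = y.real(), d = y.imag();
    return {a * c - b * d, a * d + b * c};
}
```

For example, mul({1, 2}, {3, 4}) gives (-5, 10), matching std::complex's own operator*.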

Pull request review comment pytorch/pytorch

Vsx initial support issue27678

[Review diff excerpt from the VSX vec256 header (clamp/fmadd/convert specializations), trimmed to the function under discussion:]

template <>
inline void convert(const int32_t* src, float* dst, int64_t n) {
  // int32_t and float have same size
  size_t N = n;
  size_t NE = N & (~(Vec256<int32_t>::size() - 1));
  for (size_t i = 0; i < NE; i += Vec256<int32_t>::size()) {
    const int32_t* src_a = src + i;
    float* dst_a = dst + i;
    __vi input_vec0 = vec_vsx_ld(offset0, reinterpret_cast<const __vi*>(src_a));
    __vi input_vec1 =
        vec_vsx_ld(offset16, reinterpret_cast<const __vi*>(src_a));
    vec_vsx_st(vec_float(input_vec0), offset0, dst_a);

vec_float was introduced in gcc-8.0. This will require a compiler check.

quickwritereader

comment created time in a month
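A minimal sketch of the kind of compiler-version fence that comment asks for — the AT_HAVE_VEC_FLOAT macro and convert_path helper are hypothetical names for illustration, not PyTorch API:

```cpp
#include <string>

// Hypothetical feature macro: vec_float (vector int32 -> float conversion)
// only exists in GCC 8 and later, so code using it needs a version fence.
// Clang is conservatively excluded here since it defines __GNUC__ too.
#if defined(__GNUC__) && !defined(__clang__) && __GNUC__ >= 8
#define AT_HAVE_VEC_FLOAT 1
#else
#define AT_HAVE_VEC_FLOAT 0
#endif

std::string convert_path() {
    // Select the vec_float fast path when the compiler supports it;
    // otherwise fall back to a per-lane scalar conversion.
    return AT_HAVE_VEC_FLOAT ? "vec_float" : "scalar fallback";
}
```

On a pre-8 GCC the fallback branch would do the int32-to-float conversion lane by lane instead of calling vec_float.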


issue comment xianyi/OpenBLAS

BFloat16 data type naming

The naming was coordinated with and approved by Jack Dongarra, who is the leader of NETLIB. OpenBLAS cannot start inventing names.

OK, but still this 'sh' prefix is quite confusing, and it is totally different with other math libs supporting BFloat16. Any channel to raise a concern to Jack? And any place we can find netlib's spec about BF16 based APIs or naming?

Some clarification: at the time the decision was made, the only half precision type being considered was bfloat16. That motivated choosing 'H' for the type.

I agree that, if OpenBLAS is going to support IEEE 16-bit, this can be confusing. I have no objection to changing it, as long as all the parties involved are in agreement.

Guobing-Chen

comment created time in 2 months

issue comment xianyi/OpenBLAS

Adding POWER9 to travis CI

@martin-frbg I think for the purposes of a functional CI, Travis is enough for now. I'm trying to contact someone at OSU to look at your Jenkins as well.

RajalakshmiSR

comment created time in 2 months

issue comment xianyi/OpenBLAS

Adding POWER9 to travis CI

@martin-frbg excellent, thanks!

So, you can't do anything in the Jenkins dashboard here? If not, I can ask someone there to walk you through this.

RajalakshmiSR

comment created time in 2 months

issue comment xianyi/OpenBLAS

Adding POWER9 to travis CI

@martin-frbg which request form did you submit? The POWER CI form, or the OpenStack one?

Regarding Travis, it seems the problem is that the build script is passing -mcpu=power8, as you can see here

RajalakshmiSR

comment created time in 2 months
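A hedged way to confirm which -mcpu level actually reached the compiler is to inspect GCC's predefined _ARCH_PWR* macros, which -mcpu=power8 and -mcpu=power9 set respectively (the target_isa helper below is illustrative only):

```cpp
#include <string>

// GCC predefines _ARCH_PWR9 for -mcpu=power9 and _ARCH_PWR8 for
// -mcpu=power8; on non-POWER targets neither macro is defined.
std::string target_isa() {
#if defined(_ARCH_PWR9)
    return "power9";
#elif defined(_ARCH_PWR8)
    return "power8";
#else
    return "generic";
#endif
}
```

On a ppc64le host, `echo | gcc -mcpu=power9 -dM -E - | grep _ARCH_PWR` gives the same answer from the command line, which makes it easy to see whether the build scripts downgraded the flag.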

issue comment xianyi/OpenBLAS

Adding POWER9 to travis CI

@martin-frbg cool! If you need help setting up a P9 at OSU, please let me know.

RajalakshmiSR

comment created time in 2 months

issue comment xianyi/OpenBLAS

Adding POWER9 to travis CI

@martin-frbg I think that, as a community, submitting a request for adding P9 to Travis CI wouldn't hurt. I understand that may take a while to happen.

Meanwhile, adding a way to test on P9 in the workflow sounds like a very good idea. Is it something that you would consider?

RajalakshmiSR

comment created time in 2 months

create branch ceseo/stdarch

branch : power9-crypto

created branch time in 2 months

fork ceseo/stdarch

Rust's standard library vendor-specific APIs and run-time feature detection

https://doc.rust-lang.org/stable/core/arch/

fork in 2 months

fork ceseo/rust

Empowering everyone to build reliable and efficient software.

https://www.rust-lang.org

fork in 2 months

pull request comment bazelbuild/bazel

Fixes build issue with JDK headers location on ppc64le.

@philwo what would be required for a ppc64le CI? I could help with hardware access if necessary. In Golang we have a bunch of systems hosted at Oregon State University. Does that interest you?

ceseo

comment created time in 3 months


push event ceseo/bazel

Carlos Eduardo Seo

commit sha bc22ab4068c12b15712db7e3c37f8c70cd04320b

Fixes build issue with JDK headers location on ppc64le.

view details

push time in 3 months

PR opened bazelbuild/bazel

Fixes build issue with JDK headers location on ppc64/ppc64le.

Fixes #11643.

+4 -0

0 comments

1 changed file

pr created time in 3 months

push event ceseo/bazel

mschaller

commit sha ddaea6ae488ec54bd118c7c27176042763a993bc

Remove undetailed BuildFailedException constructors Encodes execution-phase cycle and non-action failures with FailureDetails. Drive-by detailing of a new CleanCommand failure mode. RELNOTES: None. PiperOrigin-RevId: 318347785

view details

mschaller

commit sha 2634de1aa2f4533be718ed75dd5d21c49355b790

Encode missing input file failures with FailureDetails Also refactors missing file message functions so that fewer MissingInputFileExceptions are needlessly constructed. RELNOTES: None. PiperOrigin-RevId: 318356644

view details

mschaller

commit sha 75216c74470090c6f720814dbee239ae4102290e

Remove remaining undetailed BuildFailedException constructor Some loading-phase-related failure modes have placeholder detailed codes to be improved when loading-phase failures are detailed. RELNOTES: None. PiperOrigin-RevId: 318366815

view details

Jingwen Chen

commit sha 3b0439e37247a480e08337a6314d06231bdbafd3

Fix incorrect assumption of desugar persistent worker conditional Fixes https://github.com/bazelbuild/bazel/issues/11618 Needs to be cherry-picked into 3.3.0. Closes #11620. PiperOrigin-RevId: 318428300

view details

Googler

commit sha 8f047619850c78d4e73b7c9346c7204d6a19f4a7

Prefetch inputs before acquiring a worker This reduces the how long we hold exclusive access to a worker and will make it easier to measure the queuing time (acquiring both a worker and the necessary resources). RELNOTES: None. PiperOrigin-RevId: 318442224

view details

lberki

commit sha 89b86e059f900438f7ad9db583aed900b7d60832

Fix crash that happens when an aspect has the same implicit attribute as the rule that defines it. RELNOTES: None. PiperOrigin-RevId: 318456188

view details

dlr

commit sha 577d907a9249f4976d238052c67160f96d323ccc

Avoid possibility of the default Locale interfering with lower-casing input. Instead of String.toLowerCase(), leverage Guava's Ascii.toLowerCase(). Also, factor out common text representations of enabled/disabled values for boolean and tri-state flags, and simplify corresponding implementations. https://help.semmle.com/wiki/pages/viewpage.action?pageId=29393598 https://www.w3.org/International/wiki/Case_folding RELNOTES: none PiperOrigin-RevId: 318499276

view details

mschaller

commit sha 6fba77d4a9a1fae21f1c3267c8e73d43ca872588

Encode remaining SimpleSpawnResult failures with FailureDetails RELNOTES: None. PiperOrigin-RevId: 318514673

view details

Googler

commit sha ddb9b9a60c4df7defcac7c7141dff5eaf411b1f4

Shortened the error log for unknown Starlark options. PiperOrigin-RevId: 318526131

view details

Googler

commit sha 7c3817b501fd3eb72f332f157397fc6c2dbf71d6

Add a reference equality check to Label#equals. PiperOrigin-RevId: 318530598

view details

laurentlb

commit sha 65ed16c871ff7b7c9707a4dc6d50e615f939baf8

Revert the Starlark debugger flags to their old name. IDEs are using those flags, we don't want to break them now. We could have set oldName so that both names would be supported. However, we plan to make the flags non experimental in the future. It's not worth doing two renamings, and we can live with the old names for now. RELNOTES: None. PiperOrigin-RevId: 318534237

view details

adonovan

commit sha d101fc87b65ff45f87bccfdf93c2bdbddd7dc4f5

bazel syntax: reject x<<y if y > 31 Because Java's shift operators only use the low 5 bits of y, we cannot check for overflow simply by unshifting and comparing. Also, detect overflow of integer division (MININT // -1). BUG=159942010,159946493 PiperOrigin-RevId: 318537836

view details

Nikhil Marathe

commit sha f8a94b9decf4c76c217342ce31a05ec30cfb77e8

Restore macOS 10.10 version compatibility to Bazel This is part of the fixes for #11442. (Not a complete fix because it does not address the bazel 2 branch). It makes the macOS deployment target 10.10. Then it wraps the calls to the macOS logging infrastructure so they only execute on macOS 10.12+. I've tested this change and a similar cherry-pick on tag 2.2.0 and confirmed this results in a functioning Bazel (at least for the targets we build) on macOS 10.10. I have questions around what I can do better: 1. I have just added a universal .bazelrc entry. Is there some way to gate this to just macOS in the Bazel infrastructure? Is that even necessary? 2. The Bazel tagged versions are just tags and not branches, which means I cannot submit a PR for a backport to Bazel 2.2. How would you like to handle this? Closes #11565. PiperOrigin-RevId: 318770522

view details

Googler

commit sha 27a5c74f4e716e0044a8407353244f0ed989e686

Populate SpawnMetrics with more metrics in WorkerSpawnRunner The extra metrics that we'll collect in `SpawnMetrics` (on top of total time that we already collect): * Setup time (e.g., prefetching inputs, setting up sandbox, etc.) * Queue time (how long it took to acquire a worker/resources) * Execution time of the worker (time between sending a request and receiving a response) * Time to process outputs * Number of inputs RELNOTES: Collect more performance metrics for worker execution. PiperOrigin-RevId: 318774173

view details

David Ostrovsky

commit sha 239b2aab17cc1f007b2221ada9074bbe0c58db88

Bump error prone to release 2.4.0 Signed-off-by: Philipp Wollermann <philwo@google.com>

view details

Grzegorz Lukasik

commit sha 3e58cca5a72aa72ecfcaed6cee2393e59e284325

Updated expansion for remote_download_outputs It's no longer experimental, changed code and documentation from: `experimental_remote_download_outputs` to `remote_download_outputs` The change to non-experimental was in https://github.com/bazelbuild/bazel/commit/bb26694806c9a9c2b3f39100ddd4b00e0d8285cd Closes #11651. PiperOrigin-RevId: 318794911

view details

laurentlb

commit sha b8bb68e948f0ab5b0d19f4d560ccf29d3b1f39ef

Merge StarlarkIndexable and StarlarkQueryable classes RELNOTES: None. PiperOrigin-RevId: 318801861

view details

plf

commit sha 812595b53d243fc2b556c09d35fd9ca26664b190

Automated rollback of commit bd7999eed71fc14c68890553c68fa5a8f608922c. *** Reason for rollback *** Breaks targets in nightly: [] which are not passing the grep includes binary. *** Original change description *** C++: Give error when grep-includes missing This was crashing during execution of compilation actions. RELNOTES:none PiperOrigin-RevId: 318803927

view details

Philipp Wollermann

commit sha 5585a182659fb45720283dae3a36a05a09c2e54e

Bump javac11 java_tools to v9.0 Closes #11665. PiperOrigin-RevId: 318804764

view details

Andrzej Guszak

commit sha d814a0b779bf2948d364bc5dcba509b66be8610d

Add includes param for the Starlark version of cc_import This change allows users to add a list of include dirs when using Starlark version of cc_import rule (currently hidden behind '--experimental_starlark_cc_import' flag). Closes #11647. PiperOrigin-RevId: 318812537

view details

push time in 3 months

issue comment bazelbuild/bazel

bazel-3.3.0 fails to bootstrap on ppc64le

@redsigma Thanks.

If the maintainers are interested, this fixes the build for both ppc64/ppc64le: https://github.com/ceseo/bazel/commit/2c49a93a13424a899556453e1dfad83168877433

I already have a Google CLA signed for golang, so I think I'm clear.

ceseo

comment created time in 3 months


create branch ceseo/bazel

branch : ppc64-build

created branch time in 3 months
