Umar Arshad umar456 ArrayFire Atlanta, GA

arrayfire/arrayfire-haskell 52

Haskell bindings to ArrayFire

umar456/SIMD 6

SIMD helper classes and functions

umar456/bin2cpp 4

Converts files from a binary to C++ headers. It is similar to bin2c and xxd but adds support for namespaces.

9prady9/arrayfire 1

ArrayFire: a general purpose GPU library.

umar456/multiarray 1

Easily run ArrayFire operations on multiple GPUs

umar456/VR_Recorder 1

Records the HTC Vive's absolute pose and front facing camera images.

umar456/arrayfire 0

ArrayFire: a general purpose GPU library.

umar456/arrayfire-data 0

ArrayFire Test Data

umar456/arrayfire-js 0

ArrayFire.js - ArrayFire for Node.js

issue comment arrayfire/arrayfire

void cuda::reorder slow when using .device to grab pointer

I am not sure I understand your question. It sounds to me like you are calling reorder in your code. I don't think the device call will call any reorder operations unless the operation is already part of the queue. Could you provide some code that demonstrates what you are seeing?

AndyP-APSensing

comment created time in 8 days

issue comment arrayfire/arrayfire

Unknown kernels in NVIDIA visual profiler

What you are seeing are kernels that have been generated at runtime. These kernels are based on element-wise operations on the af::array object. You can see the contents of these kernels if you set the AF_JIT_KERNEL_TRACE environment variable to a directory, stdout, or stderr.

Example:

AF_JIT_KERNEL_TRACE=stdout ./your_application
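
The same variable can also be set from a launcher script, for instance when the trace should go to a directory instead of the terminal. What follows is only a minimal Python sketch: the helper names `jit_trace_env` and `run_with_jit_trace` are hypothetical, and `./your_application` is the same placeholder used above. Only the variable name and its accepted values (a directory, stdout, or stderr) come from the comment itself.

```python
import os
import subprocess


def jit_trace_env(target, base_env=None):
    """Return a copy of the environment with AF_JIT_KERNEL_TRACE set.

    `target` is "stdout", "stderr", or a directory path, as described above.
    The caller's environment is copied, never modified in place.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["AF_JIT_KERNEL_TRACE"] = str(target)
    return env


def run_with_jit_trace(command, target="stdout"):
    """Run `command` (a list of arguments) with kernel tracing enabled."""
    return subprocess.run(command, env=jit_trace_env(target), check=False)


# Example with the placeholder binary name (not executed here):
# run_with_jit_trace(["./your_application"], target="stderr")
```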

AndyP-APSensing

comment created time in 8 days

issue closed arrayfire/arrayfire

Failed Tests when Compiling Against OpenCL on AWS f1.2xlarge Instance

I started an f1.2xlarge AWS instance and compiled the ArrayFire code per the build instructions. It looked like there might be some additional steps for compiling for an FPGA in the ArrayFire/xilinx_demo repo, but that has not been touched since 2016. Are FPGAs supported at any level, or are there plans to support OpenCL for FPGAs? (It looks like you offer a training class.)

Description

  • I built ArrayFire using the instructions located here, with the small modifications shown below: https://github.com/arrayfire/arrayfire/wiki/Build-Instructions-for-Linux#opencl-backend-dependencies
    Amazon Linux 2
    sudo yum install cmake3 boost-devel fftw-devel openblas-devel
    clone arrayfire repository
    mkdir build
    cd build
    cmake3 .. -DCMAKE_BUILD_TYPE=Release
    make test -j8
  • Which backend is experiencing this issue? OpenCL
  • Do you have a workaround? No
  • Can the bug be reproduced reliably on your system? Yes
  • A clear and concise description of what you expected to happen. When running make test, the OpenCL tests fail.
  • Run your executable with AF_TRACE=all and AF_PRINT_ERRORS=1 environment variables set.

The following is the cmake output

[ec2-user@ip-172-31-19-49 build]$ export AF_TRACE=all
[ec2-user@ip-172-31-19-49 build]$ export AF_PRINT_ERRORS=1
[ec2-user@ip-172-31-19-49 build]$ cmake3 .. -DCMAKE_BUILD_TYPE=Release
-- The C compiler identification is GNU 7.3.1
-- The CXX compiler identification is GNU 7.3.1
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
CUDA_TOOLKIT_ROOT_DIR not found or specified
-- Could NOT find CUDA (missing: CUDA_TOOLKIT_ROOT_DIR CUDA_NVCC_EXECUTABLE CUDA_INCLUDE_DIRS CUDA_CUDART_LIBRARY) (Required is at least version "9.0")
-- Found PkgConfig: /usr/bin/pkg-config (found version "0.27.1")
-- Could NOT find cuDNN (missing: cuDNN_LINK_LIBRARY cuDNN_INCLUDE_DIRS) (Required is at least version "4.0")
-- Looking for CL_VERSION_2_0
-- Looking for CL_VERSION_2_0 - found
-- Found OpenCL: /usr/lib64/libOpenCL.so (found suitable version "2.0", minimum required is "1.2")
-- Could NOT find OpenGL (missing: OPENGL_gl_LIBRARY OPENGL_INCLUDE_DIR)
-- Could NOT find FreeImage (missing: FreeImage_INCLUDE_DIR FreeImage_LINK_LIBRARY)
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Looking for pthread_create
-- Looking for pthread_create - not found
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE
-- Checking for module 'fftw3'
--   Found fftw3, version 3.3.3
-- Found FFTW: /usr/include
-- Checking for module 'cblas'
--   No package 'cblas' found
-- Looking for sys/types.h
-- Looking for sys/types.h - found
-- Looking for stdint.h
-- Looking for stdint.h - found
-- Looking for stddef.h
-- Looking for stddef.h - found
-- Check size of void*
-- Check size of void* - done
-- Checking for [Accelerate]
-- Checking for [vecLib]
-- Checking for [cblas - atlas]
-- Includes found
-- Checking for [openblas]
-- Includes found
-- Looking for cblas_dgemm
-- Looking for cblas_dgemm - found
-- CBLAS Symbols FOUND
-- CBLAS library found
-- Could NOT find LAPACK (missing: LAPACK_INCLUDE_DIR)
-- Could NOT find Doxygen (missing: DOXYGEN_EXECUTABLE)
-- Check size of int
-- Check size of int - done
-- MKL: Thread Layer(Intel OpenMP) Interface(4-byte Integer)
-- Could NOT find MKL: Source the compilervars.sh or mklvars.sh scripts included with your installation of MKL. This script searches for the libraries in MKLROOT, LIBRARY_PATHS(Linux), and LIB(Windows) environment variables (missing: MKL_INCLUDE_DIR MKL_Core_LINK_LIBRARY MKL_Interface_LINK_LIBRARY MKL_ThreadLayer_LINK_LIBRARY)
-- Could NOT find MKL: Source the compilervars.sh or mklvars.sh scripts included with your installation of MKL. This script searches for the libraries in MKLROOT, LIBRARY_PATHS(Linux), and LIB(Windows) environment variables (missing: MKL_INCLUDE_DIR MKL_Core_STATIC_LINK_LIBRARY MKL_Interface_STATIC_LINK_LIBRARY MKL_ThreadLayer_STATIC_LINK_LIBRARY)
-- Boost version: 1.70.0
-- Performing Test has_ignored_attributes_flag
-- Performing Test has_ignored_attributes_flag - Success
-- Performing Test has_all_warnings_flag
-- Performing Test has_all_warnings_flag - Success
-- Found OpenCL: /usr/lib64/libOpenCL.so (found version "2.0")
-- UNICODE feature disabled on linux
-- 64bit build - FIND_LIBRARY_USE_LIB64_PATHS TRUE
CMake Warning at /usr/share/cmake3/Modules/FindBoost.cmake:880 (message):
  New Boost version may have incorrect or missing dependencies and imported
  targets
Call Stack (most recent call first):
  /usr/share/cmake3/Modules/FindBoost.cmake:1002 (_Boost_COMPONENT_DEPENDENCIES)
  /usr/share/cmake3/Modules/FindBoost.cmake:1670 (_Boost_MISSING_DEPENDENCIES)
  build/extern/ocl_clfft-src/src/CMakeLists.txt:129 (find_package)


-- Could NOT find Boost
CMake Warning at build/extern/ocl_clfft-src/src/CMakeLists.txt:133 (message):
  Try setting Boost_DEBUG and Boost_DETAILED_FAILURE_MSG for more information


-- Found OpenCL: /lib64/libOpenCL.so
-- Found FFTW: /lib64/libfftw3f.so;/lib64/libfftw3.so
-- Detected GNU fortran compiler.
-- CMAKE_CXX_COMPILER flags: -m64 -pthread
-- CMAKE_CXX_COMPILER debug flags: -g
-- CMAKE_CXX_COMPILER release flags: -O2 -DNDEBUG
-- CMAKE_CXX_COMPILER relwithdebinfo flags: -O2 -g -DNDEBUG
-- CMAKE_EXE_LINKER link flags:
FFT clients will NOT be built
GoogleTest unit tests will NOT be built
FFT callback client will NOT be built
-- Found OpenCL: /usr/lib64/libOpenCL.so (found version "2.0")
-- Found PythonInterp: /home/ec2-user/miniconda3/bin/python (found version "3.9.1")
-- Check if compiler accepts -pthread
-- Check if compiler accepts -pthread - yes
-- Configuring done
-- Generating done
-- Build files have been written to: /home/ec2-user/git/arrayfire/build
The build seemed to work correctly (I have the log if needed).
(base) [ec2-user@ip-172-31-19-49 build]$ make test
Running tests...
Test project /home/ec2-user/git/arrayfire/build
        Start   1: test_anisotropic_diffusion_cpu
  1/263 Test   #1: test_anisotropic_diffusion_cpu ........................   Passed    0.01 sec
        Start   2: test_anisotropic_diffusion_opencl
  2/263 Test   #2: test_anisotropic_diffusion_opencl .....................***Failed    0.03 sec
        Start   3: test_approx1_cpu
  3/263 Test   #3: test_approx1_cpu ......................................   Passed    2.08 sec
        Start   4: test_approx1_opencl
  4/263 Test   #4: test_approx1_opencl ...................................***Failed    0.11 sec
        Start   5: test_approx2_cpu
  5/263 Test   #5: test_approx2_cpu ......................................   Passed    2.52 sec
        Start   6: test_approx2_opencl
  6/263 Test   #6: test_approx2_opencl ...................................***Failed    0.14 sec
        Start   7: test_array_cpu
  7/263 Test   #7: test_array_cpu ........................................   Passed    0.02 sec
        Start   8: test_array_opencl
  8/263 Test   #8: test_array_opencl .....................................***Failed    0.14 sec
        Start   9: test_arrayio_cpu
  9/263 Test   #9: test_arrayio_cpu ......................................   Passed    0.01 sec
        Start  10: test_arrayio_opencl
 10/263 Test  #10: test_arrayio_opencl ...................................***Failed    0.05 sec
        Start  11: test_assign_cpu
 11/263 Test  #11: test_assign_cpu .......................................   Passed    0.29 sec
        Start  12: test_assign_opencl
 12/263 Test  #12: test_assign_opencl ....................................***Failed    0.31 sec
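
The pattern in the log above (each `_cpu` test passes and its `_opencl` counterpart fails) can be tallied from a saved ctest log. The following is a minimal Python sketch, not part of the original report: the function name `summarize` and the regular expression are assumptions based only on the line format shown here, and the sample text is copied from the first four result lines.

```python
import re
from collections import Counter

# Matches ctest result lines such as:
#   2/263 Test   #2: test_array_opencl ....***Failed    0.03 sec
RESULT_LINE = re.compile(
    r"Test\s+#\d+:\s+(?P<name>\S+)\s+\.+\s*(?P<status>Passed|\*\*\*Failed)"
)


def summarize(log_text):
    """Count passed/failed tests per backend suffix (e.g. cpu, opencl)."""
    counts = Counter()
    for match in RESULT_LINE.finditer(log_text):
        # The backend is the part of the test name after the last underscore.
        backend = match.group("name").rsplit("_", 1)[-1]
        status = "failed" if "Failed" in match.group("status") else "passed"
        counts[(backend, status)] += 1
    return counts


sample = """\
  1/263 Test   #1: test_anisotropic_diffusion_cpu ........................   Passed    0.01 sec
  2/263 Test   #2: test_anisotropic_diffusion_opencl .....................***Failed    0.03 sec
  3/263 Test   #3: test_approx1_cpu ......................................   Passed    2.08 sec
  4/263 Test   #4: test_approx1_opencl ...................................***Failed    0.11 sec
"""

for (backend, status), n in sorted(summarize(sample).items()):
    print(f"{backend}: {n} {status}")
```

To see why an individual test fails, ctest can also be run on a single test with `ctest -R test_array_opencl --output-on-failure`.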

Reproducible Code and/or Steps

sudo yum install cmake3 boost-devel fftw-devel openblas-devel
clone arrayfire repository
mkdir build
cd build
cmake3 .. -DCMAKE_BUILD_TYPE=Release
make -j8
make test

System Information

  1. ArrayFire version master
  2. Devices installed on the system (Xilinx Virtex UltraScale+ VU9P)
  3. Output from the following scripts:

Run one of the following commands based on your OS

Linux:

(base) [ec2-user@ip-172-31-19-49 build]$ lsb_release -a
LSB Version:  :core-4.1-amd64:core-4.1-noarch
Distributor ID: Amazon
Description:    Amazon Linux release 2 (Karoo)
Release:        2
Codename:       Karoo
(base) [ec2-user@ip-172-31-19-49 build]$ if command -v nvidia-smi >/dev/null; then
>   nvidia-smi --query-gpu="name,memory.total,driver_version" --format=csv -i 0
> else
>   echo "nvidia-smi not found"
> fi
nvidia-smi not found
(base) [ec2-user@ip-172-31-19-49 build]$ if command -v /opt/rocm/bin/rocm-smi >/dev/null; then
>   /opt/rocm/bin/rocm-smi --showproductname
> else
>   echo "rocm-smi not found."
> fi
rocm-smi not found.
(base) [ec2-user@ip-172-31-19-49 build]$ if command -v clinfo > /dev/null; then
>   clinfo
> else
>   echo "clinfo not found."
> fi
clinfo not found.

Checklist

  • [ x] Using the latest available ArrayFire release
  • [ NA] GPU drivers are up to date

closed time in 25 days

brianmcconnel

issue comment arrayfire/arrayfire

Failed Tests when Compiling Against OpenCL on AWS f1.2xlarge Instance

I remember that at one point I could compile the source on the Xilinx platform, but I don't think it was usable. I don't think it's something we will support.

brianmcconnel

comment created time in 25 days

issue closed CNugteren/CLBlast

DGEMM failures on Turing GPUs

I am getting DGEMM failures on Turing GPUs. I have tested on the NVIDIA T4 and NVIDIA T2000. Here are the logs from my system:

* Running on OpenCL device 'Quadro T2000'.
* Starting tests for the 'DGEMM' routine. Legend:
   : -> Test produced correct results
   . -> Test returned the correct error code
   X -> Test produced incorrect results
   / -> Test returned an incorrect error code
   \ -> Test not executed: OpenCL-kernel compilation error
   o -> Test not executed: Unsupported precision
   - -> Test not completed: Reference CBLAS doesn't output error codes
* Testing with error margins of 0.5% (relative) and 0.001 (absolute)
* Testing 'regular behaviour' for '101 (row-major) 111 (regular) 111 (regular)':
   XXXXXXXX----XXXX---X---X-------XXXXXXXXX----XXXX---X---X-------X
   Error rate 24.43%: m=7 n=7 k=7 lda=7 ldb=7 ldc=7 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 24.43%: m=7 n=7 k=7 lda=7 ldb=7 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 23.86%: m=7 n=7 k=7 lda=7 ldb=64 ldc=7 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 23.86%: m=7 n=7 k=7 lda=7 ldb=64 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 24.43%: m=7 n=7 k=7 lda=64 ldb=7 ldc=7 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 24.43%: m=7 n=7 k=7 lda=64 ldb=7 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 24.43%: m=7 n=7 k=7 lda=64 ldb=64 ldc=7 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 24.43%: m=7 n=7 k=7 lda=64 ldb=64 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 23.86%: m=7 n=7 k=64 lda=64 ldb=7 ldc=7 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 23.86%: m=7 n=7 k=64 lda=64 ldb=7 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 24.43%: m=7 n=7 k=64 lda=64 ldb=64 ldc=7 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 24.43%: m=7 n=7 k=64 lda=64 ldb=64 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 33.39%: m=7 n=64 k=7 lda=7 ldb=64 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 61.22%: m=7 n=64 k=7 lda=64 ldb=64 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 33.22%: m=7 n=64 k=64 lda=64 ldb=64 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 76.87%: m=64 n=7 k=7 lda=7 ldb=7 ldc=7 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 76.35%: m=64 n=7 k=7 lda=7 ldb=7 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 76.52%: m=64 n=7 k=7 lda=7 ldb=64 ldc=7 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 76.70%: m=64 n=7 k=7 lda=7 ldb=64 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 76.70%: m=64 n=7 k=7 lda=64 ldb=7 ldc=7 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 76.70%: m=64 n=7 k=7 lda=64 ldb=7 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 76.52%: m=64 n=7 k=7 lda=64 ldb=64 ldc=7 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 76.70%: m=64 n=7 k=7 lda=64 ldb=64 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 74.26%: m=64 n=7 k=64 lda=64 ldb=7 ldc=7 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 72.87%: m=64 n=7 k=64 lda=64 ldb=7 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 72.00%: m=64 n=7 k=64 lda=64 ldb=64 ldc=7 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 73.22%: m=64 n=7 k=64 lda=64 ldb=64 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 78.74%: m=64 n=64 k=7 lda=7 ldb=64 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 79.87%: m=64 n=64 k=7 lda=64 ldb=64 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 72.70%: m=64 n=64 k=64 lda=64 ldb=64 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Pass rate   0.0%: 0 passed / 34 skipped / 30 failed
* Testing 'regular behaviour' for '101 (row-major) 111 (regular) 112 (transposed)':
   XXXXXXXX------XX-X-X-X-X-------XXXXXXXXX------XX-X-X-X-X-------X
   Error rate 24.43%: m=7 n=7 k=7 lda=7 ldb=7 ldc=7 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 24.43%: m=7 n=7 k=7 lda=7 ldb=7 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 24.43%: m=7 n=7 k=7 lda=7 ldb=64 ldc=7 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 24.43%: m=7 n=7 k=7 lda=7 ldb=64 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 24.43%: m=7 n=7 k=7 lda=64 ldb=7 ldc=7 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 24.43%: m=7 n=7 k=7 lda=64 ldb=7 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 24.43%: m=7 n=7 k=7 lda=64 ldb=64 ldc=7 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 24.43%: m=7 n=7 k=7 lda=64 ldb=64 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 20.45%: m=7 n=7 k=64 lda=64 ldb=64 ldc=7 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 24.43%: m=7 n=7 k=64 lda=64 ldb=64 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 66.96%: m=7 n=64 k=7 lda=7 ldb=7 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 50.09%: m=7 n=64 k=7 lda=7 ldb=64 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 53.04%: m=7 n=64 k=7 lda=64 ldb=7 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 44.52%: m=7 n=64 k=7 lda=64 ldb=64 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 33.22%: m=7 n=64 k=64 lda=64 ldb=64 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 76.87%: m=64 n=7 k=7 lda=7 ldb=7 ldc=7 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 76.87%: m=64 n=7 k=7 lda=7 ldb=7 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 76.87%: m=64 n=7 k=7 lda=7 ldb=64 ldc=7 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 76.87%: m=64 n=7 k=7 lda=7 ldb=64 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 76.70%: m=64 n=7 k=7 lda=64 ldb=7 ldc=7 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 76.87%: m=64 n=7 k=7 lda=64 ldb=7 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 76.87%: m=64 n=7 k=7 lda=64 ldb=64 ldc=7 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 76.52%: m=64 n=7 k=7 lda=64 ldb=64 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 73.22%: m=64 n=7 k=64 lda=64 ldb=64 ldc=7 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 73.22%: m=64 n=7 k=64 lda=64 ldb=64 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 75.68%: m=64 n=64 k=7 lda=7 ldb=7 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 77.22%: m=64 n=64 k=7 lda=7 ldb=64 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 77.24%: m=64 n=64 k=7 lda=64 ldb=7 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 75.73%: m=64 n=64 k=7 lda=64 ldb=64 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 82.19%: m=64 n=64 k=64 lda=64 ldb=64 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Pass rate   0.0%: 0 passed / 34 skipped / 30 failed
* Testing 'regular behaviour' for '101 (row-major) 112 (transposed) 111 (regular)':
   XXXXXXXXXXXXXXXX---X---X---X---X----XXXX----XXXX-------X-------X
   Error rate 24.43%: m=7 n=7 k=7 lda=7 ldb=7 ldc=7 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 24.43%: m=7 n=7 k=7 lda=7 ldb=7 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 24.43%: m=7 n=7 k=7 lda=7 ldb=64 ldc=7 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 24.43%: m=7 n=7 k=7 lda=7 ldb=64 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 24.43%: m=7 n=7 k=7 lda=64 ldb=7 ldc=7 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 24.43%: m=7 n=7 k=7 lda=64 ldb=7 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 24.43%: m=7 n=7 k=7 lda=64 ldb=64 ldc=7 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 24.43%: m=7 n=7 k=7 lda=64 ldb=64 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 24.43%: m=7 n=7 k=64 lda=7 ldb=7 ldc=7 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 24.43%: m=7 n=7 k=64 lda=7 ldb=7 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 24.43%: m=7 n=7 k=64 lda=7 ldb=64 ldc=7 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 24.43%: m=7 n=7 k=64 lda=7 ldb=64 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 24.43%: m=7 n=7 k=64 lda=64 ldb=7 ldc=7 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 24.43%: m=7 n=7 k=64 lda=64 ldb=7 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 20.45%: m=7 n=7 k=64 lda=64 ldb=64 ldc=7 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 20.45%: m=7 n=7 k=64 lda=64 ldb=64 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 50.26%: m=7 n=64 k=7 lda=7 ldb=64 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 47.30%: m=7 n=64 k=7 lda=64 ldb=64 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 53.04%: m=7 n=64 k=64 lda=7 ldb=64 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 33.57%: m=7 n=64 k=64 lda=64 ldb=64 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 76.87%: m=64 n=7 k=7 lda=64 ldb=7 ldc=7 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 76.52%: m=64 n=7 k=7 lda=64 ldb=7 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 76.70%: m=64 n=7 k=7 lda=64 ldb=64 ldc=7 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 76.87%: m=64 n=7 k=7 lda=64 ldb=64 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 73.04%: m=64 n=7 k=64 lda=64 ldb=7 ldc=7 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 74.43%: m=64 n=7 k=64 lda=64 ldb=7 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 73.22%: m=64 n=7 k=64 lda=64 ldb=64 ldc=7 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 73.22%: m=64 n=7 k=64 lda=64 ldb=64 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 74.95%: m=64 n=64 k=7 lda=64 ldb=64 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 83.31%: m=64 n=64 k=64 lda=64 ldb=64 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Pass rate   0.0%: 0 passed / 34 skipped / 30 failed
* Testing 'regular behaviour' for '101 (row-major) 112 (transposed) 112 (transposed)':
   XXXXXXXX--XX--XX-X-X-X-X---X---X----XXXX------XX-----X-X-------X
   Error rate 24.43%: m=7 n=7 k=7 lda=7 ldb=7 ldc=7 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 24.43%: m=7 n=7 k=7 lda=7 ldb=7 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 24.43%: m=7 n=7 k=7 lda=7 ldb=64 ldc=7 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 24.43%: m=7 n=7 k=7 lda=7 ldb=64 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 24.43%: m=7 n=7 k=7 lda=64 ldb=7 ldc=7 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 24.43%: m=7 n=7 k=7 lda=64 ldb=7 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 24.43%: m=7 n=7 k=7 lda=64 ldb=64 ldc=7 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 24.43%: m=7 n=7 k=7 lda=64 ldb=64 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 24.43%: m=7 n=7 k=64 lda=7 ldb=64 ldc=7 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 24.43%: m=7 n=7 k=64 lda=7 ldb=64 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 20.45%: m=7 n=7 k=64 lda=64 ldb=64 ldc=7 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 23.86%: m=7 n=7 k=64 lda=64 ldb=64 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 66.61%: m=7 n=64 k=7 lda=7 ldb=7 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 33.57%: m=7 n=64 k=7 lda=7 ldb=64 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 55.83%: m=7 n=64 k=7 lda=64 ldb=7 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 52.87%: m=7 n=64 k=7 lda=64 ldb=64 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 39.13%: m=7 n=64 k=64 lda=7 ldb=64 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 44.70%: m=7 n=64 k=64 lda=64 ldb=64 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 76.35%: m=64 n=7 k=7 lda=64 ldb=7 ldc=7 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 76.70%: m=64 n=7 k=7 lda=64 ldb=7 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 76.87%: m=64 n=7 k=7 lda=64 ldb=64 ldc=7 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 76.87%: m=64 n=7 k=7 lda=64 ldb=64 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 73.22%: m=64 n=7 k=64 lda=64 ldb=64 ldc=7 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 75.48%: m=64 n=7 k=64 lda=64 ldb=64 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 74.50%: m=64 n=64 k=7 lda=64 ldb=7 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 77.15%: m=64 n=64 k=7 lda=64 ldb=64 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 82.50%: m=64 n=64 k=64 lda=64 ldb=64 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Pass rate   0.0%: 0 passed / 37 skipped / 27 failed
* Testing 'regular behaviour' for '102 (col-major) 111 (regular) 111 (regular)':
   XXXXXXXX--XX--XXXXXXXXXX--XX--XX-----X-X-------X-----X-X-------X
   Error rate 24.43%: m=7 n=7 k=7 lda=7 ldb=7 ldc=7 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 24.43%: m=7 n=7 k=7 lda=7 ldb=7 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 24.43%: m=7 n=7 k=7 lda=7 ldb=64 ldc=7 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 24.43%: m=7 n=7 k=7 lda=7 ldb=64 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 24.43%: m=7 n=7 k=7 lda=64 ldb=7 ldc=7 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 24.43%: m=7 n=7 k=7 lda=64 ldb=7 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 24.43%: m=7 n=7 k=7 lda=64 ldb=64 ldc=7 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 24.43%: m=7 n=7 k=7 lda=64 ldb=64 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 20.45%: m=7 n=7 k=64 lda=7 ldb=64 ldc=7 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 24.43%: m=7 n=7 k=64 lda=7 ldb=64 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 24.43%: m=7 n=7 k=64 lda=64 ldb=64 ldc=7 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 24.43%: m=7 n=7 k=64 lda=64 ldb=64 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 53.04%: m=7 n=64 k=7 lda=7 ldb=7 ldc=7 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 33.57%: m=7 n=64 k=7 lda=7 ldb=7 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 33.57%: m=7 n=64 k=7 lda=7 ldb=64 ldc=7 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 33.57%: m=7 n=64 k=7 lda=7 ldb=64 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 50.26%: m=7 n=64 k=7 lda=64 ldb=7 ldc=7 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 61.39%: m=7 n=64 k=7 lda=64 ldb=7 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 53.04%: m=7 n=64 k=7 lda=64 ldb=64 ldc=7 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 49.91%: m=7 n=64 k=7 lda=64 ldb=64 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 44.35%: m=7 n=64 k=64 lda=7 ldb=64 ldc=7 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 33.39%: m=7 n=64 k=64 lda=7 ldb=64 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 44.52%: m=7 n=64 k=64 lda=64 ldb=64 ldc=7 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 39.13%: m=7 n=64 k=64 lda=64 ldb=64 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 76.52%: m=64 n=7 k=7 lda=64 ldb=7 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 76.87%: m=64 n=7 k=7 lda=64 ldb=64 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 73.22%: m=64 n=7 k=64 lda=64 ldb=64 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 75.63%: m=64 n=64 k=7 lda=64 ldb=7 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 79.14%: m=64 n=64 k=7 lda=64 ldb=64 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 79.56%: m=64 n=64 k=64 lda=64 ldb=64 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Pass rate   0.0%: 0 passed / 34 skipped / 30 failed
* Testing 'regular behaviour' for '102 (col-major) 111 (regular) 112 (transposed)':
   XXXXXXXXXXXXXXXX--XX--XX--XX--XX-----X-X-----X-X-------X-------X
   Error rate 24.43%: m=7 n=7 k=7 lda=7 ldb=7 ldc=7 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 24.43%: m=7 n=7 k=7 lda=7 ldb=7 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 24.43%: m=7 n=7 k=7 lda=7 ldb=64 ldc=7 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 24.43%: m=7 n=7 k=7 lda=7 ldb=64 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 24.43%: m=7 n=7 k=7 lda=64 ldb=7 ldc=7 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 24.43%: m=7 n=7 k=7 lda=64 ldb=7 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 24.43%: m=7 n=7 k=7 lda=64 ldb=64 ldc=7 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 24.43%: m=7 n=7 k=7 lda=64 ldb=64 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 20.45%: m=7 n=7 k=64 lda=7 ldb=7 ldc=7 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 24.43%: m=7 n=7 k=64 lda=7 ldb=7 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 24.43%: m=7 n=7 k=64 lda=7 ldb=64 ldc=7 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 24.43%: m=7 n=7 k=64 lda=7 ldb=64 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 24.43%: m=7 n=7 k=64 lda=64 ldb=7 ldc=7 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 24.43%: m=7 n=7 k=64 lda=64 ldb=7 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 20.45%: m=7 n=7 k=64 lda=64 ldb=64 ldc=7 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 23.86%: m=7 n=7 k=64 lda=64 ldb=64 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 61.04%: m=7 n=64 k=7 lda=7 ldb=64 ldc=7 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 33.39%: m=7 n=64 k=7 lda=7 ldb=64 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 58.61%: m=7 n=64 k=7 lda=64 ldb=64 ldc=7 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 47.30%: m=7 n=64 k=7 lda=64 ldb=64 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 44.70%: m=7 n=64 k=64 lda=7 ldb=64 ldc=7 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 50.26%: m=7 n=64 k=64 lda=7 ldb=64 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 36.35%: m=7 n=64 k=64 lda=64 ldb=64 ldc=7 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 33.39%: m=7 n=64 k=64 lda=64 ldb=64 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 76.70%: m=64 n=7 k=7 lda=64 ldb=7 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 76.87%: m=64 n=7 k=7 lda=64 ldb=64 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 73.22%: m=64 n=7 k=64 lda=64 ldb=7 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 73.04%: m=64 n=7 k=64 lda=64 ldb=64 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 71.56%: m=64 n=64 k=7 lda=64 ldb=64 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 79.54%: m=64 n=64 k=64 lda=64 ldb=64 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Pass rate   0.0%: 0 passed / 34 skipped / 30 failed
* Testing 'regular behaviour' for '102 (col-major) 112 (transposed) 111 (regular)':
   XXXXXXXX------XXXXXXXXXX------XX-X-X-X-X-------X-X-X-X-X-------X
   Error rate 24.43%: m=7 n=7 k=7 lda=7 ldb=7 ldc=7 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 24.43%: m=7 n=7 k=7 lda=7 ldb=7 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 24.43%: m=7 n=7 k=7 lda=7 ldb=64 ldc=7 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 24.43%: m=7 n=7 k=7 lda=7 ldb=64 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 24.43%: m=7 n=7 k=7 lda=64 ldb=7 ldc=7 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 24.43%: m=7 n=7 k=7 lda=64 ldb=7 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 24.43%: m=7 n=7 k=7 lda=64 ldb=64 ldc=7 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 24.43%: m=7 n=7 k=7 lda=64 ldb=64 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 24.43%: m=7 n=7 k=64 lda=64 ldb=64 ldc=7 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 24.43%: m=7 n=7 k=64 lda=64 ldb=64 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 66.96%: m=7 n=64 k=7 lda=7 ldb=7 ldc=7 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 33.22%: m=7 n=64 k=7 lda=7 ldb=7 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 33.57%: m=7 n=64 k=7 lda=7 ldb=64 ldc=7 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 44.70%: m=7 n=64 k=7 lda=7 ldb=64 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 55.83%: m=7 n=64 k=7 lda=64 ldb=7 ldc=7 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 64.17%: m=7 n=64 k=7 lda=64 ldb=7 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 33.57%: m=7 n=64 k=7 lda=64 ldb=64 ldc=7 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 53.04%: m=7 n=64 k=7 lda=64 ldb=64 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 33.57%: m=7 n=64 k=64 lda=64 ldb=64 ldc=7 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 50.09%: m=7 n=64 k=64 lda=64 ldb=64 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 76.70%: m=64 n=7 k=7 lda=7 ldb=7 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 76.70%: m=64 n=7 k=7 lda=7 ldb=64 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 76.87%: m=64 n=7 k=7 lda=64 ldb=7 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 76.87%: m=64 n=7 k=7 lda=64 ldb=64 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 74.43%: m=64 n=7 k=64 lda=64 ldb=64 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 75.66%: m=64 n=64 k=7 lda=7 ldb=7 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 85.51%: m=64 n=64 k=7 lda=7 ldb=64 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 78.76%: m=64 n=64 k=7 lda=64 ldb=7 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 79.92%: m=64 n=64 k=7 lda=64 ldb=64 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 85.58%: m=64 n=64 k=64 lda=64 ldb=64 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Pass rate   0.0%: 0 passed / 34 skipped / 30 failed
* Testing 'regular behaviour' for '102 (col-major) 112 (transposed) 112 (transposed)':
   XXXXXXXX----XXXX--XX--XX------XX-X-X-X-X-----X-X---X---X-------X
   Error rate 24.43%: m=7 n=7 k=7 lda=7 ldb=7 ldc=7 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 24.43%: m=7 n=7 k=7 lda=7 ldb=7 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 23.86%: m=7 n=7 k=7 lda=7 ldb=64 ldc=7 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 23.86%: m=7 n=7 k=7 lda=7 ldb=64 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 24.43%: m=7 n=7 k=7 lda=64 ldb=7 ldc=7 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 24.43%: m=7 n=7 k=7 lda=64 ldb=7 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 24.43%: m=7 n=7 k=7 lda=64 ldb=64 ldc=7 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 24.43%: m=7 n=7 k=7 lda=64 ldb=64 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 24.43%: m=7 n=7 k=64 lda=64 ldb=7 ldc=7 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 24.43%: m=7 n=7 k=64 lda=64 ldb=7 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 20.45%: m=7 n=7 k=64 lda=64 ldb=64 ldc=7 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 24.43%: m=7 n=7 k=64 lda=64 ldb=64 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 66.09%: m=7 n=64 k=7 lda=7 ldb=64 ldc=7 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 50.09%: m=7 n=64 k=7 lda=7 ldb=64 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 41.91%: m=7 n=64 k=7 lda=64 ldb=64 ldc=7 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 58.61%: m=7 n=64 k=7 lda=64 ldb=64 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 53.04%: m=7 n=64 k=64 lda=64 ldb=64 ldc=7 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 58.26%: m=7 n=64 k=64 lda=64 ldb=64 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 76.87%: m=64 n=7 k=7 lda=7 ldb=7 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 76.70%: m=64 n=7 k=7 lda=7 ldb=64 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 76.52%: m=64 n=7 k=7 lda=64 ldb=7 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 76.70%: m=64 n=7 k=7 lda=64 ldb=64 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 72.00%: m=64 n=7 k=64 lda=64 ldb=7 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 72.00%: m=64 n=7 k=64 lda=64 ldb=64 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 74.59%: m=64 n=64 k=7 lda=7 ldb=64 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 79.16%: m=64 n=64 k=7 lda=64 ldb=64 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Error rate 79.56%: m=64 n=64 k=64 lda=64 ldb=64 ldc=64 offa=0 offb=0 offc=0 alpha=3.14 beta=3.14 
   Pass rate   0.0%: 0 passed / 37 skipped / 27 failed
* Completed all test-cases for this routine. Results:
   0 test(s) passed
   278 test(s) skipped
   234 test(s) failed
 --- OpenCL device naming:
* Device type                   GPU
* Device name                   Quadro T2000
* Platform vendor               NVIDIA Corporation
* Platform version              OpenCL 3.0 CUDA 11.4.94

 --- CLBlast device naming:
* Device type                   GPU
* Device name                   Quadro T2000
* Device vendor                 NVIDIA
* Device architecture           SM7.5

 --- OpenCL device properties:
* Max work group size           1024
* Max work item dimensions      3
* - Max work item size #0       1024
* - Max work item size #1       1024
* - Max work item size #2       64
* Local memory size             49152KB
* Extensions:
cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_copy_opts cl_khr_gl_event cl_nv_create_buffer cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_nv_kernel_attribute cl_khr_device_uuid cl_khr_pci_bus_info

 --- Some OpenCL library benchmarks (functions from clpp11.h):
* queue.GetContext()            0.0001 ms
* queue.GetDevice()             0.0000 ms
* device.Name()                 0.0746 ms
* device.Vendor()               0.0001 ms
* device.Version()              0.0005 ms
* device.Platform()             0.0000 ms
* Buffer<float>(context, 1024)  0.0005 ms
umar@gus /d/d/C/build (master)> uname -a
Linux gus 5.13.7-arch1-1 #1 SMP PREEMPT Sat, 31 Jul 2021 13:18:52 +0000 x86_64 GNU/Linux

closed time in 25 days

umar456

issue comment CNugteren/CLBlast

DGEMM failures on Turing GPUs

Yeah, looks like that fixed it. The A100 is failing, but I think we are getting intermittent failures in other PRs as well. I can probably investigate it at the end of next week. I will close this issue.

umar456

comment created time in 25 days

create branch umar456/arrayfire

branch : moddims_jit

created branch time in a month

issue comment CNugteren/CLBlast

DGEMM failures on Turing GPUs

Yes, I changed the parameters of the T4 to use the T2000 parameters and the tests passed. I can send you the exact diff of the repository to make it work. My concern is that the database.json file would still have incorrect values and would need to be adjusted to avoid errors when you update the parameters next time. I didn't look into the structure of that file to make the relevant changes.

umar456

comment created time in a month

issue comment CNugteren/CLBlast

DGEMM failures on Turing GPUs

Here are the updated JSON files for the T4. They seem to be different from the parameters in the database. I am not sure why that would be the case. T4.zip

umar456

comment created time in a month

issue comment CNugteren/CLBlast

DGEMM failures on Turing GPUs

I tried to run the tuners on the T4 several times and they are not able to find stable parameters. Based on the logs, it looks like they are failing for the smaller matrices. I tried replacing the T4 parameters with the T2000 parameters in src/database/kernels/xgemm/xgemm_64.hpp and it seems to work. I have noticed that the final parameters are very different even though the architecture is similar. I am not sure if the tests the tuners perform are exhaustive, so maybe it's picking parameters that work for the sizes the tuners test but not for other sizes.

umar456

comment created time in a month

issue comment CNugteren/CLBlast

DGEMM failures on Turing GPUs

Hmm. I ran all tests before I posted the tuning parameters. I am not sure why they are failing now. I can look into this later today.

umar456

comment created time in a month

pull request comment CNugteren/CLBlast

New tuning results for 1 Intel CPU and 5 NVIDIA GPUs

I think it would be a good idea to mention limiting the clock speeds of the processors before performing the tuning tests. When I run the tuners on the CPUs and GPUs, I run the following commands:

Set the CPU frequency

sudo cpupower frequency-set -g performance
sudo cpupower frequency-set -u 3100

Set NVIDIA GPU frequency

sudo nvidia-smi -i <device id> -lgc <clock-speed>

You can get the possible frequencies for your NVIDIA GPU using the following command:

sudo nvidia-smi -i <device id> --query-supported-clocks=gr --format=csv

I suggest that you pick a clock speed that will be stable, somewhere in the middle of the range of frequencies listed above.
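The "pick something in the middle" advice could be sketched as a small helper. This is only an illustration: the function names are hypothetical, the sample clock values are made up, and it assumes the CSV output of the query command is a header line followed by one `<number> MHz` value per line.

```python
import re


def parse_supported_clocks(csv_text):
    """Return the distinct clock values (in MHz) found in the text, ascending.

    Assumption: each data line looks like "1500 MHz"; anything else
    (such as the header line) is ignored.
    """
    clocks = set()
    for line in csv_text.splitlines():
        match = re.fullmatch(r"\s*(\d+)\s*MHz\s*", line)
        if match:
            clocks.add(int(match.group(1)))
    return sorted(clocks)


def pick_middle_clock(csv_text):
    """Pick the clock value in the middle of the supported range."""
    clocks = parse_supported_clocks(csv_text)
    if not clocks:
        raise ValueError("no clock values found in the input")
    return clocks[len(clocks) // 2]


# Made-up sample in the assumed output format, not real device data.
SAMPLE = """graphics [MHz]
2100 MHz
1800 MHz
1500 MHz
1200 MHz
900 MHz
"""

if __name__ == "__main__":
    print(pick_middle_clock(SAMPLE))
```

The value it returns is what you would pass as `<clock-speed>` to the `nvidia-smi -i <device id> -lgc <clock-speed>` command above.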

CNugteren

comment created time in a month

push event umar456/arrayfire

Umar Arshad

commit sha a2426ca974d1b022859d026ad2d12ff0228ffe6e

Use newer version of FindOpenMP.cmake

view details

push time in a month

issue comment CNugteren/CLBlast

New tuning results

CNugteren

comment created time in a month

PR closed arrayfire/arrayfire

Use older CLBlast to avoid DGEMM errors on Turing. Update FindOpenMP

This PR updates the tag for the CLBlast version to avoid failures in DGEMM on Turing GPUs. It also updates FindOpenMP.cmake to the version found in CMake 3.21.

Description

  • Revert CLBlast to an older commit
  • Update FindOpenMP.cmake

Changes to Users

None

Checklist


  • [x] Rebased on latest master
  • [x] Code compiles
  • [x] Tests pass
  • ~[ ] Functions added to unified API~
  • ~[ ] Functions documented~
+1214 -123

0 comment

5 changed files

umar456

pr closed time in a month

issue comment CNugteren/CLBlast

DGEMM failures on Turing GPUs

Okay. I reran all the tuners and made sure that the CPU and GPU clocks were stable while they were running. I was able to get some valid parameters to run the tests with the correct results. I will upload them to the #1 issue.

umar456

comment created time in a month

PR opened arrayfire/arrayfire

Use older CLBlast to avoid DGEMM errors on Turing. Update FindOpenMP

This PR updates the tag for the CLBlast version to avoid failures in DGEMM on Turing GPUs. It also updates FindOpenMP.cmake to the version found in CMake 3.21.

Description

  • Revert CLBlast to an older commit
  • Update FindOpenMP.cmake

Changes to Users

None

Checklist


  • [x] Rebased on latest master
  • [x] Code compiles
  • [x] Tests pass
  • ~[ ] Functions added to unified API~
  • ~[ ] Functions documented~
+1214 -123

0 comment

5 changed files

pr created time in a month

create branch umar456/arrayfire

branch : clblast_openmp

created branch time in a month

push event umar456/arrayfire

Umar Arshad

commit sha 9f38a342fb967e3820bede0d164ccafcee46ec82

Fix reference count if array used in JIT operations. Previously, when an af::array was used in a JIT operation and it was backed by a buffer, a buffer node was created and the internal shared_ptr was stored in the Array for future use and returned when getNode was called. This increased the reference count of the internal buffer. This reference count never decreased because of the internal reference to the shared_ptr. This commit changes this behavior by creating new buffer nodes for each call to getNode. We use the new hash function to ensure the equality of the buffer nodes when the JIT code is generated. This avoids holding the call_once flag in the buffer object and simplifies the management of the buffer node objects. Additionally, when a JIT node goes out of scope, the reference count decrements as expected.

view details

push time in a month

push event umar456/arrayfire

Umar Arshad

commit sha f6b06b72c53162a3863d5dc54637c793b3616eec

Fix canny by resizing the sigma array to the correct size. The otsuThreshold function was creating an empty Array for the sigmas variable and this sometimes failed because the last value was not always written to. This commit adjusts the size of the sigmas array to better match the values that are assigned to it.

view details

pradeep

commit sha e7f000d9bde36c27a3b7f540f164a9e7687d55f1

Fix edgeTracking CPU kernel to handle batch support. Prior to this change, the edge tracking CPU backend kernel wasn't processing the batch input sets. Thus, the output of the corresponding input sets was missing in the array returned by the canny API. This is fixed now. Added a batch test for this scenario.

view details

pradeep

commit sha 4ea695f9f4a0bcdeddfc9ee0b72b24b30f6c29a8

Improve canny's otsu helper by precomputing some arrays. Co-authored-by: Umar Arshad <umar@arrayfire.com>

view details

Umar Arshad

commit sha 53088f6420e4a7978f34ec23979ba71c9218345c

Add ASSERT_REF to check for reference counts of af::arrays

view details

Umar Arshad

commit sha 063cd01292dad12cd5df54eae624e2708e34f325

Move createBinaryNode to common

view details

Umar Arshad

commit sha aaf92fc823f1912fd9ba5c442f68a1b3242c4bd5

Move cast and castArray to the common directory

view details

Umar Arshad

commit sha ce74c8e20109554d7c26a0c9139a8518fdbdcaab

Use getArray instead of castArray if types are the same in arithOp

view details

Umar Arshad

commit sha 6c8f0eaabfa7324206de0d0297edc71723cfe844

Create a hash function for Node objects for NodeMap. The Node_map_t unordered_map object uses the pointer of the nodes for the key. This worked previously because the buffer node objects tracked the buffer object's shared pointer. This required holding an additional reference to the buffer object when an Array was used in a JIT operation. This did not leak memory because both the buffer and the node were deleted when the Array object was destroyed. This commit creates a new hash function for the node pointers which dereferences the Node pointers and, if they are buffers, checks the buffer's pointer and its offset to determine if it's unique. This approach allows us to remove the call_once construct from the setData member function of the buffer node. You can now create node objects for each invocation of the getNode function.
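The idea of keying the node map on what a buffer node refers to, rather than on the node object's own address, can be illustrated with a toy model. This is a Python sketch and not the ArrayFire C++ code: `BufferNode` and `assign_ids` are hypothetical names, and `id(buffer)` stands in for the buffer's pointer.

```python
class BufferNode:
    """Toy stand-in for a JIT buffer node, identified by buffer and offset."""

    def __init__(self, buffer, offset):
        self.buffer = buffer
        self.offset = offset

    def __hash__(self):
        # Hash the identity of the underlying buffer plus the offset,
        # not the identity of the node object itself.
        return hash((id(self.buffer), self.offset))

    def __eq__(self, other):
        # Two nodes are equal when they view the same buffer at the same offset.
        return (
            isinstance(other, BufferNode)
            and self.buffer is other.buffer
            and self.offset == other.offset
        )


def assign_ids(nodes):
    """Give each distinct node a small integer id, as a node map would."""
    node_map = {}
    for node in nodes:
        node_map.setdefault(node, len(node_map))
    return node_map


if __name__ == "__main__":
    data = bytearray(16)
    first = BufferNode(data, 0)
    second = BufferNode(data, 0)   # a fresh node object for the same buffer
    shifted = BufferNode(data, 4)  # same buffer, different offset
    ids = assign_ids([first, second, shifted])
    print(len(ids), ids[first] == ids[second])
```

Because equality depends only on the buffer and the offset, a node created fresh on every call still maps to the same entry, so no node has to be cached on the array just to keep the map keys stable.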

view details

Umar Arshad

commit sha 319c4110c9e3ca2e22d3d4a9670220a992f376a6

Fix reference count if array used in JIT operations. Previously, when an af::array was used in a JIT operation and it was backed by a buffer, a buffer node was created and the internal shared_ptr was stored in the Array for future use and returned when getNode was called. This increased the reference count of the internal buffer. This reference count never decreased because of the internal reference to the shared_ptr. This commit changes this behavior by creating new buffer nodes for each call to getNode. We use the new hash function to ensure the equality of the buffer nodes when the JIT code is generated. This avoids holding the call_once flag in the buffer object and simplifies the management of the buffer node objects. Additionally, when a JIT node goes out of scope, the reference count decrements as expected.

view details

push time in a month

push event arrayfire/arrayfire

Umar Arshad

commit sha f6b06b72c53162a3863d5dc54637c793b3616eec

Fix canny by resizing the sigma array to the correct size. The otsuThreshold function was creating an empty Array for the sigmas variable and this sometimes failed because the last value was not always written to. This commit adjusts the size of the sigmas array to better match the values that are assigned to it.

view details

pradeep

commit sha e7f000d9bde36c27a3b7f540f164a9e7687d55f1

Fix edgeTracking CPU kernel to handle batch support. Prior to this change, the edge tracking CPU backend kernel wasn't processing the batch input sets. Thus, the output of the corresponding input sets was missing in the array returned by the canny API. This is fixed now. Added a batch test for this scenario.

view details

pradeep

commit sha 4ea695f9f4a0bcdeddfc9ee0b72b24b30f6c29a8

Improve canny's otsu helper by precomputing some arrays. Co-authored-by: Umar Arshad <umar@arrayfire.com>

view details

push time in a month

PR merged arrayfire/arrayfire

Canny empty array fix


Description

The otsuThreshold function was creating an empty Array for the sigmas variable and this sometimes failed because the last value was not always written to. This PR adjusts the size of the array so that the extra element is not used.

This PR also removes a couple of reductions in the otsuThreshold function. These reductions are replaced by a scan.
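As a rough illustration of how a scan can replace repeated reductions in Otsu's method, here is a minimal pure-Python sketch. It is not the ArrayFire implementation: the function name is hypothetical, the input is assumed to be a plain histogram list, and pixels with value at or below the returned bin are treated as the first class.

```python
from itertools import accumulate


def otsu_threshold(hist):
    """Return the bin index that maximizes the between-class variance."""
    if not hist:
        raise ValueError("histogram must not be empty")

    # Two inclusive scans replace a pair of reductions per candidate threshold.
    counts = list(accumulate(hist))
    sums = list(accumulate(i * h for i, h in enumerate(hist)))
    total_count, total_sum = counts[-1], sums[-1]

    best_t, best_var = 0, -1.0
    for t in range(len(hist) - 1):
        w0 = counts[t]              # pixels at or below the threshold
        w1 = total_count - w0       # pixels above the threshold
        if w0 == 0 or w1 == 0:
            continue                # one class is empty, variance undefined
        m0 = sums[t] / w0
        m1 = (total_sum - sums[t]) / w1
        var = w0 * w1 * (m0 - m1) ** 2
        if var > best_var:          # strict comparison keeps the first maximum
            best_t, best_var = t, var
    return best_t


if __name__ == "__main__":
    # Bimodal toy histogram with two clusters of equal weight.
    print(otsu_threshold([5, 5, 0, 0, 0, 0, 5, 5]))
```

Each candidate threshold then needs only two lookups into the scanned arrays instead of summing the histogram again.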

Changes to Users

<!--

  • Additional options added to the build.
  • What changes will existing users have to make to their code or build steps? Refer to wiki for development guidelines -->

Checklist


  • [x] Rebased on latest master
  • [x] Code compiles
  • [x] Tests pass
  • ~[ ] Functions added to unified API~
  • [x] Functions documented
+113 -75

0 comment

3 changed files

umar456

pr closed time in a month

push event umar456/arrayfire

Umar Arshad

commit sha 1b9536668d27c25929d5da52feaaa3907f8fba10

Create ASSERT_IMAGE_NEAR which compares two images for equality. Add an image comparison assertion to the tests that compares two images and, if there is an error, uploads the result and the gold image to CDash for comparison. Useful for when image tests fail.

view details

pradeep

commit sha 7ddf462fd8ac3e80ac665d490602d9e8cec4c9be

Improve Readme (#3168) * Update README's: Prelude, Acknowledgement, Citations & Copyright Sections Increase image size Co-authored-by: John Melonakos <john@arrayfire.com> Co-authored-by: syurkevi <stefan@arrayfire.com> Co-authored-by: Umar Arshad <umar@arrayfire.com>

view details

Umar Arshad

commit sha 9ae8b7c8cd7885b296150a0af5ee3075cbc9c45d

Fix canny by resizing the sigma array to the correct size. The otsuThreshold function was creating an empty Array for the sigmas variable and this sometimes failed because the last value was not always written to. This commit adjusts the size of the sigmas array to better match the values that are assigned to it.

view details

pradeep

commit sha 738cb277c3ad2ea8ab969a284642988d707430f0

Fix edgeTracking CPU kernel to handle batch support. Prior to this change, the edge tracking CPU backend kernel wasn't processing the batch input sets. Thus, the output of the corresponding input sets was missing in the array returned by the canny API. This is fixed now. Added a batch test for this scenario.

view details

pradeep

commit sha 75bc1d54e92683cb83890d6a779c0362edba3af7

Improve canny's otsu helper by precomputing some arrays. Co-authored-by: Umar Arshad <umar@arrayfire.com>

view details

Umar Arshad

commit sha 95d11c3c1d19373cbdc4a2326808361abedc7fa6

Add ASSERT_REF to check for reference counts of af::arrays

view details

Umar Arshad

commit sha bfb3557c3d466c62530c87a13e2a974adc2067c2

Move createBinaryNode to common

view details

Umar Arshad

commit sha a1de6ac997b153919917a5bf0c38c05e4907c56a

Move cast and castArray to the common directory

view details

Umar Arshad

commit sha da6518ead408d1a10798597333b7ff923f88941d

Use getArray instead of castArray if types are the same in arithOp

view details

Umar Arshad

commit sha 0465a6b47097158fd42e00b6bc34b4d3ad44ebf8

Create a hash function for Node objects for NodeMap. The Node_map_t unordered_map object uses the pointer of the nodes for the key. This worked previously because the buffer node objects tracked the buffer object's shared pointer. This required holding an additional reference to the buffer object when an Array was used in a JIT operation. This did not leak memory because both the buffer and the node were deleted when the Array object was destroyed. This commit creates a new hash function for the node pointers which dereferences the Node pointers and, if they are buffers, checks the buffer's pointer and its offset to determine if it's unique. This approach allows us to remove the call_once construct from the setData member function of the buffer node. You can now create node objects for each invocation of the getNode function.

view details

Umar Arshad

commit sha 5e4d0344935629a0e3a796f47c293e58bf33385b

Fix reference count if array used in JIT operations. Previously, when an af::array was used in a JIT operation and it was backed by a buffer, a buffer node was created and the internal shared_ptr was stored in the Array for future use and returned when getNode was called. This increased the reference count of the internal buffer. This reference count never decreased because of the internal reference to the shared_ptr. This commit changes this behavior by creating new buffer nodes for each call to getNode. We use the new hash function to ensure the equality of the buffer nodes when the JIT code is generated. This avoids holding the call_once flag in the buffer object and simplifies the management of the buffer node objects. Additionally, when a JIT node goes out of scope, the reference count decrements as expected.

view details

push time in a month

push event arrayfire/arrayfire

pradeep

commit sha 7ddf462fd8ac3e80ac665d490602d9e8cec4c9be

Improve Readme (#3168) * Update README's: Prelude, Acknowledgement, Citations & Copyright Sections Increase image size Co-authored-by: John Melonakos <john@arrayfire.com> Co-authored-by: syurkevi <stefan@arrayfire.com> Co-authored-by: Umar Arshad <umar@arrayfire.com>

view details

push time in a month

PR merged arrayfire/arrayfire

Improve Readme Documentation
+140 -116

0 comment

1 changed file

9prady9

pr closed time in a month

PullRequestReviewEvent

push event 9prady9/arrayfire

Umar Arshad

commit sha b6049e74f0cf3eeb248471b8f8b0cd6e70e05dd5

Move code of conduct down. Add usage functions to example. Image++ Increase image size

view details

push time in a month

push event 9prady9/arrayfire

Umar Arshad

commit sha b9ca6993e1debec46cdf67afa2a94c306feec607

Move code of conduct down. Add usage functions to example. Image++ Increase image size

view details

push time in a month

push event 9prady9/arrayfire

Umar Arshad

commit sha 80fb0be466b0ae608acec767412eb6b8eb820806

Move code of conduct down. Add usage functions to example. Image++ Increase image size

view details

push time in a month