
KhronosGroup/glTF 3698

glTF – Runtime 3D Asset Delivery

KhronosGroup/glslang 1384

Khronos-reference front end for GLSL/ESSL, partial front end for HLSL, and a SPIR-V generator.

google/gemmlowp 1177

Low-precision matrix multiplication

KhronosGroup/KTX-Software 184

KTX (Khronos Texture) Library and Tools

KhronosGroup/siggraph2012course 50

Presentations for SIGGRAPH 2012 course "Graphics Programming on the Web" covering HTML5 technologies (Canvas, CSS, etc.), WebGL and WebCL

KhronosGroup/SPIR-Tools 35

SPIR-Tools

KhronosGroup/COLLADA-CTS 28

Welcome to the COLLADA Conformance Test Suite

bjacob/webgl-perf-tests 3

WebGL performance regression tests

issue comment tensorflow/tensorflow

Android TFlite inconsistent performance when app is not in focus.

Regarding the previous 3 comments, the explanation is that tflite uses the ruy thread-pool directly in places like this CustomGemv, via the cpu_backend_threadpool facility: https://github.com/tensorflow/tensorflow/blob/9c7d78bce12a6374748ce84dd9a4fc0a37606775/tensorflow/lite/kernels/cpu_backend_gemm_custom_gemv.h#L179

When tflite_with_ruy=true (the default on ARM), cpu_backend_threadpool maps onto the ruy thread-pool, https://github.com/tensorflow/tensorflow/blob/9c7d78bce12a6374748ce84dd9a4fc0a37606775/tensorflow/lite/kernels/cpu_backend_threadpool.h#L32-L41

That, in turn, ends up in the wait.cc code that I linked above.

So as far as I can see, everything here is still perfectly consistent with the theory proposed above. It would be very useful to invest in the project described above to expose direct control over the spin duration to the application, with a clear explanation of the compromises involved, and some acknowledgement that there is a fundamental conflict between a mobile OS conserving battery charge and allowing background activities to have latency as low as foreground activities.

Artaches

comment created time in 22 days

MemberEvent

issue comment tensorflow/tensorflow

Android TFlite inconsistent performance when app is not in focus.

Thanks for running this experiment. ~1.6x is sufficiently close to ~1x that we can draw the following conclusion about the main phenomenon at play here --- there may well be additional factors, but this should be the main one, especially if you find that increasing this duration beyond 20ms further reduces the inference times.

On a mobile system like Android, the OS aggressively throttles the computing power available to applications in order to preserve battery charge. The throttling is even more aggressive for applications that aren't in the foreground. So as a first approximation you could say that everything here is "working as intended" from the perspective of the OS.

In multithreaded inference, TensorFlow Lite uses an internal pool of "worker" threads. After a thread is done with its current task, it has to wait for the next task to come up. The standard way of doing that, "condition-variable waiting", informs the OS that the thread is waiting. That's good because it helps conserve battery charge. But Android is so aggressive at reducing CPU clock speeds and even turning CPU cores off altogether, and then so conservative at ramping them back up, that entering that waiting state carries a very large penalty for the next workload: it may take 100ms or more to get back to high clock speeds. That is why the above-linked wait.cc code uses some "spin-waiting", i.e. just running a loop until a certain number of milliseconds have elapsed, to avoid the worst of that issue.
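
To make these two waiting strategies concrete, here is a minimal sketch of the "spin, then block" idea. This is illustrative only, not the actual wait.cc code; all names here are made up.

```cpp
#include <chrono>
#include <condition_variable>
#include <functional>
#include <mutex>

// Sketch of a "spin, then block" wait (not the real wait.cc implementation).
void WaitUntil(const std::function<bool()>& condition,
               std::chrono::milliseconds spin_duration,
               std::condition_variable* condvar, std::mutex* mutex) {
  // Phase 1: spin-wait. The loop keeps the CPU core busy, so the OS keeps it
  // at high clock speed, at the cost of burning power while we wait.
  const auto deadline = std::chrono::steady_clock::now() + spin_duration;
  while (std::chrono::steady_clock::now() < deadline) {
    if (condition()) return;
  }
  // Phase 2: condition-variable wait. This tells the OS the thread is idle,
  // which saves power but lets the OS throttle or park the core, making the
  // next wake-up slow.
  std::unique_lock<std::mutex> lock(*mutex);
  condvar->wait(lock, condition);
}
```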

The problem, of course, is that there is no ideal choice for this "spin duration": it's a compromise between wasting battery charge and suffering very high latency penalties. The current 2ms value was empirically determined from a few benchmarks, but the right value depends not only on the TFLite model (governing the time gaps between matrix multiplications within an inference) but also on application-level usage patterns (governing the time gaps between inferences). 2ms was a conservative choice to avoid wasting too much battery charge, but the gap between this 2ms on the one hand and the ~100ms it takes to ramp back up to high clock speeds on the other is so large that there is lots of room for variation among applications as to what's the best choice here.

That is why I believe the best course of action here would be to make this time duration runtime-configurable:

  1. in ruy, make this configurable, presumably by exposing a control on ruy::Context.
  2. in TFLite, expose this through CpuBackendContext, then on the user-facing TFLite API (TfLiteInterpreter?); a rough sketch of what this could look like follows below.
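
For illustration only, here is a rough sketch of the kind of knob I have in mind. None of these names exist today; the whole class below is hypothetical.

```cpp
#include <chrono>

// Hypothetical sketch of the proposed knob; no such API exists in ruy or
// TFLite today. The idea is just that the spin duration becomes a
// per-context setting instead of a hard-coded 2ms constant.
class InferenceContext {
 public:
  void set_spin_wait_duration(std::chrono::milliseconds d) {
    spin_wait_duration_ = d;
  }
  std::chrono::milliseconds spin_wait_duration() const {
    return spin_wait_duration_;
  }

 private:
  // Default mirrors the current hard-coded value.
  std::chrono::milliseconds spin_wait_duration_{2};
};

int main() {
  InferenceContext ctx;
  // A latency-sensitive application could opt into longer spinning,
  // accepting the battery cost:
  ctx.set_spin_wait_duration(std::chrono::milliseconds(20));
}
```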

I had filed an internal issue (b/135595069) - @jdduke @talumbau please consider staffing.

Artaches

comment created time in a month

issue comment tensorflow/tensorflow

Android TFlite inconsistent performance when app is not in focus.

Could you please try the following (with ruy; i.e. in this experiment, do NOT build with --define=tflite_with_ruy=false):

Edit this line: https://github.com/tensorflow/tensorflow/blob/e6983d538f140552d18fff6377a80c45236f1e36/tensorflow/lite/experimental/ruy/wait.cc#L66

Increase the number of milliseconds here: change the current value of 2 to something like 20 instead. Does this help?
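
In other words, the experiment is just bumping that constant, roughly like this (a paraphrase; the actual name of the constant in wait.cc may differ):

```cpp
// Paraphrase of the suggested one-line change in wait.cc, not the exact source:
constexpr int kMaxBusyWaitMilliseconds = 20;  // was 2
```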

Artaches

comment created time in a month

push event google/gemmlowp

Lamar

commit sha 2483d846ad865dd4190fe4a1a1ba2d9cfcea78e1

added template specialization for int8_t saturating add

view details

push time in a month

PR merged google/gemmlowp

Added the Fixedpoint Saturating Add Template Specialization

I wanted to use this in TensorFlow Lite but found out it was only implemented for Int16. I added the Int8 equivalent.
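
Roughly, the added specialization looks like this (a simplified sketch, not the exact merged code, which follows the style of the existing int16 specialization in fixedpoint.h):

```cpp
#include <algorithm>
#include <cstdint>

// Sketch of an int8_t saturating add, analogous to gemmlowp's existing
// int16_t SaturatingAdd specialization. Simplified; not the exact merged code.
inline std::int8_t SaturatingAddInt8(std::int8_t a, std::int8_t b) {
  const std::int16_t sum =
      static_cast<std::int16_t>(a) + static_cast<std::int16_t>(b);
  return static_cast<std::int8_t>(
      std::min<std::int16_t>(127, std::max<std::int16_t>(-128, sum)));
}
```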

+12 -1

0 comment

1 changed file

lamarrr

pr closed time in a month

push event google/gemmlowp

Lamar

commit sha fcdb1b3f8a492b461e9e7b2458bdad3c2cd1a405

fixed doc comments for Mul and Sub

view details

push time in a month

PR merged google/gemmlowp

fixed doc comments for Mul and Sub

The doc comments were misplaced, and there was none for Mul.

+2 -1

0 comment

1 changed file

lamarrr

pr closed time in a month

issue comment google/gemmlowp

Issues compiling for bare metal application

Does TFLite for Microcontrollers meet your needs? https://www.tensorflow.org/lite/microcontrollers/get_started Either way, you should ask on a TFLite support channel: even if you overcame any issues with gemmlowp (just getting rid of a unistd.h include would be fairly easy), you would run into more issues with TFLite, which is much larger and has more OS dependencies. TFLite for Microcontrollers is meant to address this class of use cases.

wrd90

comment created time in a month

issue closed google/gemmlowp

Issues compiling for bare metal application

Hi,

I am trying to compile TFLite for a bare-metal application, and have run into issues with gemmlowp while doing so. My target platform does not have unistd.h; can anyone help me find a workaround?

closed time in a month

wrd90

pull request comment tensorflow/tensorflow

TANH/Sigmoid 16-bit activation functions using LUT

Can you share benchmark results showing how this compares to the current 16-bit fixed-point implementations on various ARM CPUs?

Because pure arithmetic implementations scale linearly with SIMD width and lookup tables do not, we have typically found that, despite looking expensive in source code, pure arithmetic implementations can actually perform well. As ARM CPUs' SIMD width increases, they keep looking better. For example, on a Cortex-A76 CPU (e.g. Pixel 4) we can issue 2 128-bit multiplications each cycle, each doing 8 16-bit fixed-point multiplications, so each fixed-point multiplication effectively only costs one sixteenth of a cycle.
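
Spelling out that arithmetic:

```latex
2 \times \frac{128\ \text{bits}}{16\ \text{bits per lane}}
  = 16\ \text{fixed-point multiplications per cycle}
\quad\Longrightarrow\quad
\text{cost per multiplication} \approx \tfrac{1}{16}\ \text{cycle}
```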

As this varies greatly from one CPU to another, we don't expect one approach (arithmetic vs. LUT) to consistently outperform the other. Instead, we have started to let both coexist in the source code. We still need to pick one to be used by default, but at least having both in source code makes it easy to switch.

I don't remember the latest status, but some of the following people might: @renjie-liu @abattery @jianlijianli

wwwind

comment created time in 3 months

issue comment google/gemmlowp

SIMD back-end for IBM Power and Z

TFLite has already switched to ruy on arm64. There is work underway on arm32 and x86. However, given the complex landscape of inference backends at the moment, it's hard to guess what TFLite will end up using. Over the next few months these things should settle a bit.

geert56

comment created time in 3 months

issue comment google/gemmlowp

SIMD back-end for IBM Power and Z

It'd be interesting to hear whether there is interest from other users; feel free to use the gemmlowp Google group to reach more people. As far as Google is concerned, we are in the process of migrating to a successor of gemmlowp named ruy, currently living as a subdirectory of TensorFlow, https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/experimental/ruy , but set to move to its own GitHub tree in the near future. It's not 100% ready for wider contribution yet, as documentation is missing, but this is just a heads-up.

geert56

comment created time in 3 months

CommitCommentEvent

push event google/gemmlowp

Benoit Jacob

commit sha 08c5dd30c8216848ad7e4b56c5fc437b6bb0578c

fix recent regression

view details

push time in 3 months

push event google/gemmlowp

Clement Courbet

commit sha e25e353b9a8342f21946cad161a7f3495dddcbae

StoreFinalOutputImpl::Run is reloading MatrixMap::data_ and MatrixMap::stride_ in a loop. The specialization for RegBlockUint8<8, 8> is the hottest one for our data based on profiling. Specialize it for MatrixMap and make a local copy of the destination so that the compiler can prove that data_ aliases neither &data_ nor &stride_.

This makes small model & GoogLeNet GEMMs about 1% faster.

Profiling shows that this mostly comes from "unpack to row-major" being faster:

Before:
gemmlowp profile (1 threads, 9469 samples)
94.31% gemmlowp::MultiThreadGemm
    94.30% gemmlowp::SingleThreadGemm
        73.97% compute
            62.39% optimized kernel
            11.57% other
        8.73% pack LHS
        6.51% unpack to column-major
        4.95% unpack to row-major
        0.13% pack RHS
        0.01% other
    0.01% other
5.69% other (outside of any label)

After:
93.89% gemmlowp::MultiThreadGemm
    93.89% gemmlowp::SingleThreadGemm
        74.80% compute
            61.82% optimized kernel
            12.98% other
        9.03% pack LHS
        6.24% unpack to column-major
        3.68% unpack to row-major
        0.12% pack RHS
        0.02% other
    0.00% other
6.11% other (outside of any label)

view details

push time in 3 months

PR merged google/gemmlowp

StoreFinalOutputImpl::Run is reloading MatrixMap::data_ and

MatrixMap::stride_ in a loop.

The specialization for RegBlockUint8<8, 8> is the hottest one for our data based on profiling. Specialize it for MatrixMap and make a local copy of the destination so that the compiler can prove that data_ aliases neither &data_ nor &stride_.
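To illustrate the aliasing issue, here is a hypothetical stand-in (not the actual gemmlowp code): because the destination is written through uint8_t, which may alias anything, the compiler must assume each store could modify data_ or stride_ unless local copies are made.

```cpp
#include <cstdint>

// Hypothetical stand-in for gemmlowp's MatrixMap, for illustration only.
struct DstMap {
  std::uint8_t* data_;
  int stride_;
};

// Before: as far as the compiler knows, each uint8_t store may overwrite
// dst->data_ or dst->stride_, so both are reloaded on every iteration.
void StoreSlow(DstMap* dst, const std::uint8_t* src, int n) {
  for (int i = 0; i < n; ++i) {
    dst->data_[i * dst->stride_] = src[i];
  }
}

// After: local copies whose addresses are never taken cannot be modified by
// the stores, so the compiler keeps them in registers for the whole loop.
void StoreFast(DstMap* dst, const std::uint8_t* src, int n) {
  std::uint8_t* const data = dst->data_;
  const int stride = dst->stride_;
  for (int i = 0; i < n; ++i) {
    data[i * stride] = src[i];
  }
}
```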

This makes small model & GoogLeNet GEMMs about 1% faster.

Profiling shows that this mostly comes from "unpack to row-major" being faster:

Before:
gemmlowp profile (1 threads, 9469 samples)
94.31% gemmlowp::MultiThreadGemm
    94.30% gemmlowp::SingleThreadGemm
        73.97% compute
            62.39% optimized kernel
            11.57% other
        8.73% pack LHS
        6.51% unpack to column-major
        4.95% unpack to row-major
        0.13% pack RHS
        0.01% other
    0.01% other
5.69% other (outside of any label)

After:
93.89% gemmlowp::MultiThreadGemm
    93.89% gemmlowp::SingleThreadGemm
        74.80% compute
            61.82% optimized kernel
            12.98% other
        9.03% pack LHS
        6.24% unpack to column-major
        3.68% unpack to row-major
        0.12% pack RHS
        0.02% other
    0.00% other
6.11% other (outside of any label)

+21 -0

0 comment

1 changed file

legrosbuffle

pr closed time in 3 months

issue closed google/gemmlowp

How to calculate non-linear function after 8bit quantization ?

For 8-bit quantization, a zero point and scale are applied. But in a non-linear function layer, I want to know whether I can process the input data without converting it to a real number. Or is there some way to calibrate? Please answer my question.

Non-linear functions: tanh / sigmoid / softmax / exp(x)

closed time in 3 months

youck9224

issue comment google/gemmlowp

How to calculate non-linear function after 8bit quantization ?

gemmlowp provides a fixed-point computation library for this purpose. It contains implementations of exp, tanh and logistic functions. See: https://github.com/google/gemmlowp/blob/master/fixedpoint/fixedpoint.h

For Softmax, an implementation is found in TensorFlow Lite here, built on top of the aforementioned gemmlowp fixedpoint library: https://github.com/tensorflow/tensorflow/blob/305fec9fddc3bdb5bb574a134b955bf4b07fd795/tensorflow/lite/kernels/internal/reference/softmax.h#L60
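
A rough usage sketch of the fixed-point library (from memory; see fixedpoint.h for the exact signatures and the correct include path):

```cpp
#include <cstdint>

#include "fixedpoint/fixedpoint.h"  // path relative to the gemmlowp checkout

// FixedPoint<int32_t, kIntegerBits> has kIntegerBits integer bits and the
// remaining bits fractional. tanh/logistic return FixedPoint<int32_t, 0>,
// i.e. a Q0.31 value in [-1, 1).
std::int32_t TanhQ0_31(std::int32_t x_raw_q3_28) {
  using F3 = gemmlowp::FixedPoint<std::int32_t, 3>;  // input in roughly [-8, 8)
  using F0 = gemmlowp::FixedPoint<std::int32_t, 0>;  // output in [-1, 1)
  const F3 x = F3::FromRaw(x_raw_q3_28);
  const F0 y = gemmlowp::tanh(x);
  return y.raw();
}
```

Everything stays in the quantized integer domain; no conversion to floating point is needed.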

youck9224

comment created time in 3 months

push event google/gemmlowp

Nick Kreeger

commit sha 6a007c79803e275966472caf96df58dd8e35b48b

Add some extra documentation to the magic numbers used in exp_on_negative_values().

view details

push time in 3 months
