
jianlijianli/kissfft 0

a Fast Fourier Transform (FFT) library that tries to Keep it Simple, Stupid

Pull request review comment tensorflow/tensorflow

TANH/Sigmoid 16-bit activation functions using LUT

 TfLiteStatus SigmoidEval(TfLiteContext* context, TfLiteNode* node) {
     }
     case kTfLiteInt16: {
       LogisticParams params;
-      if (kernel_type == kReference) {
-        reference_ops::Logistic(
-            params, GetTensorShape(input), GetTensorData<int16_t>(input),
-            GetTensorShape(output), GetTensorData<int16_t>(output));
-      } else {
+      if (kernel_type == kFixedPointOptimized) {
         optimized_ops::Logistic(
             params, GetTensorShape(input), GetTensorData<int16_t>(input),
             GetTensorShape(output), GetTensorData<int16_t>(output));
+      } else {
+        EvalUsingLookupTableSigmoid16Bit(data, input, output);

Thanks for understanding the restrictions.

We all appreciate the improved accuracy this PR brings. What you mentioned works, so please go ahead with the plan (either here or in another PR, whichever you feel more comfortable with).

Thanks!

wwwind

comment created time in 4 days

Pull request review comment tensorflow/tensorflow

TANH/Sigmoid 16-bit activation functions using LUT

 struct OpData {
   int input_left_shift = 0;
   int32_t input_range_radius = 0;
   int diff_min = 0;
-  uint8_t table[256] = {0};
+  uint16_t table[256] = {0};

Thanks for making the change.

wwwind

comment created time in 6 days

Pull request review comment tensorflow/tensorflow

TANH/Sigmoid 16-bit activation functions using LUT

 TfLiteStatus SigmoidEval(TfLiteContext* context, TfLiteNode* node) {
     }
     case kTfLiteInt16: {
       LogisticParams params;
-      if (kernel_type == kReference) {
-        reference_ops::Logistic(
-            params, GetTensorShape(input), GetTensorData<int16_t>(input),
-            GetTensorShape(output), GetTensorData<int16_t>(output));
-      } else {
+      if (kernel_type == kFixedPointOptimized) {
         optimized_ops::Logistic(
             params, GetTensorShape(input), GetTensorData<int16_t>(input),
             GetTensorShape(output), GetTensorData<int16_t>(output));
+      } else {
+        EvalUsingLookupTableSigmoid16Bit(data, input, output);

Thanks Anton. I see your point and it makes a lot of sense. The only concern I have is that there are already models that run with the 16-bit builtin Tanh/Sigmoid ops. This change will make them more accurate but at the same time slower, which can be an issue.

May I suggest the following? We can submit the code but only enable it in the reference case. Once the optimization is finished, we can flip the switch so existing users can get better accuracy without performance impact.

Does that sound good? Thanks!
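
Roughly what I have in mind, reusing the names from the diff above (just a sketch, not a prescription for the exact enum values):

 case kTfLiteInt16: {
   if (kernel_type == kReference) {
     // New, more accurate LUT path; reference-only until it is optimized.
     EvalUsingLookupTableSigmoid16Bit(data, input, output);
   } else {
     // Existing fixed-point path keeps today's performance for the
     // optimized kernels.
     LogisticParams params;
     optimized_ops::Logistic(
         params, GetTensorShape(input), GetTensorData<int16_t>(input),
         GetTensorShape(output), GetTensorData<int16_t>(output));
   }
   break;
 }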

wwwind

comment created time in 6 days

Pull request review comment tensorflow/tensorflow

TANH/Sigmoid 16-bit activation functions using LUT

 TfLiteStatus SigmoidEval(TfLiteContext* context, TfLiteNode* node) {
     }
     case kTfLiteInt16: {
       LogisticParams params;
-      if (kernel_type == kReference) {
-        reference_ops::Logistic(
-            params, GetTensorShape(input), GetTensorData<int16_t>(input),
-            GetTensorShape(output), GetTensorData<int16_t>(output));
-      } else {
+      if (kernel_type == kFixedPointOptimized) {
         optimized_ops::Logistic(
             params, GetTensorShape(input), GetTensorData<int16_t>(input),
             GetTensorShape(output), GetTensorData<int16_t>(output));
+      } else {
+        EvalUsingLookupTableSigmoid16Bit(data, input, output);

Thanks. I was not in some of the discussions, so I might be missing context here. Isn't it safer to change just the reference code and leave the two optimized paths unchanged?

wwwind

comment created time in 7 days

Pull request review comment tensorflow/tensorflow

TANH/Sigmoid 16-bit activation functions using LUT

 TfLiteStatus SigmoidPrepare(TfLiteContext* context, TfLiteNode* node) {
       PopulateLookupTable<int8_t>(data, input, output, [](float value) {
         return 1.0f / (1.0f + std::exp(-value));
       });
+    } else if (input->type == kTfLiteInt16) {
+      TF_LITE_ENSURE(context, output->params.scale == 1. / 32768);
+      TF_LITE_ENSURE(context, output->params.zero_point == 0.);

I think integer "0" is sufficient here.
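
i.e. something like (sketch):

 TF_LITE_ENSURE(context, output->params.zero_point == 0);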

wwwind

comment created time in 11 days

Pull request review comment tensorflow/tensorflow

TANH/Sigmoid 16-bit activation functions using LUT

 TfLiteStatus SigmoidEval(TfLiteContext* context, TfLiteNode* node) {
     }
     case kTfLiteInt16: {
       LogisticParams params;
-      if (kernel_type == kReference) {
-        reference_ops::Logistic(
-            params, GetTensorShape(input), GetTensorData<int16_t>(input),
-            GetTensorShape(output), GetTensorData<int16_t>(output));
-      } else {
+      if (kernel_type == kFixedPointOptimized) {
         optimized_ops::Logistic(
             params, GetTensorShape(input), GetTensorData<int16_t>(input),
             GetTensorShape(output), GetTensorData<int16_t>(output));
+      } else {
+        EvalUsingLookupTableSigmoid16Bit(data, input, output);

Just to make sure I'm understanding this correctly: this changes TFLite's builtin 16-bit sigmoid (the "kGenericOptimized" kernel) from optimized_ops::Logistic() to EvalUsingLookupTableSigmoid16Bit, which is more accurate. And EvalUsingLookupTableSigmoid16Bit is not vectorized yet (unlike EvalUsingLookupTable).

Will "EvalUsingLookupTableSigmoid16Bit" be optimized in the future?

wwwind

comment created time in 11 days

Pull request review comment tensorflow/tensorflow

TANH/Sigmoid 16-bit activation functions using LUT

 struct OpData {
   int input_left_shift = 0;
   int32_t input_range_radius = 0;
   int diff_min = 0;
-  uint8_t table[256] = {0};
+  uint16_t table[256] = {0};

Just to double check, is this safe for the int8/uint8 path, where EvalUsingLookupTable loads the table with "table[0] = vld1q_u8_x4(data->table + 16 * 4 * 0)"?
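
Spelling out the concern, purely for illustration (this assumes the int8/uint8 path still expects 256 one-byte entries; not a claim about the actual PR code):

 #include <arm_neon.h>
 #include <cstdint>

 // If OpData::table becomes uint16_t[256] (512 bytes), the existing NEON load
 // would need a cast and would then be reading raw bytes of 16-bit entries
 // rather than 256 uint8 values.
 uint8x16x4_t LoadFirstTableChunk(const uint16_t table[256]) {
   return vld1q_u8_x4(reinterpret_cast<const uint8_t*>(table) + 16 * 4 * 0);
 }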

wwwind

comment created time in 11 days

Pull request review comment tensorflow/tensorflow

[TFLite int16] Symmetric 16-bit activations and 8-bit weights: reference kernel CONV_2D

 inline int32 MultiplyByQuantizedMultiplier(int32 x, int32 quantized_multiplier,
                             right_shift);
 }
 
+inline int32 MultiplyByQuantizedMultiplier(std::int64_t x,
+                                           int32 quantized_multiplier,
+                                           int shift) {
+  // Inputs:
+  // - quantized_multiplier has fixed point at bit 31
+  // - shift is -31 to +7 (negative for right shift)
+  //
+  // Assumptions: The following input ranges are assumed
+  // - quantize_scale>=0  (the usual range is (1<<30) to (1>>31)-1)
+  // - scaling is chosen so final scaled result fits in int32
+  // - input x is in the range -(1<<47) <= x < (1<<47)

Just for my understanding, what's the interaction between this range and the shift?

y = x * multiplier * 2^(-31 + shift), and multiplier is in [2^30, 2^31). If shift is as high as 7, multiplier * 2^(-31 + shift) is within [2^6, 2^7). Multiplied by x, where -(1<<47) <= x < (1<<47), it seems possible to overflow the int32 output.

Did I miss anything? It very likely works for real use cases (as shown in the test cases), but I just want to double check. Thanks.
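
To make the numbers concrete, here is a hypothetical worst case (values picked by me for illustration, not taken from the PR or its tests):

 #include <cmath>
 #include <cstdio>

 int main() {
   const double x = std::ldexp(1.0, 47);                 // top of the stated input range
   const double multiplier = std::ldexp(1.0, 31) - 1.0;  // (1 << 31) - 1
   const int shift = 7;                                   // maximum allowed shift
   const double y = x * multiplier * std::ldexp(1.0, -31 + shift);
   // Prints roughly 1.8e16, i.e. about 2^54 -- far beyond int32's ~2.1e9 limit.
   std::printf("worst-case y = %g\n", y);
   return 0;
 }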

wwwind

comment created time in 19 days

Pull request review comment tensorflow/tensorflow

Add 16 bit support to kernel operator SOFTMAX

 inline void Softmax(const SoftmaxParams& params,
   const int depth =
       MatchingDim(input_shape, trailing_dim, output_shape, trailing_dim);
 
+  int32 min_num = std::numeric_limits<T>::min();

Since we are using int32 to hold the min/max of "T", can we add a check that only int8 and int16 are supported in the kernel?
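
For example, something along these lines inside the templated kernel (a sketch; it uses the std fixed-width types rather than whatever aliases the file already defines):

 #include <type_traits>

 static_assert(std::is_same<T, int8_t>::value || std::is_same<T, int16_t>::value,
               "Softmax only supports int8 and int16 input/output here.");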

psunn

comment created time in a month

Pull request review comment tensorflow/tensorflow

Add 16 bit support to kernel operator SOFTMAX

 limitations under the License.
 namespace tflite {
 namespace reference_integer_ops {
 
-// Quantized softmax with int8 input and output.
+template<typename T> struct OutputParams {};
+template<> struct OutputParams<int8> {
+  static const int output_num_bits = 8;
+  static const int32 output_min = std::numeric_limits<int8>::min();
+};
+template<> struct OutputParams<int16> {
+  static const int output_num_bits = 16 - 1;

What is the output scale of the int16 softmax?

psunn

comment created time in a month

Pull request review comment tensorflow/tensorflow

Add 16 bit support to kernel operator SOFTMAX

 class BaseActivationsOpModel : public SingleOpModel {
       output_ = AddOutput({input.type, {}, 0, 0, 1. / 256});
     } else if (input.type == TensorType_INT8) {
       output_ = AddOutput({TensorType_INT8, {}, 0, 0, 1. / 256, -128});
+    } else if (input.type == TensorType_INT16) {
+      output_ = AddOutput({TensorType_INT16, {}, 0, 0,
+                           1.0f/(std::numeric_limits<int16_t>::max() + 1), 0});

Nit: maybe just use 32768 to match the other branches?
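
i.e. something like (sketch):

 output_ = AddOutput({TensorType_INT16, {}, 0, 0, 1. / 32768, 0});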

psunn

comment created time in a month

pull request comment tensorflow/tensorflow

TANH/Sigmoid 16-bit activation functions using LUT

Cache aliasing also plays a key role in the performance of a LUT, and that is really hardware dependent. It seems the best option is for the two implementations to "coexist", as Benoit mentioned in his comment.

wwwind

comment created time in 2 months

Pull request review comment tensorflow/tensorflow

[tflite] add int8 input/output to label_image

 extern bool input_floating;
 template <class T>
 void get_top_n(T* prediction, int prediction_size, size_t num_results,
                float threshold, std::vector<std::pair<float, int>>* top_results,
-               bool input_floating) {
+               int input_type) {

Can we use an enum?
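
For example, the signature could take the existing TfLiteType enum from the TFLite C API instead of an int (just a sketch of the declaration):

 template <class T>
 void get_top_n(T* prediction, int prediction_size, size_t num_results,
                float threshold, std::vector<std::pair<float, int>>* top_results,
                TfLiteType input_type);  // kTfLiteFloat32 / kTfLiteUInt8 / kTfLiteInt8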

freedomtan

comment created time in 3 months

Pull request review comment tensorflow/tensorflow

[tflite] add int8 input/output to label_image

 extern bool input_floating;
 template <class T>
 void get_top_n(T* prediction, int prediction_size, size_t num_results,
                float threshold, std::vector<std::pair<float, int>>* top_results,
-               bool input_floating) {
+               int input_type) {
   // Will contain top N results in ascending order.
   std::priority_queue<std::pair<float, int>, std::vector<std::pair<float, int>>,
                       std::greater<std::pair<float, int>>>
       top_result_pq;
 
   const long count = prediction_size;  // NOLINT(runtime/int)
+  float value = 0.0;
+
   for (int i = 0; i < count; ++i) {
-    float value;
-    if (input_floating)
-      value = prediction[i];
-    else
-      value = prediction[i] / 255.0;
+    switch (input_type) {
+      case kTfLiteFloat32:
+        value = prediction[i];
+        break;
+      case kTfLiteInt8:
+        value = (prediction[i] + 128) / 256.0;

Why is this 256 while the uint8 case is 255?

freedomtan

comment created time in 3 months

Pull request review comment tensorflow/tensorflow

[tflite] add int8 input/output to label_image

 void resize(T* out, uint8_t* in, int image_height, int image_width,
   auto output_number_of_pixels = wanted_height * wanted_width * wanted_channels;
 
   for (int i = 0; i < output_number_of_pixels; i++) {
-    if (s->input_floating)
-      out[i] = (output[i] - s->input_mean) / s->input_std;
-    else
-      out[i] = (uint8_t)output[i];
+    switch (s->input_type) {
+      case kTfLiteFloat32:
+        out[i] = (output[i] - s->input_mean) / s->input_std;
+        break;
+      case kTfLiteInt8:
+        out[i] = output[i] - 128;
+        break;
+      case kTfLiteUInt8:
+        out[i] = output[i];

Should we keep the explicit cast?
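
i.e. keep something like (sketch):

 case kTfLiteUInt8:
   out[i] = static_cast<uint8_t>(output[i]);
   break;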

freedomtan

comment created time in 3 months

issue comment tensorflow/tensorflow

how to build tensorflow lite static library for android use ndk?

Hi mlinxiang, thanks for using TensorFlow Lite.

For your use case, could you please try to follow https://www.tensorflow.org/lite/guide/android (and https://github.com/tensorflow/examples/blob/master/lite/examples/image_classification/android/README.md)? That should create a standard image classification model.

Thanks.

mlinxiang

comment created time in 3 months

issue comment tensorflow/tensorflow

TensorFlow Lite schema updater loses floating-point precision

Hi gargn@, could you please help answer this question? It's about conversion between different versions of the schema. Thanks.

vmarkovtsev

comment created time in 3 months

issue comment tensorflow/tensorflow

Concat within tf.range loop throwing error when converting to tflite: StatelessWhile custom implementation needed

Adding haozha111 to triage the issue since this involves converting loops. Thanks.

m-ciniello

comment created time in 3 months

pull request comment tensorflow/tensorflow

Fixed all the warning in the flow for TC match_dilated_convolution_test

The change looks good to me; adding @impjdi to double-check.

amitsrivastava78

comment created time in 3 months

pull request comment tensorflow/tensorflow

Lite: Slice Op refactored

Could you please add a high-level description of what this CL does? Thanks.

ANSHUMAN87

comment created time in 3 months

Pull request review comment tensorflow/tensorflow

TFLite: Div op Neon optimization

 inline void BroadcastMulDispatch(
                        input2_data, output_shape, output_data);
 }
 
+inline void Div(const ArithmeticParams& params,
+                const RuntimeShape& input1_shape, const float* input1_data,
+                const RuntimeShape& input2_shape, const float* input2_data,
+                const RuntimeShape& output_shape, float* output_data) {
+  gemmlowp::ScopedProfilingLabel label("Div");
+  const float output_activation_min = params.float_activation_min;
+  const float output_activation_max = params.float_activation_max;
+
+  int i = 0;
+  const int size = MatchingFlatSize(input1_shape, input2_shape, output_shape);
+#ifdef USE_NEON
+  static constexpr int kNewtonSteps = 2;

Could you please add a comment on this parameter? How is it used, and what is the accuracy implication of kNewtonSteps being 2?
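
For context, here is the usual NEON reciprocal idiom I'd expect kNewtonSteps to control (a sketch of my understanding, not necessarily the exact code in this PR; the helper name is just illustrative): vrecpeq_f32 only gives a rough reciprocal estimate (~8 bits), and each Newton-Raphson refinement with vrecpsq_f32 roughly doubles the precision, so two steps get close to full float32 accuracy at the cost of two extra multiply/refine pairs per vector.

 #include <arm_neon.h>

 // Divide numer by denom using reciprocal estimate + Newton-Raphson refinement.
 inline float32x4_t DivNewton(float32x4_t numer, float32x4_t denom,
                              int newton_steps) {
   float32x4_t recip = vrecpeq_f32(denom);  // rough initial reciprocal estimate
   for (int s = 0; s < newton_steps; ++s) {
     recip = vmulq_f32(recip, vrecpsq_f32(denom, recip));  // one refinement step
   }
   return vmulq_f32(numer, recip);
 }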

mwtarnowski

comment created time in 3 months

pull request comment tensorflow/tensorflow

Implement reference kernel for Softmax using CMSIS-NN

Thanks for the contribution. The main logic is similar to the 8-bit quantized softmax code we had before we switched to look-up tables (https://github.com/tensorflow/tensorflow/commit/96b38db29c3ad3b1c1397f57bca85e3df3a1ac5f#diff-77415d2cb3521cc7859a3349382057ad).

Assigning to Pete who knows the micro code better.

giorgio-arenarm

comment created time in 4 months
