Anidh Singh anidh @myelinfoundry-2019 Bengaluru https://www.linkedin.com/in/anidh Hey, I am a Product Engineer working at Myelin Foundry in India. I code in Python, developing various AI and deep learning solutions.

anidh/coursera-ml-python 0

This repository contains Python implementations of all the exercises from the course by Andrew Ng. In some places I have used existing Python implementations such as scikit-learn, but in most places the code has been written from scratch.

anidh/deep-learning 0

A repository for all of my deep learning projects.

anidh/django-url-shortener 0

A Django web application that shortens a provided URL and then redirects users to the original URL.

anidh/espcn-keras 0

Quantization-aware training code for the ESPCN model, implemented in Keras using the TensorFlow Model Optimization Toolkit.

anidh/hackerearth 0

A repository of my solutions to questions on HackerEarth.

anidh/python-udemy 0

A repository for all the code written for a Udemy course on Python.

issue comment tensorflow/tensorflow

[TFLite] Failed to create Hexagon delegate on Oneplus 5

@anidh Yeah, I also found the report about the DSP vulnerability, so I always doubted whether OnePlus locked it. I tried rooting my phone temporarily and running "setenforce 0", but "adb shell root" didn't work and showed "Permission denied". Did you mean rooting the phone permanently? Thanks, you have helped me so much.

I haven't tried temporary rooting; I rooted my phone permanently. The setenforce command worked for me once but stopped working after a reboot.

anidh

comment created time in 14 hours

issue comment tensorflow/tensorflow

[TFLite] Failed to create Hexagon delegate on Oneplus 5

@anidh Thanks for your reply. I have been working on this problem for a long time, because I had installed an APK that always worked well on the DSP runtime and the AIP runtime, but recently I suddenly found it didn't work. I use that app project in my master's thesis. I tried one SNPE SDK demo (image-classifiers), and the same error log always appeared. I think it may be some permission in Android 10 or the system's .so files. I tried many solutions, but none of them worked either. I also sought help from OnePlus tech support and decided to wait for OnePlus's reply. Hope it will help. Thanks a lot.

I can say that we contacted Qualcomm regarding this issue, and they said that locking the DSP is up to the OEM. I think it's related to the DSP vulnerability found this year. I figured out a way to use the DSP by rooting my phone and changing the SELinux policy. This gave me access back to the DSP.

anidh

comment created time in 14 hours

issue comment tensorflow/tensorflow

[TFLite] Failed to create Hexagon delegate on Oneplus 5

@vikramambrose I have the same error log as you. I set minSdkVersion <= 22 but it still went wrong, and I changed build.gradle many times, but it didn't work. Android 10, OnePlus 7. Is there any other difference you think may be wrong? Maybe Android 10 doesn't allow reading the DSP files? Your reply may be helpful for me. Thanks a lot.

I upgraded my OnePlus 5 to Android 10 and have been unable to access the DSP since then, so that could be the issue. I would suggest taking a look at the Snapdragon Neural Processing Engine, running the basic examples there on the DSP runtime, and seeing if it works.

anidh

comment created time in 15 hours

issue closed tensorflow/tensorflow

Cannot see per operator profiling in android studio with tensorflow lite models

URL(s) with the issue:

Please provide a link to the documentation entry, for example:

https://www.tensorflow.org/lite/performance/measurement#adding_trace_events_in_java_code

https://www.tensorflow.org/lite/performance/images/as_traces.png

Description of issue (what needs changing):

I've added the trace sections to the code around the interpreter.run() command, but when I start to profile the app I cannot see the per-operator profiling shown in the second link.

Clear description

This will help us see how much time is taken by each layer in the tflite model, so that we can change our models accordingly.

My Code

Trace.beginSection("runInference");
tflite_sr.run(ybuffer, sr_y_channel);
Trace.endSection();

closed time in a month

anidh

issue comment tensorflow/tensorflow

Cannot see per operator profiling in android studio with tensorflow lite models

Hi Anidh, I guess you may see a result like https://www.tensorflow.org/lite/performance/images/as_traces.png. The result window of the CPU profiler consists of two panes: the left one is a timeline view of the traces and the right one is an analysis report. The 'All threads' tab of the analysis report shows the cumulative timing values. You can click each section of a run in the left pane. For example, if you click one frame of the 'runInference' sections, a 'runInference' tab will appear in the right pane and show the report for only that frame, including child sections such as tflite op invocations. Moreover, if you click another frame of the 'runInference' sections, the analysis report will show the values for that new frame.

@hajuho

Thanks a lot. The steps you mentioned worked perfectly.

anidh

comment created time in a month

issue comment tensorflow/tensorflow

Cannot see per operator profiling in android studio with tensorflow lite models

Hi @hajuho,

Thanks a lot for the quick reply.

Can you see the 'runInference' trace while you cannot see the per-operator profiling result?

Yes, I'm able to see the "runInference" trace when I'm doing a systrace.

Did you use org.tensorflow:tensorflow-lite:0.0.0-nightly for the dependency? tensorflow-lite:2.3.0 and older stable aar libraries don't have this feature.

I was using a stable build of the aar file. I have changed it to implementation 'org.tensorflow:tensorflow-lite:0.0.0-nightly'.

Is the nightly library the up-to-date version? In your app's build.gradle file, add 'changing = true' to make sure the library is automatically refreshed:

implementation('org.tensorflow:tensorflow-lite:0.0.0-nightly') { changing = true }

I was not using the nightly library and thus never used this flag. I have now included it.

Did you run the command 'adb shell setprop debug.tflite.trace 1' before you started the profiler? FYI, this property value is removed if you reboot the device.

No, I was not using any such command. I have run it now, and the profiler is showing me all of the layers and the timing of each and every one.
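Putting the working pieces together, here is a condensed sketch of the setup that makes per-operator traces appear (the class and method names are illustrative; the Trace calls and interpreter usage mirror the code in this thread):

import android.os.Trace;
import java.nio.ByteBuffer;
import org.tensorflow.lite.Interpreter;

class TracedInference {
  // Prerequisites outside this code:
  //   build.gradle: implementation('org.tensorflow:tensorflow-lite:0.0.0-nightly') { changing = true }
  //   before profiling: adb shell setprop debug.tflite.trace 1  (cleared on reboot)

  private final Interpreter tflite_sr;

  TracedInference(Interpreter interpreter) {
    this.tflite_sr = interpreter;
  }

  void run(ByteBuffer input, ByteBuffer output) {
    Trace.beginSection("runInference");   // one frame per call in the profiler timeline
    tflite_sr.run(input, output);         // per-op child sections appear under this frame
    Trace.endSection();
  }
}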

I still have one doubt: the inference I ran was over some 100 continuous frames, so is the reported timing cumulative, or is it averaged over all of the runs? Moreover, if I want to see the timing for only one of the frames, is there any method to do it for a single run?

Sorry, I closed the issue by mistake, so I'm reopening it. Thanks a lot for your help.

anidh

comment created time in a month


issue comment tensorflow/tensorflow

Cannot see per operator profiling in android studio with tensorflow lite models

  • adb shell setprop debug.tflite.trace 1

Hi @hajuho,

Thanks a lot for the quick reply.

anidh

comment created time in a month

issue comment tensorflow/tensorflow

Cannot see per operator profiling in android studio with tensorflow lite models

Hi @amahendrakar, thanks for replying. The code which I'm using to do the inference is:


  public ByteBuffer sr_exec(byte[] y_channel_inp) {
    //gpuDelegate = new GpuDelegate();
    //tfliteOptions_sr.addDelegate(gpuDelegate);
    //tfliteOptions_sr.setNumThreads(numThreads);

    ByteBuffer ybuffer = ByteBuffer.allocateDirect(i_prod).order(ByteOrder.nativeOrder());
    ybuffer.rewind();
    ybuffer.put(y_channel_inp);

    ByteBuffer out_channel = ByteBuffer.allocateDirect(o_prod).order(ByteOrder.nativeOrder());

    try {
      Trace.beginSection("recognizeImage");
      Trace.beginSection("runInference");
      tflite_sr.run(ybuffer, out_channel); // input buffer prepared above
      Trace.endSection();
      Trace.endSection();
    } catch (Exception e) {
      Log.d(TAG, "Error in Inference: " + e);
    }
    return out_channel;
  }

This is the code used for the inference; the input buffer is of int type. When I do the profiling on the mobile device, I am not able to see the layer-wise timing mentioned in the documentation.

Thanks,
Anidh
anidh

comment created time in 2 months

issue opened tensorflow/tensorflow

Cannot see per operator profiling in android studio with tensorflow lite models

URL(s) with the issue:

Please provide a link to the documentation entry, for example:

https://www.tensorflow.org/lite/performance/measurement#adding_trace_events_in_java_code

https://www.tensorflow.org/lite/performance/images/as_traces.png

Description of issue (what needs changing):

I've added the trace sections to the code around the interpreter.run() command, but when I start to profile the app I cannot see the per-operator profiling shown in the second link.

Clear description

This will help us see how much time is taken by each layer in the tflite model, so that we can change our models accordingly.

My Code

Trace.beginSection("runInference");
tflite_sr.run(ybuffer, sr_y_channel);
Trace.endSection();

created time in 2 months

issue closed tensorflow/tensorflow

Timing discrepancy between tflite benchmark tool and android

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 20.04
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: Oneplus 5, Oppo F11 Pro
  • TensorFlow installed from (source or binary): Binary
  • TensorFlow version (use command below): 1.13.1
  • Python version: 3.7
  • Bazel version (if compiling from source): NA
  • GCC/Compiler version (if compiling from source): NA
  • CUDA/cuDNN version: 10.1
  • GPU model and memory: RTX 2060 6 GB

Describe the current behavior

I have created a super-resolution model using tf 1.13.1 and used the tflite benchmark tool to measure the timing on my Android device. The timing I'm getting is around 31-35 ms, using the command:

adb shell taskset f0 /data/local/tmp/benchmark_model --graph=/data/local/tmp/main.tflite --enable_op_profiling=true --num_threads=4

Here is the output from the benchmark tool -

Duplicate flags: num_threads
Log parameter values verbosely: [0]
Num threads: [4]
Graph: [/data/local/tmp/main.tflite]
Enable op profiling: [1]
#threads used for CPU inference: [4]
Loaded model /data/local/tmp/main.tflite
The input model file size (MB): 0.004448
Initialized session in 0.855ms.
Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
count=15 first=49060 curr=33682 min=32825 max=49060 avg=34825.1 std=3930

Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
count=50 first=32891 curr=33978 min=32641 max=34005 avg=33209.4 std=450

Inference timings in us: Init: 855, First inference: 49060, Warmup (avg): 34825.1, Inference (avg): 33209.4
Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
Peak memory footprint (MB): init=0 overall=12.8672
Profiling Info for Benchmark Initialization:
============================== Run Order ==============================
	             [node type]	          [start]	  [first]	 [avg ms]	     [%]	  [cdf%]	  [mem KB]	[times called]	[Name]
	         AllocateTensors	            0.000	    0.091	    0.091	100.000%	100.000%	     0.000	        1	AllocateTensors/0

============================== Top by Computation Time ==============================
	             [node type]	          [start]	  [first]	 [avg ms]	     [%]	  [cdf%]	  [mem KB]	[times called]	[Name]
	         AllocateTensors	            0.000	    0.091	    0.091	100.000%	100.000%	     0.000	        1	AllocateTensors/0

Number of nodes executed: 1
============================== Summary by node type ==============================
	             [Node type]	  [count]	  [avg ms]	    [avg %]	    [cdf %]	  [mem KB]	[times called]
	         AllocateTensors	        1	     0.091	   100.000%	   100.000%	     0.000	        1

Timings (microseconds): count=1 curr=91
Memory (bytes): count=0
1 nodes observed

Operator-wise Profiling Info for Regular Benchmark Runs:
============================== Run Order ==============================
	             [node type]	          [start]	  [first]	 [avg ms]	     [%]	  [cdf%]	  [mem KB]	[times called]	[Name]
	              LAYER 0	            0.000	    0.215	    0.196	  0.591%	  0.591%	     0.000	        1	[name]:0
	              LAYER 1	            0.197	    1.824	    1.855	  5.590%	  6.181%	     0.000	        1	[name]:1
	              LAYER 2	            2.053	    4.071	    4.084	 12.306%	 18.487%	     0.000	        1	[name]:2
	              LAYER 3	            6.138	    1.711	    1.708	  5.147%	 23.634%	     0.000	        1	[name]:3
	              ACTIVATION	     7.848	    5.183	    5.190	 15.638%	 39.272%	     0.000	        1	[name]:4
	              LAYER	           13.039	    2.482	    2.464	  7.424%	 46.696%	     0.000	        1	[name]:5
	              LAYER	           15.505	    2.428	    2.466	  7.431%	 54.126%	     0.000	        1	[name]:6
	              LAYER	           17.972	    2.457	    2.431	  7.325%	 61.451%	     0.000	        1	[name]:7
	              LAYER	           20.404	    3.274	    3.220	  9.703%	 71.154%	     0.000	        1	[name]:8
	              LAYER	           23.626	    5.014	    5.039	 15.183%	 86.337%	     0.000	        1	[name]:9
	              LAYER	           28.666	    2.939	    3.242	  9.769%	 96.105%	     0.000	        1	[name]:10
	              LAYER	           31.910	    0.759	    0.772	  2.326%	 98.431%	     0.000	        1	[name]:11
	              LAYER	           32.684	    0.513	    0.521	  1.569%	100.000%	     0.000	        1	[y]:12

============================== Top by Computation Time ==============================
	             [node type]	          [start]	  [first]	 [avg ms]	     [%]	  [cdf%]	  [mem KB]	[times called]	[Name]
	         ACTIVATION	            7.848	    5.183	    5.190	 15.638%	 15.638%	     0.000	        1      [name]:4
	         ACTIVATION	           23.626	    5.014	    5.039	 15.183%	 30.821%	     0.000	        1	[name]:9
	                   LAYER	            2.053	    4.071	    4.084	 12.306%	 43.127%	     0.000	        1	[name]:2
	                   LAYER	           28.666	    2.939	    3.242	  9.769%	 52.895%	     0.000	        1	[name]:10
	                   LAYER	           20.404	    3.274	    3.220	  9.703%	 62.598%	     0.000	        1	[name]:8
	                   LAYER           15.505	    2.428	    2.466	  7.431%	 70.029%	     0.000	        1	[name]:6
	                   LAYER	           13.039	    2.482	    2.464	  7.424%	 77.453%	     0.000	        1	[name]:5
	                   LAYER	           17.972	    2.457	    2.431	  7.325%	 84.777%	     0.000	        1	[name]:7
	                   LAYER	            0.197	    1.824	    1.855	  5.590%	 90.368%	     0.000	        1	[name]:1
	                   LAYER	            6.138	    1.711	    1.708	  5.147%	 95.515%	     0.000	        1	[name]:3

Number of nodes executed: 13
============================== Summary by node type ==============================
	             [Node type]	  [count]	  [avg ms]	    [avg %]	    [cdf %]	  [mem KB]	[times called]
	                  LAYER	        3	     8.256	    24.883%	    24.883%	     0.000	        3
	          ACTIVATION	        1	     5.189	    15.639%	    40.523%	     0.000	        1
	          ACTIVATION	        1	     5.038	    15.184%	    55.707%	     0.000	        1
	                  LAYER	        1	     3.241	     9.768%	    65.475%	     0.000	        1
	                  LAYER	        1	     3.220	     9.705%	    75.180%	     0.000	        1
	                  LAYER	        1	     2.463	     7.423%	    82.603%	     0.000	        1
	                  LAYER	        1	     2.430	     7.324%	    89.927%	     0.000	        1
	                  LAYER	        1	     1.855	     5.591%	    95.518%	     0.000	        1
	                  LAYER	        1	     0.771	     2.324%	    97.842%	     0.000	        1
	                  LAYER	        1	     0.520	     1.567%	    99.409%	     0.000	        1
	                  LAYER	        1	     0.196	     0.591%	   100.000%	     0.000	        1

Timings (microseconds): count=50 first=32870 curr=33957 min=32618 max=33980 avg=33186.5 std=450
Memory (bytes): count=0
13 nodes observed

Here we can clearly see that the timing I'm getting is only about 33 ms, and it remains consistent.

Describe the expected behavior

When I use the same model in the Android environment, the timing increases to around 66 ms, which is double the expected time. This 66 ms is the time taken by the interpreter.run() command.
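One thing worth ruling out when comparing against the benchmark tool is warm-up cost, since the tool discards warm-up iterations before reporting Inference (avg), as its output above shows. A minimal sketch of a comparable in-app measurement, assuming the same Interpreter and direct byte buffers as in the code below (the class name and iteration counts are illustrative):

import java.nio.ByteBuffer;
import org.tensorflow.lite.Interpreter;

class InferenceTimer {
  // Averages interpreter.run() over many iterations after a warm-up,
  // mirroring the tool's Warmup (avg) vs Inference (avg) split.
  static double averageMillis(Interpreter interpreter, ByteBuffer input, ByteBuffer output) {
    for (int i = 0; i < 15; i++) {   // warm-up runs, discarded
      input.rewind();
      output.rewind();
      interpreter.run(input, output);
    }
    int runs = 50;                   // measured runs, like the tool's second pass
    long start = System.nanoTime();
    for (int i = 0; i < runs; i++) {
      input.rewind();
      output.rewind();
      interpreter.run(input, output);
    }
    return (System.nanoTime() - start) / 1e6 / runs;
  }
}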

Standalone code to reproduce the issue

Here is the Java code which I'm using for doing inference:

import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import android.content.Context;
import android.util.Log;
import java.io.FileInputStream;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import org.tensorflow.lite.Interpreter;
import org.tensorflow.lite.gpu.GpuDelegate;
import android.content.res.AssetFileDescriptor;
import static com.android.volley.VolleyLog.TAG;


class sr_exec_tflite {

  private Context context;

  private int numThreads = 4;

  protected Interpreter tflite_sr;

  private GpuDelegate gpuDelegate = null;

  private static int count = 0;

  private String modelfile = "main.tflite";

  private int H_final = 1280; // change this
  private int W_final = 720;  // change this

  int width_I = 640, height_I = 360; // change this
  int width_O = W_final, height_O = H_final;

  int i_prod = width_I * height_I;
  int o_prod = width_O * height_O;

  private final Interpreter.Options tfliteOptions_sr = new Interpreter.Options();

  sr_exec_tflite(Context context) {
    this.context = context;
    try {
      byte[] br = new byte[4448]; // size of the tflite file in bytes
      InputStream isr = context.getAssets().open("main.tflite"); // the model file
      isr.read(br);
      System.out.println("----------------------------" + br.length);
      ByteBuffer bbr = ByteBuffer.allocateDirect(br.length);
      bbr.put(br);
      //bbr.order(ByteOrder.nativeOrder());

      tflite_sr = new Interpreter(bbr, tfliteOptions_sr);

    } catch (Exception e) {
      System.out.print("Error in Interpreter Initialization:" + e);
    }
  }

  public ByteBuffer sr_exec(byte[] y_channel_inp) {
    gpuDelegate = new GpuDelegate();
    tfliteOptions_sr.addDelegate(gpuDelegate);
    tfliteOptions_sr.setNumThreads(numThreads);
    tfliteOptions_sr.setAllowFp16PrecisionForFp32(true);

    ByteBuffer ybuffer = ByteBuffer.allocateDirect(i_prod).order(ByteOrder.nativeOrder());
    ybuffer.rewind();
    ybuffer.put(y_channel_inp);

    ByteBuffer sr_y_channel = ByteBuffer.allocateDirect(o_prod).order(ByteOrder.nativeOrder());

    try {
      long startTime = System.currentTimeMillis();
      tflite_sr.run(ybuffer, sr_y_channel); // sr
      long stopTime = System.currentTimeMillis();
      long elapsedTime = stopTime - startTime;
      System.out.println("&&&&&&&&&&&&&&&&&&&&&&&&&&&" + elapsedTime);

    } catch (Exception e) {
      Log.d(TAG, "Error in Inference: " + e);
    }
    return sr_y_channel;
  }

  private MappedByteBuffer loadModelFile(Context context, String MODEL_FILE) throws IOException {
    AssetFileDescriptor fileDescriptor = context.getAssets().openFd(MODEL_FILE);
    FileInputStream inputStream = new FileInputStream(fileDescriptor.getFileDescriptor());
    FileChannel fileChannel = inputStream.getChannel();
    long startOffset = fileDescriptor.getStartOffset();
    long declaredLength = fileDescriptor.getDeclaredLength();
    return fileChannel.map(FileChannel.MapMode.READ_ONLY, startOffset, declaredLength);
  }
}

Other info / logs

The one thing which I think could be a source of the problem is that while benchmarking I'm not using the GPU delegate, because when I benchmarked with the GPU delegate it gave me a very high time. So is there any way to use the CPU delegate on Android? I couldn't find any example.
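For what it's worth, there is no separate "CPU delegate": when no delegate is attached, the TFLite interpreter runs on its CPU kernels by default, which is what the benchmark command above measured (taskset f0, four threads). A minimal sketch of a matching Java setup (the class name is illustrative); note that Interpreter.Options must be fully configured before the Interpreter is constructed, unlike in sr_exec above, where the delegate is added after construction:

import java.io.IOException;
import java.io.FileInputStream;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import android.content.Context;
import android.content.res.AssetFileDescriptor;
import org.tensorflow.lite.Interpreter;

class CpuInterpreter {
  // Mirrors the benchmark tool's CPU run: no delegate, 4 threads.
  static Interpreter create(Context context, String modelFile) throws IOException {
    Interpreter.Options options = new Interpreter.Options();
    options.setNumThreads(4); // same as --num_threads=4; no addDelegate(...), so CPU kernels are used

    AssetFileDescriptor fd = context.getAssets().openFd(modelFile);
    FileChannel channel = new FileInputStream(fd.getFileDescriptor()).getChannel();
    MappedByteBuffer model = channel.map(
        FileChannel.MapMode.READ_ONLY, fd.getStartOffset(), fd.getDeclaredLength());

    // Options are read at construction time, so configure them before this call.
    return new Interpreter(model, options);
  }
}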

closed time in 2 months

anidh

issue comment tensorflow/tensorflow

[TFLite] Failed to create Hexagon delegate on Oneplus 5

Hi, sorry for the late reply. You probably have some other changes in your gradle files that impact packaging files. You can try googling how to package a shared library in your project (in case you have some advanced gradle files). If it is packaged correctly and doesn't work, feel free to reopen this bug or create a new one. Thanks.

Thanks for replying. I tried researching more around this issue and found that maybe the provided .so files aren't compatible with my phone's architecture. I found that you can generate your own .so file for the DSP delegate using Bazel with the command:

bazel build -c opt --config=android_arm64 tensorflow/lite/experimental/delegates/hexagon/java:tensorflow-lite-hexagon

But I'm getting an error:

INFO: Call stack for the definition of repository 'hexagon_nn' which is a third_party_http_archive (rule definition at /home/anidh/tensorflow/third_party/repo.bzl:219:28):
  • /home/anidh/tensorflow/third_party/hexagon/workspace.bzl:8:5
  • /home/anidh/tensorflow/tensorflow/workspace.bzl:54:5
  • /home/anidh/tensorflow/tensorflow/workspace.bzl:108:5
  • /home/anidh/tensorflow/WORKSPACE:19:1
ERROR: /home/anidh/.cache/bazel/_bazel_anidh/b7383c6eddfc2c0b3ec247d3a8b0f2de/external/local_config_cc/BUILD:47:1: in cc_toolchain_suite rule @local_config_cc//:toolchain: cc_toolchain_suite '@local_config_cc//:toolchain' does not contain a toolchain for cpu 'arm64-v8a'
ERROR: Analysis of target '//tensorflow/lite/experimental/delegates/hexagon/java:tensorflow-lite-hexagon' failed; build aborted: Analysis of target '@local_config_cc//:toolchain' failed; build aborted
INFO: Elapsed time: 0.122s
INFO: 0 processes.
FAILED: Build did NOT complete successfully (0 packages loaded, 1 target configured)

Any help will be appreciated :)

Hello, anidh. I met the same problem as you. Have you solved it?

Hi @yuqiu1233

I didn't get the time to follow up on this issue as I shifted my focus to running models on the GPU, but I think you need to run the ./configure script inside the tensorflow repository and provide the location of the SDK as well as the NDK and a bunch of other settings. This is how I was able to get things running for the GPU. Hope it helps, as this is all I know for now.
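For the GPU path mentioned above, the Java-side setup itself is short. A minimal sketch using only the classes already imported in the code earlier in this thread (the class name is illustrative, and the model buffer is assumed to be loaded elsewhere, e.g. via a loadModelFile-style helper):

import java.nio.MappedByteBuffer;
import org.tensorflow.lite.Interpreter;
import org.tensorflow.lite.gpu.GpuDelegate;

class GpuInterpreterFactory {
  // Attach the GPU delegate before constructing the Interpreter;
  // options added afterwards have no effect on an existing instance.
  static Interpreter create(MappedByteBuffer model) {
    GpuDelegate gpuDelegate = new GpuDelegate();
    Interpreter.Options options = new Interpreter.Options();
    options.addDelegate(gpuDelegate);
    return new Interpreter(model, options);
  }
}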

anidh

comment created time in 2 months

issue comment tensorflow/tensorflow

Timing discrepancy between tflite benchmark tool and android

@anidh, In the issue template you have mentioned the TensorFlow version as v1.15 and later you have stated that you have built the model using TF v2.0. Could you please specify the exact TensorFlow version you are using?

Also, please update TensorFlow to the latest version v2.3 and check if you are facing the same issue. Thanks!

Hi there, I'm sorry for that. The model was trained using tf 1.13.1 and the pb-to-tflite conversion was done using tf 1.15. The conversion command used was:

tflite_convert --output_file=espcnus_2x_harmonic_360p_rtr.tflite --graph_def_file=espcnus_2x_harmonic_360p_rtr.pb --input_arrays=x --output_arrays=y --input_shape=1,360,640,1 --inference_input_type=QUANTIZED_UINT8 --mean_values=0 --std_dev_values=255

I've updated the original post with the new details.

anidh

comment created time in 2 months

issue opened tensorflow/tensorflow

Timing discrepancy between tflite benchmark tool and android

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 20.04
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: Oneplus 5, Oppo F11 Pro
  • TensorFlow installed from (source or binary): Binary
  • TensorFlow version (use command below): 1.15
  • Python version: 3.4
  • Bazel version (if compiling from source): 2.0
  • GCC/Compiler version (if compiling from source): NA
  • CUDA/cuDNN version: 10.1
  • GPU model and memory: RTX 2060 6 GB

Describe the current behavior

I have created a super-resolution model using tf 2.0 and used the tflite benchmark tool to measure the timing on my Android device. The timing I'm getting is around 31-35 ms, using the command:

adb shell taskset f0 /data/local/tmp/benchmark_model --graph=/data/local/tmp/main.tflite --enable_op_profiling=true --num_threads=4

Here is the output from the benchmark tool -

Duplicate flags: num_threads
Log parameter values verbosely: [0]
Num threads: [4]
Graph: [/data/local/tmp/espcn_harmonic_2x_slim_w_cl.tflite]
Enable op profiling: [1]
#threads used for CPU inference: [4]
Loaded model /data/local/tmp/espcn_harmonic_2x_slim_w_cl.tflite
The input model file size (MB): 0.004448
Initialized session in 0.855ms.
Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
count=15 first=49060 curr=33682 min=32825 max=49060 avg=34825.1 std=3930

Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
count=50 first=32891 curr=33978 min=32641 max=34005 avg=33209.4 std=450

Inference timings in us: Init: 855, First inference: 49060, Warmup (avg): 34825.1, Inference (avg): 33209.4
Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
Peak memory footprint (MB): init=0 overall=12.8672
Profiling Info for Benchmark Initialization:
============================== Run Order ==============================
	             [node type]	          [start]	  [first]	 [avg ms]	     [%]	  [cdf%]	  [mem KB]	[times called]	[Name]
	         AllocateTensors	            0.000	    0.091	    0.091	100.000%	100.000%	     0.000	        1	AllocateTensors/0

============================== Top by Computation Time ==============================
	             [node type]	          [start]	  [first]	 [avg ms]	     [%]	  [cdf%]	  [mem KB]	[times called]	[Name]
	         AllocateTensors	            0.000	    0.091	    0.091	100.000%	100.000%	     0.000	        1	AllocateTensors/0

Number of nodes executed: 1
============================== Summary by node type ==============================
	             [Node type]	  [count]	  [avg ms]	    [avg %]	    [cdf %]	  [mem KB]	[times called]
	         AllocateTensors	        1	     0.091	   100.000%	   100.000%	     0.000	        1

Timings (microseconds): count=1 curr=91
Memory (bytes): count=0
1 nodes observed

Operator-wise Profiling Info for Regular Benchmark Runs:
============================== Run Order ==============================
	             [node type]	          [start]	  [first]	 [avg ms]	     [%]	  [cdf%]	  [mem KB]	[times called]	[Name]
	              DEQUANTIZE	            0.000	    0.215	    0.196	  0.591%	  0.591%	     0.000	        1	[x_dequantized]:0
	              MIRROR_PAD	            0.197	    1.824	    1.855	  5.590%	  6.181%	     0.000	        1	[espcn/MirrorPad]:1
	       DEPTHWISE_CONV_2D	            2.053	    4.071	    4.084	 12.306%	 18.487%	     0.000	        1	[espcn/conv1/separable_conv2d/depthwise]:2
	       DEPTHWISE_CONV_2D	            6.138	    1.711	    1.708	  5.147%	 23.634%	     0.000	        1	[espcn/conv1/BiasAdd]:3
	                    TANH	            7.848	    5.183	    5.190	 15.638%	 39.272%	     0.000	        1	[espcn/conv1/Tanh]:4
	       SPACE_TO_BATCH_ND	           13.039	    2.482	    2.464	  7.424%	 46.696%	     0.000	        1	[espcn/conv2/separable_conv2d/SpaceToBatchND]:5
	       DEPTHWISE_CONV_2D	           15.505	    2.428	    2.466	  7.431%	 54.126%	     0.000	        1	[espcn/conv2/separable_conv2d/depthwise]:6
	       BATCH_TO_SPACE_ND	           17.972	    2.457	    2.431	  7.325%	 61.451%	     0.000	        1	[espcn/conv2/separable_conv2d/BatchToSpaceND]:7
	                 CONV_2D	           20.404	    3.274	    3.220	  9.703%	 71.154%	     0.000	        1	[espcn/conv2/BiasAdd]:8
	                LOGISTIC	           23.626	    5.014	    5.039	 15.183%	 86.337%	     0.000	        1	[espcn/conv2/Sigmoid]:9
	          DEPTH_TO_SPACE	           28.666	    2.939	    3.242	  9.769%	 96.105%	     0.000	        1	[espcn/d2s]:10
	                     MUL	           31.910	    0.759	    0.772	  2.326%	 98.431%	     0.000	        1	[espcn/mul]:11
	                    CAST	           32.684	    0.513	    0.521	  1.569%	100.000%	     0.000	        1	[y]:12

============================== Top by Computation Time ==============================
	             [node type]	          [start]	  [first]	 [avg ms]	     [%]	  [cdf%]	  [mem KB]	[times called]	[Name]
	                    TANH	            7.848	    5.183	    5.190	 15.638%	 15.638%	     0.000	        1	[espcn/conv1/Tanh]:4
	                LOGISTIC	           23.626	    5.014	    5.039	 15.183%	 30.821%	     0.000	        1	[espcn/conv2/Sigmoid]:9
	       DEPTHWISE_CONV_2D	            2.053	    4.071	    4.084	 12.306%	 43.127%	     0.000	        1	[espcn/conv1/separable_conv2d/depthwise]:2
	          DEPTH_TO_SPACE	           28.666	    2.939	    3.242	  9.769%	 52.895%	     0.000	        1	[espcn/d2s]:10
	                 CONV_2D	           20.404	    3.274	    3.220	  9.703%	 62.598%	     0.000	        1	[espcn/conv2/BiasAdd]:8
	       DEPTHWISE_CONV_2D	           15.505	    2.428	    2.466	  7.431%	 70.029%	     0.000	        1	[espcn/conv2/separable_conv2d/depthwise]:6
	       SPACE_TO_BATCH_ND	           13.039	    2.482	    2.464	  7.424%	 77.453%	     0.000	        1	[espcn/conv2/separable_conv2d/SpaceToBatchND]:5
	       BATCH_TO_SPACE_ND	           17.972	    2.457	    2.431	  7.325%	 84.777%	     0.000	        1	[espcn/conv2/separable_conv2d/BatchToSpaceND]:7
	              MIRROR_PAD	            0.197	    1.824	    1.855	  5.590%	 90.368%	     0.000	        1	[espcn/MirrorPad]:1
	       DEPTHWISE_CONV_2D	            6.138	    1.711	    1.708	  5.147%	 95.515%	     0.000	        1	[espcn/conv1/BiasAdd]:3

Number of nodes executed: 13
============================== Summary by node type ==============================
	             [Node type]	  [count]	  [avg ms]	    [avg %]	    [cdf %]	  [mem KB]	[times called]
	       DEPTHWISE_CONV_2D	        3	     8.256	    24.883%	    24.883%	     0.000	        3
	                    TANH	        1	     5.189	    15.639%	    40.523%	     0.000	        1
	                LOGISTIC	        1	     5.038	    15.184%	    55.707%	     0.000	        1
	          DEPTH_TO_SPACE	        1	     3.241	     9.768%	    65.475%	     0.000	        1
	                 CONV_2D	        1	     3.220	     9.705%	    75.180%	     0.000	        1
	       SPACE_TO_BATCH_ND	        1	     2.463	     7.423%	    82.603%	     0.000	        1
	       BATCH_TO_SPACE_ND	        1	     2.430	     7.324%	    89.927%	     0.000	        1
	              MIRROR_PAD	        1	     1.855	     5.591%	    95.518%	     0.000	        1
	                     MUL	        1	     0.771	     2.324%	    97.842%	     0.000	        1
	                    CAST	        1	     0.520	     1.567%	    99.409%	     0.000	        1
	              DEQUANTIZE	        1	     0.196	     0.591%	   100.000%	     0.000	        1

Timings (microseconds): count=50 first=32870 curr=33957 min=32618 max=33980 avg=33186.5 std=450
Memory (bytes): count=0
13 nodes observed

Here we can clearly see that the timing I'm getting is only about 33 ms, and it remains consistent.

Describe the expected behavior

When I use the same model in the Android environment, the timing increases to around 66 ms, which is double the expected time. This 66 ms is the time taken by the interpreter.run() command.

Standalone code to reproduce the issue

Here is the Java code which I'm using for doing inference:

import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import android.content.Context;
import android.util.Log;
import java.io.FileInputStream;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import org.tensorflow.lite.Interpreter;
import org.tensorflow.lite.gpu.GpuDelegate;
import android.content.res.AssetFileDescriptor;
import static com.android.volley.VolleyLog.TAG;


class sr_exec_tflite {

  private Context context;

  private int numThreads = 4;

  protected Interpreter tflite_sr;

  private GpuDelegate gpuDelegate = null;

  private static int count = 0;

  private String modelfile = "espcn_harmonic_2x_slim_wo_cl.tflite";

  private int H_final = 1280; // change this
  private int W_final = 720;  // change this

  int width_I = 640, height_I = 360; // change this
  int width_O = W_final, height_O = H_final;

  int i_prod = width_I * height_I;
  int o_prod = width_O * height_O;

  private final Interpreter.Options tfliteOptions_sr = new Interpreter.Options();

  sr_exec_tflite(Context context) {
    this.context = context;
    try {
      byte[] br = new byte[4448]; // size of the tflite file in bytes
      InputStream isr = context.getAssets().open("espcn_harmonic_2x_slim_w_cl.tflite"); // the model file
      isr.read(br);
      System.out.println("----------------------------" + br.length);
      ByteBuffer bbr = ByteBuffer.allocateDirect(br.length);
      bbr.put(br);
      //bbr.order(ByteOrder.nativeOrder());

      tflite_sr = new Interpreter(bbr, tfliteOptions_sr);

    } catch (Exception e) {
      System.out.print("Error in Interpreter Initialization:" + e);
    }
  }

  public ByteBuffer sr_exec(byte[] y_channel_inp) {
    gpuDelegate = new GpuDelegate();
    tfliteOptions_sr.addDelegate(gpuDelegate);
    tfliteOptions_sr.setNumThreads(numThreads);
    tfliteOptions_sr.setAllowFp16PrecisionForFp32(true);

    ByteBuffer ybuffer = ByteBuffer.allocateDirect(i_prod).order(ByteOrder.nativeOrder());
    ybuffer.rewind();
    ybuffer.put(y_channel_inp);

    ByteBuffer sr_y_channel = ByteBuffer.allocateDirect(o_prod).order(ByteOrder.nativeOrder());

    try {
      long startTime = System.currentTimeMillis();
      tflite_sr.run(ybuffer, sr_y_channel); // sr
      long stopTime = System.currentTimeMillis();
      long elapsedTime = stopTime - startTime;
      System.out.println("&&&&&&&&&&&&&&&&&&&&&&&&&&&" + elapsedTime);

    } catch (Exception e) {
      Log.d(TAG, "Error in Inference: " + e);
    }
    return sr_y_channel;
  }

  private MappedByteBuffer loadModelFile(Context context, String MODEL_FILE) throws IOException {
    AssetFileDescriptor fileDescriptor = context.getAssets().openFd(MODEL_FILE);
    FileInputStream inputStream = new FileInputStream(fileDescriptor.getFileDescriptor());
    FileChannel fileChannel = inputStream.getChannel();
    long startOffset = fileDescriptor.getStartOffset();
    long declaredLength = fileDescriptor.getDeclaredLength();
    return fileChannel.map(FileChannel.MapMode.READ_ONLY, startOffset, declaredLength);
  }
}

Other info / logs

The one thing which I think could be a source of the problem is that while benchmarking I'm not using the GPU delegate, because when I benchmarked with the GPU delegate it gave me a very high time. So is there any way to use the CPU delegate on Android? I couldn't find any example.

created time in 2 months
