Simple TFLite UNet slower on mobile GPU than CPU

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
  • OS Platform and Distribution: Linux Ubuntu 18.04, Android 8.0.0
  • Mobile device if the issue happens on mobile device: Huawei P20 Lite (ANE-LX1)
  • TensorFlow installed from (source or binary): Binary
  • TensorFlow version: v2.2.0-rc4-8-g2b96f3662b
  • Python version: 3.6.9
  • CUDA/cuDNN version: CUDA 10.1
  • GPU model: Mali-T830 MP2

Describe the current behavior

I have written a simple image segmentation model based on UNet in Keras, which I want to run on an Android device (Huawei P20 Lite). It consists solely of Conv2D, Conv2DTranspose, and Concatenate layers, all of which should be supported on mobile GPUs by TFLite according to the documentation. The model has ~1.9M parameters and expects input tensors of shape [1, 224, 224, 3] and type float32. It is not trained yet, as I want to check its inference performance before addressing accuracy. I exported the model using

# Convert the Keras model to a TFLite flatbuffer with default (float32) settings
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()
with tf.io.gfile.GFile('trained models/keras_unet.tflite', 'wb') as f:
  f.write(tflite_model)

You can find the exported TFLite model attached further down.

On the CPU of my device, inference takes about 1.3s on average using 6 threads. However, when employing the GPU delegate using

Interpreter.Options tfliteOptions = new Interpreter.Options();
tfliteModel = FileUtil.loadMappedFile(this, MODEL_NAME_TF);
GpuDelegate gpu_del = new GpuDelegate();
tfliteOptions.addDelegate(gpu_del);
tflite = new Interpreter(tfliteModel, tfliteOptions);

the performance drops to about 3.5s per inference, which is more than 2.5 times slower compared to CPU.
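
For completeness, the delegate could also be constructed with explicit options; below is a sketch of that variant (untested on my side, with option names taken from the TFLite GPU Java API as far as I can tell), in case reduced precision or a different inference preference matters here:

// Sketch: construct the delegate with explicit options instead of the defaults.
GpuDelegate.Options delegateOptions = new GpuDelegate.Options()
    .setPrecisionLossAllowed(true)  // allow fp16 arithmetic on the GPU
    .setInferencePreference(GpuDelegate.Options.INFERENCE_PREFERENCE_SUSTAINED_SPEED);
GpuDelegate gpu_del = new GpuDelegate(delegateOptions);
tfliteOptions.addDelegate(gpu_del);
tflite = new Interpreter(tfliteModel, tfliteOptions);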

Describe the expected behavior

As the UNet model consists only of supported operations, the mobile GPU supports OpenGL ES 3.2, and the model export and import as well as the GPU delegate creation follow the TFLite Android tutorial and the GPU delegate guide, I would expect the UNet to run faster on the GPU than on the CPU.

I would be glad if you could point out any implementation issues on my side that could cause the comparatively worse performance on the GPU. The guides and tutorials unfortunately do not offer any deeper insight. Is it possible that the model is simply too large for my GPU (and would there be an error if that were the case)?

Standalone code to reproduce the issue

The following model exhibits the described behavior: tflite_model.zip

Other info / logs

I have previously implemented the exact same model in PyTorch; its CPU performance is about the same. I then switched to Keras in order to use TFLite's GPU delegate. The UNet-like model is built with Keras's subclassing API.

I have read here that it might be beneficial to use a tensor with c=4 channels instead to avoid memory copies. I tried adding a fourth channel to the model and the input images (shape [1, 224, 224, 4]), but this made performance even worse.
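
To illustrate what this means on the input side, here is a sketch of how such a 4-channel float input could be packed from a bitmap (hypothetical helper code; bitmap is assumed to be the 224x224 input image, and the fourth channel is just padding):

// Pack a 224x224 ARGB Bitmap into a [1, 224, 224, 4] float32 buffer
// (4 channels * 4 bytes per float) so the layout matches an RGBA texture.
ByteBuffer input = ByteBuffer.allocateDirect(1 * 224 * 224 * 4 * 4)
        .order(ByteOrder.nativeOrder());
int[] pixels = new int[224 * 224];
bitmap.getPixels(pixels, 0, 224, 0, 0, 224, 224);
for (int p : pixels) {
    input.putFloat(((p >> 16) & 0xFF) / 255.0f); // R
    input.putFloat(((p >> 8) & 0xFF) / 255.0f);  // G
    input.putFloat((p & 0xFF) / 255.0f);         // B
    input.putFloat(0.0f);                        // unused padding channel
}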

The performance has been tested using a custom Android application. The timings mentioned above refer to the following call:

tflite.run(inputTensorBuffer.getBuffer(), outputProbabilityBuffer.getBuffer().rewind());

I also attempted to profile the TFLite model using the TFLite Model Benchmark Tool, but the tool failed to build with Bazel. If this turns out to be relevant for solving the problem at hand, I will file a separate issue for it.
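
For reference, this is roughly how such a per-inference time could be measured around that call (a hypothetical measurement loop; NUM_RUNS is an assumed constant, and the warm-up run matters because the first GPU inference also compiles shaders):

// Discard one warm-up inference, then average wall-clock time over NUM_RUNS runs.
final int NUM_RUNS = 10;
tflite.run(inputTensorBuffer.getBuffer(), outputProbabilityBuffer.getBuffer().rewind()); // warm-up
long start = SystemClock.elapsedRealtime();
for (int i = 0; i < NUM_RUNS; i++) {
    tflite.run(inputTensorBuffer.getBuffer().rewind(),
               outputProbabilityBuffer.getBuffer().rewind());
}
long avgMs = (SystemClock.elapsedRealtime() - start) / NUM_RUNS;
Log.d("Benchmark", "Average inference time: " + avgMs + " ms");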

Finally, here are the outputs produced by the Python script for exporting the model and the Android test application: python_out.txt android_out.txt


Answer from renjie-liu

Hi Chao, can you help take a look?

thanks
