Simple TFLite UNet slower on mobile GPU than CPU
Describe the current behavior
I have written a simple image segmentation model based on UNet in Keras, which I want to use on an Android device (Huawei P20 Lite). It consists solely of Conv2D, Conv2DTranspose, and Concatenate layers, which should all be supported on mobile GPUs by TFLite according to the documentation. The model has ~1.9M parameters and expects input tensors of shape [1, 224, 224, 3] and type float32. It is not trained yet, as I want to check its performance first before addressing the accuracy. I exported the model using
```python
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()
with tf.io.gfile.GFile('trained models/keras_unet.tflite', 'wb') as f:
    f.write(tflite_model)
```
You can find the exported TFLite model attached further down.
On the CPU of my device, inference takes about 1.3s on average using 6 threads. However, when employing the GPU delegate using
```java
tfliteModel = FileUtil.loadMappedFile(this, MODEL_NAME_TF);
GpuDelegate gpu_del = new GpuDelegate();
tfliteOptions.addDelegate(gpu_del);
tflite = new Interpreter(tfliteModel, tfliteOptions);
```
the performance drops to about 3.5s per inference, which is more than 2.5 times slower than on the CPU.
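For reference, the per-inference timings above are simple wall-clock measurements. The sketch below shows a minimal, generic harness of the kind used; the actual `tflite.run(input, output)` call is replaced by a placeholder workload because the TFLite runtime is only available inside the app, and all names here are illustrative rather than taken from the app's code:

```java
public class InferenceTiming {
    /** Times a single run of the given workload and returns elapsed milliseconds. */
    static double timeMillis(Runnable workload) {
        long start = System.nanoTime();
        workload.run();
        return (System.nanoTime() - start) / 1_000_000.0;
    }

    public static void main(String[] args) {
        // Placeholder workload standing in for tflite.run(input, output);
        // in the real app this is the interpreter call being measured.
        double ms = timeMillis(() -> {
            long sum = 0;
            for (int i = 0; i < 1_000_000; i++) sum += i;
        });
        System.out.println("inference took " + ms + " ms");
    }
}
```

Averaging several such measurements (after a few warm-up runs, since the GPU delegate compiles shaders on the first inference) gives a more stable comparison between CPU and GPU.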
Describe the expected behavior
Since the UNet consists solely of supported operations, the device's GPU supports OpenGL ES 3.2, and the model export and import as well as the GPU delegate creation follow the TFLite Android tutorial and the GPU delegate guide, I would expect the UNet to run faster on the GPU than on the CPU.
I would be glad if you could point out any implementation issues on my side that could cause the comparatively poor GPU performance. The guides and tutorials unfortunately do not offer any deeper insight. Is it possible that the model is simply too large for my GPU (and would there be an error in that case)?
Standalone code to reproduce the issue
The following model exhibits the described behavior: tflite_model.zip
Other info / logs
I previously implemented the exact same model in PyTorch; its CPU performance is about the same. I then switched to Keras to utilize TFLite's GPU delegate, using a subclassed approach to build the UNet-like model.
I have read here that it might be beneficial to use a tensor with four channels (c=4) instead, to avoid memory copies. I tried adding a fourth channel to the model and the input images (shape [1, 224, 224, 4]), but that made performance even worse.
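The channel-padding experiment above can be reproduced on the host side. A minimal sketch (assuming a float32 RGB batch in NHWC layout, matching the model's [1, 224, 224, 3] input) of extending c=3 to c=4 with a zero-filled fourth channel:

```python
import numpy as np

# Dummy float32 RGB batch in NHWC layout, matching the model input [1, 224, 224, 3].
rgb = np.zeros((1, 224, 224, 3), dtype=np.float32)

# Append a zero-filled fourth channel so the tensor becomes [1, 224, 224, 4].
alpha = np.zeros(rgb.shape[:-1] + (1,), dtype=np.float32)
rgba = np.concatenate([rgb, alpha], axis=-1)

print(rgba.shape)  # (1, 224, 224, 4)
```

Note that the model's input layer (and every internal layer) must also be widened to 4 channels for this to work, which is presumably why the experiment changes the model's compute cost as well as its memory layout.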
The performance has been tested using a custom Android application. The timings mentioned above refer to the following call:
I also attempted to profile the TFLite model using the TFLite Model Benchmark Tool, but the tool failed to build with Bazel. If this is relevant to the problem at hand, I will file a separate issue for it.
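For completeness, the benchmark-tool workflow I attempted roughly follows the tool's README; the commands below are a sketch of that workflow (file names and the Android config flag are from my setup, not guaranteed to match other checkouts):

```shell
# From a TensorFlow source checkout; cross-compile the benchmark binary for Android.
bazel build -c opt --config=android_arm64 \
  tensorflow/lite/tools/benchmark:benchmark_model

# Push the binary and the model to the device.
adb push bazel-bin/tensorflow/lite/tools/benchmark/benchmark_model /data/local/tmp
adb push keras_unet.tflite /data/local/tmp
adb shell chmod +x /data/local/tmp/benchmark_model

# CPU baseline with 6 threads, then the GPU delegate run.
adb shell /data/local/tmp/benchmark_model \
  --graph=/data/local/tmp/keras_unet.tflite --num_threads=6
adb shell /data/local/tmp/benchmark_model \
  --graph=/data/local/tmp/keras_unet.tflite --use_gpu=true
```

The per-op profile from this tool would show which layers the GPU delegate actually accelerates and whether any fall back to CPU.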
Answer from renjie-liu:
Hi Chao, can you help take a look?