Jakob Knapp (jakobknapp), Vienna University of Technology, Vienna

dahe-cvl/vhh_cmc

VHH Project - Camera Movements Classification Package

dahe-cvl/vhh_core

VHH DL Framework Core Module

dahe-cvl/vhh_sbd

module to detect shots in historical videos

push event dahe-cvl/vhh_core

Jakob Büro

commit sha be4d6400db6a3809e9435c78f50ca045496d5cf9

Update README.md

push time in 5 days

push event dahe-cvl/vhh_core

Jakob Büro

commit sha 75c12f211a04e0ebe2797e887ef39c8ef4e1a90d

Plugin pip support - changed plugins in requirements.txt from git downloads to pip packages (PyPI) - adjusted imports in Plugin scripts

Jakob Büro

commit sha ea027807da4d14d5adc23ec7ac27ddc4b9196e0a

Relative result paths, plugin flags and core result formats - changed result paths in Config files to relative paths - automatically create result folder structure - added result folders to gitignore file - added possibility to select result format (local .csv, local .json or REST API .json) in Core Config - added flags to disable Plugins on demand in Core Config

Jakob Büro

commit sha ea442a44740bd49cf58f744a5d71c9246a537426

Update README.md - removed opencv section - provided link to PyTorch website

push time in 5 days

create branch dahe-cvl/vhh_core

branch : jknapp

created branch time in 5 days

push event dahe-cvl/vhh_stc

Jakob Büro

commit sha 0460735354bc5c62350370c318921e7736ee58ee

Setup Adjustment - Changed setup.py to use setuptools and wheel for setup - renamed package from stc to vhh_stc and changed imports accordingly - Extended README

Jakob Büro

commit sha 892b21474d91cb3fdfa94fd65b9bdd2e7bd206fd

Setup Meta-Data Fix corrected meta-data errors in setup.py

Jakob Büro

commit sha a1d7fd9a615365ed2a731d114f15a54cb3431430

Updated Readme with Infos for Release Generation

push time in 13 days

push event dahe-cvl/vhh_cmc

Jakob Büro

commit sha 3bd84a2256becd31de98f58786ecbbf8da425324

Setup Adjustment - Changed setup.py to use setuptools and wheel for setup - renamed package from cmc to vhh_cmc and changed imports accordingly - Extended README

Jakob Büro

commit sha 94fc1182606553e7c168067df43ad46d740426a1

Update README.md fixed duplicate note in README

Jakob Büro

commit sha 49b2f680d492ce4900ab60f351f40a280bfc1e45

Updated Readme with Infos for Release Generation

push time in 13 days

push event dahe-cvl/vhh_sbd

Jakob Büro

commit sha fcdcd9f35a4478d1e4e1c3ae6b609ea33bafdda9

Added more Infos for Release Generation - new branch for Sphinx doc

push time in 13 days

push event dahe-cvl/vhh_sbd

Jakob Büro

commit sha 15f158041278308db872929c1e5e0f22d8167384

Fixed Package Setup - Changed setup.py to only use setuptools and parse requirements.txt manually. - Renamed package from sbd to vhh_sbd. - Adjusted imports to reflect package renaming. - Extended README.

Jakob Büro

commit sha 493646b1322a6d7f8dc477aa8f7863ef8bc084af

CPU/GPU compatibility - Added USE_GPU Flag in Config - Adjusted Configuration and Model modules accordingly.

Jakob Büro

commit sha 084d14c04068413b104d8754a2c09869c4ce30e3

Added GPU Flag to Config

Jakob Büro

commit sha 658ddb254d8e9f4d209be14f1775bb226542fc7b

Release Instructions - Added instructions on how to build files for release and upload to PyPI.

Jakob Büro

commit sha c081a3062f67c7b4393a75cdb3a31ad30618e35d

Clarified Install Instructions

push time in 13 days

push event dahe-cvl/vhh_cmc

Jakob Büro

commit sha 94fc1182606553e7c168067df43ad46d740426a1

Update README.md fixed duplicate note in README

push time in a month

push event dahe-cvl/vhh_sbd

Jakob Büro

commit sha 084d14c04068413b104d8754a2c09869c4ce30e3

Added GPU Flag to Config

push time in a month

create branch dahe-cvl/vhh_cmc

branch : jknapp

created branch time in a month

push event dahe-cvl/vhh_stc

Jakob Büro

commit sha 892b21474d91cb3fdfa94fd65b9bdd2e7bd206fd

Setup Meta-Data Fix corrected meta-data errors in setup.py

push time in a month

create branch dahe-cvl/vhh_stc

branch : jknapp

created branch time in a month

create branch dahe-cvl/vhh_sbd

branch : jknapp

created branch time in a month

issue comment tensorflow/tensorflow

Simple TFLite UNet slower on mobile GPU than CPU

Hi Chao, thanks for checking! Yes, that seems reasonable. I guess I'll have to stick with the CPU for now.

One last question before I close this issue: which specs of a mobile GPU should I pay special attention to when considering a GPU delegate? Does TFLite make extensive use of vec4 computations in particular?

jakobknapp

comment created time in 3 months

issue comment tensorflow/tensorflow

Simple TFLite UNet slower on mobile GPU than CPU

Hi,

yes, I could get my hands on a Samsung Galaxy S6 for comparison. Unfortunately, I do not have root access and cannot use run-as either, so I could not run your benchmark tool on the S6. However, my Android test app behaves as expected: ~710 ms/frame on GPU vs. ~850 ms/frame on CPU.

So the S6 seems to be faster than the P20 Lite in general, and inference is faster on the GPU than on the CPU. On the other hand, it also seems to use OpenGL instead of OpenCL, according to the logs:

I/tflite: Created TensorFlow Lite delegate for GPU.
I/tflite: Initialized TensorFlow Lite runtime.
D/libEGL: loaded /vendor/lib64/egl/libGLES_mali.so
E/libEGL: validate_display:99 error 3008 (EGL_BAD_DISPLAY)
E/libEGL: call to OpenGL ES API with no current context (logged once per thread)
I/tflite: Initialized OpenGL-based API.
I/tflite: Created 1 GPU delegate kernels.

The OpenGL ES/OpenCL versions of the two phones (Huawei P20 Lite with Mali-T830 GPU vs. Samsung Galaxy S6 with Mali-T760 GPU) are almost identical. I used this app and this app for checking:

P20 Lite:

[Screenshot: P20 Lite driver info]

Galaxy S6:

[Screenshot: S6 driver info]

The main difference is that the S6 has 8 compute units while the P20 Lite has only 2. According to the Performance tab, the S6 (right) also seems to have a higher float vec4 throughput:

[Screenshot: performance comparison]

jakobknapp

comment created time in 3 months

issue comment tensorflow/tensorflow

Simple TFLite UNet slower on mobile GPU than CPU

Hi Chao, thanks for looking into that issue!

Your results seem to suggest that the model should be alright and the bad performance is either due to some error I have in my Android application or because my phone can't keep up.

I used the pre-built benchmark tool you linked, and the results are similar to those of my Android application. While inference is a bit faster overall, GPU inference is still slower than CPU inference by about the same factor.

Average Inference time on CPU: ~1s

WARNING: linker: "/data/data/org.picsart.bgremoval/tf_test/benchmark_tflite_model" unused DT entry: type 0xf arg 0x930
STARTING!
Duplicate flags: num_threads
Log parameter values verbosely: [0]
Num threads: [6]
Graph: [keras_unet.tflite]
Enable op profiling: [1]
CSV File to export profiling data to: [benchmark_cpu.csv]
#threads used for CPU inference: [6]
Loaded model keras_unet.tflite
INFO: Initialized TensorFlow Lite runtime.
The input model file size (MB): 7.70576
Initialized session in 3.487ms.
Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
count=1 curr=1289876

Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
count=50 first=1064709 curr=1013050 min=997794 max=1064709 avg=1.01013e+06 std=10693

Inference timings in us: Init: 3487, First inference: 1289876, Warmup (avg): 1.28988e+06, Inference (avg): 1.01013e+06
Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
Peak memory footprint (MB): init=2.11719 overall=314.234

Average Inference time on GPU using OpenGL backend: ~2.4s

WARNING: linker: "/data/data/org.picsart.bgremoval/tf_test/benchmark_tflite_model" unused DT entry: type 0xf arg 0x930
STARTING!
Duplicate flags: num_threads
Log parameter values verbosely: [0]
Graph: [keras_unet.tflite]
Enable op profiling: [1]
CSV File to export profiling data to: [benchmark_gpu.csv]
Use gpu: [1]
Loaded model keras_unet.tflite
INFO: Initialized TensorFlow Lite runtime.
INFO: Created TensorFlow Lite delegate for GPU.
ERROR: TensorDescriptor don't have selector with name - Slice
ERROR: Falling back to OpenGL
INFO: Initialized OpenGL-based API.
INFO: Created 1 GPU delegate kernels.
Applied GPU delegate, and the model graph will be completely executed by the delegate.
The input model file size (MB): 7.70576
Initialized session in 2101.73ms.
Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
count=1 curr=2503969

Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
count=50 first=2345432 curr=2325535 min=2284622 max=2461721 avg=2.35429e+06 std=31681

Inference timings in us: Init: 2101733, First inference: 2503969, Warmup (avg): 2.50397e+06, Inference (avg): 2.35429e+06
Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
Peak memory footprint (MB): init=102.875 overall=102.875

However, I could not get it to run using the OpenCL GPU backend you mentioned. I have read that this should be the default behavior, but if no backend is specified, it falls back to the OpenGL backend, as seen above. I also tried enforcing the OpenCL backend with --gpu_backend="cl", which failed:

WARNING: linker: "/data/data/org.picsart.bgremoval/tf_test/benchmark_tflite_model" unused DT entry: type 0xf arg 0x930
STARTING!
Duplicate flags: num_threads
Log parameter values verbosely: [0]
Graph: [keras_unet.tflite]
Use gpu: [1]
GPU backend: [cl]
Loaded model keras_unet.tflite
INFO: Initialized TensorFlow Lite runtime.
INFO: Created TensorFlow Lite delegate for GPU.
ERROR: TfLiteGpuDelegate Init: TensorDescriptor don't have selector with name - Slice
INFO: Created 0 GPU delegate kernels.
ERROR: TfLiteGpuDelegate Prepare: delegate is not initialized
ERROR: Node number 24 (TfLiteGpuDelegateV2) failed to prepare.

ERROR: Restored original execution plan after delegate application failure.
Failed to apply GPU delegate.
Benchmarking failed.

Here is the OP Profiling Output in case it provides any additional insight: benchmark_op_profiling.zip

So it looks like both the model and the Android application are fine, and the issue is somehow related to OpenCL on my device, as it works properly on yours.

jakobknapp

comment created time in 3 months

issue opened tensorflow/tensorflow

Simple TFLite UNet slower on mobile GPU than CPU

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
  • OS Platform and Distribution: Linux Ubuntu 18.04, Android 8.0.0
  • Mobile device if the issue happens on mobile device: Huawei P20 Lite (ANE-LX1)
  • TensorFlow installed from (source or binary): Binary
  • TensorFlow version (use command below): v2.2.0-rc4-8-g2b96f3662b
  • Python version: 3.6.9
  • CUDA/cuDNN version: CUDA 10.1
  • GPU model: Mali-T830 MP2

Describe the current behavior

I have written a simple image segmentation model based on UNet in Keras, which I want to use on an Android device (Huawei P20 Lite). It consists solely of Conv2D, Conv2DTranspose, and Concatenate layers, which should all be supported on mobile GPUs by TFLite according to the documentation. The model has ~1.9M parameters and expects input tensors of shape [1, 224, 224, 3] and type float32. It is not trained yet, as I want to check its performance first before addressing the accuracy. I exported the model using

converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()
with tf.io.gfile.GFile('trained models/keras_unet.tflite', 'wb') as f:
  f.write(tflite_model)

You can find the exported TFLite model attached further down.
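As a rough sanity check on the stated ~1.9M parameters: a Conv2D (or Conv2DTranspose) layer contributes filters × (kH × kW × in_channels) weights plus one bias per filter. A minimal sketch of the arithmetic, using hypothetical layer widths rather than the actual model's:

```python
def conv2d_params(in_ch, filters, kernel=3):
    """Weights + biases of one Conv2D / Conv2DTranspose layer."""
    return filters * (kernel * kernel * in_ch) + filters

# Hypothetical encoder widths, for illustration only.
widths = [3, 32, 64, 128]
total = sum(conv2d_params(i, o) for i, o in zip(widths, widths[1:]))
print(total)  # → 93248
```

Summing this formula over all layers of the real architecture yields the full parameter count.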

On the CPU of my device, inference takes about 1.3s on average using 6 threads. However, when employing the GPU delegate using

tfliteModel = FileUtil.loadMappedFile(this, MODEL_NAME_TF);
Interpreter.Options tfliteOptions = new Interpreter.Options();
GpuDelegate gpu_del = new GpuDelegate();
tfliteOptions.addDelegate(gpu_del);
tflite = new Interpreter(tfliteModel, tfliteOptions);

the performance drops to about 3.5s per inference, which is more than 2.5 times slower compared to CPU.
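The per-inference figures above come from timing calls in the Android app. The same warmup-then-average scheme can be sketched generically in Python; `average_latency` and the sleep workload below are illustrative stand-ins, not the actual measurement code:

```python
import time

def average_latency(fn, warmup=3, runs=10):
    """Average wall-clock latency of fn() in seconds, after a warmup."""
    for _ in range(warmup):
        fn()  # warmup runs hide one-time delegate/init costs
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    return (time.perf_counter() - start) / runs

# Stand-in workload; replace with the actual inference call.
latency = average_latency(lambda: time.sleep(0.01))
print(f"{latency * 1000:.1f} ms")
```

Warming up first matters here, since GPU delegate initialization (shader compilation) can dominate the first run.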

Describe the expected behavior

As the UNet model only consists of supported operations, the mobile GPU supports OpenGL ES 3.2, and model export and import as well as GPU delegate creation were done as suggested in the TFLite Android Tutorial and the GPU Delegate Guide, I would expect the UNet to run inference faster on the GPU than on the CPU.

I would be glad if you could point out any implementation issues on my side that could potentially cause the comparatively worse performance on the GPU. The guides and tutorials unfortunately do not offer any deeper insight. Is it possible that the model is simply too large for my GPU (and would there be an error if that were the case)?

Standalone code to reproduce the issue

The following model exhibits the described behavior: tflite_model.zip

Other info / logs

I have previously implemented the exact same model in PyTorch; the performance on the CPU is about the same. I then switched to Keras to utilize TFLite's GPU delegate. I used a subclassed approach to develop the UNet-like model in Keras.

I have read here that it might be beneficial to use a Tensor with c=4 instead to avoid memory copies. I tried adding a fourth channel to the model and the input images (shape [1, 224, 224, 4]), which made performance even worse.
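For illustration, padding an RGB input to c=4 just appends a constant fourth channel to every pixel. A plain-Python sketch; the helper name and dummy image are hypothetical, and in practice the model's input layer must be changed to match:

```python
def pad_to_four_channels(image, fill=0.0):
    """Append a constant 4th channel to an H x W x 3 nested-list image."""
    return [[px + [fill] for px in row] for row in image]

rgb = [[[0.1, 0.2, 0.3]] * 2] * 2   # 2 x 2 x 3 dummy image
rgba = pad_to_four_channels(rgb)
print(len(rgba[0][0]))              # → 4
```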

The performance has been tested using a custom Android application. The timings mentioned above refer to the following call: tflite.run(inputTensorBuffer.getBuffer(), outputProbabilityBuffer.getBuffer().rewind()); I also attempted to profile the TFLite model using the TFLite Model Benchmark Tool, but the tool failed to build correctly using Bazel. If this is relevant for solving the problem at hand, I will file another issue for it.

Finally, here are the outputs produced by the Python script for exporting the model and the Android test application: python_out.txt android_out.txt

created time in 3 months
