TensorFlow Lite NNAPI delegate

iwatake

I plan to summarize everything from building to running the TensorFlow Lite NNAPI delegate.
The assumed use case here is C++ (NDK).

iwatake

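Available on Android 8.1 and later. In practice, Android Pie or later is recommended, since the set of supported ops was improved there.
Accelerates inference using the GPU, DSP, or NPU.
On some devices the performance does not meet expectations. You can maintain a list of devices on which NNAPI should not be used.
Usage is the same as for other delegates: before initializing the TensorFlow Lite interpreter, create an NNAPI delegate and add it.
8-bit quantized models give better performance.
Unsupported ops fall back to CPU execution. As a result, it can end up slower than running on the CPU alone. For such cases there is an option to run on the CPU from the start rather than splitting the graph between NNAPI and the CPU (Android 10 and later).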
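The version requirement and the "do not use NNAPI on these devices" list above can be combined into a single check made before the delegate is created. The following is a minimal sketch under assumptions: `ShouldUseNnapi`, `kMinNnapiApiLevel`, and the contents of the denylist are hypothetical names for illustration and are not part of TensorFlow Lite. The only fact relied on is that Android 8.1 corresponds to API level 27.

```cpp
#include <string>
#include <unordered_set>

// Android 8.1 is API level 27, the first release where NNAPI is available.
constexpr int kMinNnapiApiLevel = 27;

// Hypothetical helper: returns true only if the OS is new enough and the
// device model is not on the app's own denylist of badly performing devices.
bool ShouldUseNnapi(int api_level, const std::string& device_model,
                    const std::unordered_set<std::string>& denylist) {
    if (api_level < kMinNnapiApiLevel) {
        return false;  // NNAPI does not exist on this OS version
    }
    return denylist.find(device_model) == denylist.end();
}
```

The API level and model name would come from the Android side (for example `Build.VERSION.SDK_INT` and `Build.MODEL` passed down through JNI). If the check returns false, simply skip the delegate and run on the CPU.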

iwatake

Results of grepping the whole codebase for nnapi:

https://github.com/tensorflow/tensorflow/blob/v2.4.0/tensorflow/lite/BUILD#L264
https://github.com/tensorflow/tensorflow/blob/v2.4.0/tensorflow/lite/c/BUILD#L67
- No special configuration seems to be needed; it appears to be included in the build automatically.
https://github.com/tensorflow/tensorflow/blob/v2.4.0/tensorflow/lite/interpreter.h#L373
- Interpreter::UseNNAPI is deprecated. ~~Use tflite::NnApiDelegate()~~
- Use the StatefulNnApiDelegate class instead.
Is TfLiteInterpreterOptionsSetUseNNAPI experimental?
https://github.com/tensorflow/tensorflow/blob/v2.4.0/tensorflow/lite/delegates/nnapi/BUILD#L15
- On platforms other than Android (iOS, Windows), the build switches to a disabled (stub) implementation here. (What about Linux??)

So no special build seems to be required to use it.

iwatake

No particularly useful information here.

iwatake

Implemented with CreateNNAPIDelegate as a reference. (Memory release is not handled.)

tflite::StatefulNnApiDelegate::Options options;
// options.execution_preference = tflite::StatefulNnApiDelegate::Options::kSustainedSpeed;
// options.disallow_nnapi_cpu = true;
// options.allow_fp16 = true;
m_interpreter->ModifyGraphWithDelegate(new tflite::StatefulNnApiDelegate(options));

Results for mobilenet_v2_1.0_224.tflite on a Pixel 4a, with SetNumThreads(1):

without NNAPI delegate: 29.762 msec
with NNAPI delegate: 52.815 msec
with NNAPI delegate (options.disallow_nnapi_cpu = true): 41.273 msec

With the NNAPI delegate it is actually slower.
The processing time changed once CPU fallback was disabled, so NNAPI does appear to be in use.
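The scrap does not show how these timings were taken. As a rough sketch of one way to get comparable numbers, the helper below averages wall-clock time over several runs after a warm-up. `MeasureAverageMsec` is a hypothetical name; the warm-up matters because the first invocations of a delegate often include one-time setup cost.

```cpp
#include <chrono>
#include <functional>

// Runs `fn` a few times to warm up, then returns the mean wall-clock time
// per call in milliseconds over `timed_runs` calls.
double MeasureAverageMsec(const std::function<void()>& fn,
                          int warmup_runs, int timed_runs) {
    for (int i = 0; i < warmup_runs; ++i) {
        fn();
    }
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < timed_runs; ++i) {
        fn();
    }
    const auto end = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::milli> elapsed = end - start;
    return timed_runs > 0 ? elapsed.count() / timed_runs : 0.0;
}
```

In this setting `fn` would wrap the inference call, for example a lambda that calls `m_interpreter->Invoke()`.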

The Logcat output at runtime is shown below.
Devices such as the DSP and GPU are found, but IDevice::getSupportedOperations returns an error. ModifyGraphWithDelegate also reports an error.

I/tflite: Created TensorFlow Lite delegate for NNAPI.
I/Manager: DeviceManager::DeviceManager
I/Manager: findAvailableDevices
I/Manager: Found interface qti-default
I/Manager: Capab {.relaxedFloat32toFloat16PerformanceScalar = {.execTime = 0.600000, .powerUsage = 0.800000}, .relaxedFloat32toFloat16PerformanceTensor = {.execTime = 0.600000, .powerUsage = 0.800000}, .operandPerformance = [16]{{.type = FLOAT32, .info = {.execTime = 0.600000, .powerUsage = 0.800000}}, {.type = INT32, .info = {.execTime = 0.600000, .powerUsage = 0.800000}}, {.type = UINT32, .info = {.execTime = 0.600000, .powerUsage = 0.800000}}, {.type = TENSOR_FLOAT32, .info = {.execTime = 0.600000, .powerUsage = 0.800000}}, {.type = TENSOR_INT32, .info = {.execTime = 0.600000, .powerUsage = 0.800000}}, {.type = TENSOR_QUANT8_ASYMM, .info = {.execTime = 0.700000, .powerUsage = 0.700000}}, {.type = BOOL, .info = {.execTime = 0.600000, .powerUsage = 0.800000}}, {.type = TENSOR_QUANT16_SYMM, .info = {.execTime = 0.700000, .powerUsage = 0.700000}}, {.type = TENSOR_FLOAT16, .info = {.execTime = 0.600000, .powerUsage = 0.800000}}, {.type = TENSOR_BOOL8, .info = {.execTime = 0.600000, .powerUsage = 0.800000}}, {.type = FLOA
I/Manager: Found interface qti-dsp
I/Manager: Capab {.relaxedFloat32toFloat16PerformanceScalar = {.execTime = 2.000000, .powerUsage = 2.000000}, .relaxedFloat32toFloat16PerformanceTensor = {.execTime = 2.000000, .powerUsage = 2.000000}, .operandPerformance = [16]{{.type = FLOAT32, .info = {.execTime = 2.000000, .powerUsage = 2.000000}}, {.type = INT32, .info = {.execTime = 2.000000, .powerUsage = 2.000000}}, {.type = UINT32, .info = {.execTime = 2.000000, .powerUsage = 2.000000}}, {.type = TENSOR_FLOAT32, .info = {.execTime = 2.000000, .powerUsage = 2.000000}}, {.type = TENSOR_INT32, .info = {.execTime = 2.000000, .powerUsage = 2.000000}}, {.type = TENSOR_QUANT8_ASYMM, .info = {.execTime = 1.100000, .powerUsage = 1.100000}}, {.type = BOOL, .info = {.execTime = 2.000000, .powerUsage = 2.000000}}, {.type = TENSOR_QUANT16_SYMM, .info = {.execTime = 1.100000, .powerUsage = 1.100000}}, {.type = TENSOR_FLOAT16, .info = {.execTime = 2.000000, .powerUsage = 2.000000}}, {.type = TENSOR_BOOL8, .info = {.execTime = 2.000000, .powerUsage = 2.000000}}, {.type = FLOA
I/Manager: Found interface qti-gpu
I/Manager: Capab {.relaxedFloat32toFloat16PerformanceScalar = {.execTime = 1.100000, .powerUsage = 1.100000}, .relaxedFloat32toFloat16PerformanceTensor = {.execTime = 1.100000, .powerUsage = 1.100000}, .operandPerformance = [16]{{.type = FLOAT32, .info = {.execTime = 1.100000, .powerUsage = 1.100000}}, {.type = INT32, .info = {.execTime = 1.100000, .powerUsage = 1.100000}}, {.type = UINT32, .info = {.execTime = 1.100000, .powerUsage = 1.100000}}, {.type = TENSOR_FLOAT32, .info = {.execTime = 1.100000, .powerUsage = 1.100000}}, {.type = TENSOR_INT32, .info = {.execTime = 1.100000, .powerUsage = 1.100000}}, {.type = TENSOR_QUANT8_ASYMM, .info = {.execTime = 2.000000, .powerUsage = 2.000000}}, {.type = BOOL, .info = {.execTime = 1.100000, .powerUsage = 1.100000}}, {.type = TENSOR_QUANT16_SYMM, .info = {.execTime = 2.000000, .powerUsage = 2.000000}}, {.type = TENSOR_FLOAT16, .info = {.execTime = 1.100000, .powerUsage = 1.100000}}, {.type = TENSOR_BOOL8, .info = {.execTime = 1.100000, .powerUsage = 1.100000}}, {.type = FLOA
I/TypeManager: Failed to read /vendor/etc/nnapi_extensions_app_allowlist ; No app allowlisted for vendor extensions use.
E/ncehelpersampl: getSupportedOperations_1_2 failure: Status(EX_TRANSACTION_FAILED): 'FAILED_TRANSACTION: '
E/Manager: IDevice::getSupportedOperations returned the error GENERAL_FAILURE
E/ncehelpersampl: getSupportedOperations_1_2 failure: Status(EX_TRANSACTION_FAILED): 'FAILED_TRANSACTION: '
E/Manager: IDevice::getSupportedOperations returned the error GENERAL_FAILURE
E/ncehelpersampl: getSupportedOperations_1_2 failure: Status(EX_TRANSACTION_FAILED): 'FAILED_TRANSACTION: '
E/Manager: IDevice::getSupportedOperations returned the error GENERAL_FAILURE
W/ncehelpersample: type=1400 audit(0.0:93425): avc: denied { read } for path="/storage/emulated/0/Android/data/com.iwatake.viewandroidinferencehelpersample/files/Documents/resource/model/mobilenet_v2_1.0_224.tflite" dev="sdcardfs" ino=207110 scontext=u:r:hal_neuralnetworks_default:s0 tcontext=u:object_r:sdcardfs:s0 tclass=file permissive=0
E/tflite: ModifyGraphWithDelegate is disallowed when graph is immutable.
E/tflite: Ignoring failed application of the default TensorFlow Lite delegate indexed at 0.

iwatake

E/libc: Access denied finding property "ro.hardware.chipname"
E/tflite: ModifyGraphWithDelegate is disallowed when graph is immutable.
E/tflite: Ignoring failed application of the default TensorFlow Lite delegate indexed at 0.

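The errors above also appeared with the GPU delegate, so they do not seem to be the problem.

I also wondered whether the model is so light that transfer time dominates, but the GPU delegate ran in 9.572 [msec], so I would expect NNAPI to be at least as fast.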

iwatake

Judging from this, No app allowlisted for vendor extensions use also seems harmless.

So far, IDevice::getSupportedOperations returned the error GENERAL_FAILURE looks like the real problem.
What could it be? Permissions?

iwatake

On a Galaxy S7 (Android 8.0.0), the log I/tflite: Created TensorFlow Lite delegate for NNAPI. appears when the delegate is created, but everything runs without errors.
NNAPI is supported from Android 8.1 onward, so the delegate is probably being ignored entirely.

iwatake

same here

iwatake

For NNAPI to take full advantage of a device’s hardware, drivers for the hardware must be present. On newer devices from Samsung running Android Q (Android 10), the drivers are available out of the box. For older devices, developers can compile and install the drivers themselves. To complete this procedure root access to the device is needed. The source code for the drivers is available at https://github.com/ARM-software/android-nn-driver.

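In other words, the device driver must be installed. If it is not, you have to compile and install it yourself, which requires rooting the device.
Put the other way around, if the driver is properly installed, it should work without rooting.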

iwatake

Updated from Android 10 to Android 11, but nothing changed.

iwatake

SDK Version = 30
NDK Version = 22.0.7026061
Updated to the versions above, but nothing changed.

The error after the update is:

E/VersionedInterfaces: getSupportedOperations_1_3 failure: Status(EX_TRANSACTION_FAILED): 'FAILED_TRANSACTION: '
E/Manager: IDevice::getSupportedOperations returned the error GENERAL_FAILURE

iwatake

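In this flow, obtaining the available hardware in Initialize seems to succeed. The error occurs in the subsequent getSupportedOperations. So perhaps the device driver itself is fine?

According to this, the meaning is "GENERAL_FAILURE if there is an unspecified error". Taking EX_TRANSACTION_FAILED, which looks like the internal error code, at face value, it is a communication error somewhere.

Maybe reading the model file directly from external storage is the problem? Let's try Android's asset management instead.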

iwatake

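Bingo!!

The problem was access to the model file. This is consistent with the avc: denied { read } line in the earlier Logcat output, where the hal_neuralnetworks_default process was denied read access to the .tflite file on external storage.
When I switched the tflite model loading from BuildFromFile to first reading the file into a buffer and then creating the model with BuildFromBuffer, the IDevice::getSupportedOperations returned the error GENERAL_FAILURE error went away.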

	/*** Create network ***/
#if 0
	m_model = tflite::FlatBufferModel::BuildFromFile(modelFilename.c_str());
#else
	std::ifstream ifs(modelFilename, std::ios::binary);
	if (ifs) {
		ifs >> std::noskipws;
		(void)std::copy(std::istream_iterator<char>(ifs), std::istream_iterator<char>(), back_inserter(m_modelBuffer));
	} else {
		PRINT_E("Failed to read model (%s)\n", modelFilename.c_str());
		return RET_ERR;
	}
	ifs.close();

	m_model = tflite::FlatBufferModel::BuildFromBuffer(m_modelBuffer.data(), m_modelBuffer.size());

#endif
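The file-to-buffer step above can also be written with `std::istreambuf_iterator`, which reads raw bytes without needing `std::noskipws`. This is only a sketch of an alternative; `ReadFileToBuffer` is a hypothetical helper name, not something from the original code or from TensorFlow Lite.

```cpp
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical helper: reads the whole file at `path` into `buffer` as raw bytes.
// Returns false (leaving `buffer` untouched) if the file cannot be opened.
bool ReadFileToBuffer(const std::string& path, std::vector<char>& buffer) {
    std::ifstream ifs(path, std::ios::binary);
    if (!ifs) {
        return false;
    }
    // istreambuf_iterator works on the underlying stream buffer, so no
    // whitespace skipping or formatting is applied to the bytes.
    buffer.assign(std::istreambuf_iterator<char>(ifs),
                  std::istreambuf_iterator<char>());
    return true;
}
```

Note that `FlatBufferModel::BuildFromBuffer` does not copy the data, so the buffer has to outlive the model. Keeping it as a member such as `m_modelBuffer`, as the original code does, satisfies that.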

iwatake

mobilenet_v1_1.0_224_quant.tflite on Pixel 4a

NNAPI delegate (qti-dsp): 6.554
NNAPI delegate (qti-gpu): 19.597
GPU delegate: 14.582
CPU x 1 (w/o delegate): 19.286

mobilenet_v2_1.0_224.tflite on Pixel 4a

NNAPI delegate (qti-dsp): 39.771
NNAPI delegate (qti-gpu): 20.822
GPU delegate: 10.412
CPU x 1 (w/o delegate): 29.770

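With the quantized model, NNAPI (DSP) is overwhelmingly faster, and the CPU load is only about 13%.
Is NNAPI (GPU) running at the same speed as the CPU because it is falling back to CPU execution? (CPU usage was also 100%.)
With the FP32 model, NNAPI (DSP) became slower. NNAPI (GPU) is faster than the CPU, so it has some benefit.

Comparing the GPU delegate with NNAPI (GPU), the GPU delegate is faster.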
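The scrap does not show how qti-dsp and qti-gpu were selected for the measurements above. One plausible way, assuming the `accelerator_name` field of `StatefulNnApiDelegate::Options` was used, is sketched below. `ApplyNnapiDelegate` is a hypothetical helper; it also keeps the delegate in a `std::unique_ptr`, which addresses the memory release that the earlier `new`-based snippet left unhandled. It needs the TensorFlow Lite headers and is not buildable standalone.

```cpp
#include <memory>

#include "tensorflow/lite/delegates/nnapi/nnapi_delegate.h"
#include "tensorflow/lite/interpreter.h"

// Hypothetical helper: applies an NNAPI delegate bound to one named accelerator.
// Returns nullptr if the delegate could not be applied to the graph.
std::unique_ptr<tflite::StatefulNnApiDelegate> ApplyNnapiDelegate(
    tflite::Interpreter& interpreter, const char* accelerator_name) {
    tflite::StatefulNnApiDelegate::Options options;
    // Assumption: the name matches an interface reported in Logcat,
    // e.g. "qti-dsp" or "qti-gpu".
    options.accelerator_name = accelerator_name;

    auto delegate = std::make_unique<tflite::StatefulNnApiDelegate>(options);
    if (interpreter.ModifyGraphWithDelegate(delegate.get()) != kTfLiteOk) {
        return nullptr;
    }
    return delegate;
}
```

The caller should hold the returned pointer as a member next to the interpreter and destroy the interpreter first, since the delegate must stay alive for as long as the interpreter uses it.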

This scrap was closed on 2021/02/15.
