🧠

Running TensorFlow Lite (YOLO & SSD) in Flutter

Published 2024/05/20

Flutter

What I did

I got TensorFlow Lite (SSD and YOLO) running in Flutter 💪

GitHub Repository

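References

I built this with the help of the article above (it was extremely helpful 🙇)
Since this implementation is based on that article, please read it first
In this post I mainly want to note down, as a memo to myself, the extras such as the GPU Delegate option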

Approximate performance (measured on an iPhone 15 Pro)

Note: the frame rate starts to fluctuate when the app runs for a long time

YOLOv5 (n model) ... 5 ~ 8 fps
SSD Mobilenet v2 ... 40 ~ 60 fps

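The frame rate may be lower than what native code would achieve,
but because libraries built for fast inference are available, it seems you can get a decent result fairly easily.

Screen while running YOLOv5 / Screen while running SSD MobileNet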

Libraries used

tflite_flutter (0.9.1)

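A library that speeds up inference by calling native C/C++ code through dart:ffi
It also supports the GPU Delegate (an option for using the GPU on a mobile device)
- Few articles cover this part, so I hope this one is useful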

tflite_flutter_helper (0.3.1)

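A library that handles image preprocessing and similar tasks
It is very convenient, but it is no longer maintained and cannot be used with tflite_flutter: ^0.9.2 or later ... (is there a successor, I wonder ...)
That said, implementing the image preprocessing myself made it slower, so this time I downgraded tflite_flutter to 0.9.1 and used this library
Note: a successor may already exist. If you know of one, please let me know 🙇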

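Implementation

I wanted an application where I could try two models, YOLO and SSD, so I built one that can switch between them.
As mentioned above, the code borrows heavily from the reference article, so please read that first.

Loading the model

The model is loaded by the code below (classifier.dart).
The overall structure is the same as in the reference article, so I will only explain the following points.

How to use the GPU (iOS Metal GPU only)
How to support multiple object detection models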

  Future<void> loadModel(bool useGPU, String modelName) async {
    try {
      // set GPU delegate
      var options = InterpreterOptions();
      if(useGPU){
        final gpuDelegate = GpuDelegate(
          options: GpuDelegateOptions(
            allowPrecisionLoss: true,
            waitType: TFLGpuDelegateWaitType.passive,
            enableQuantization: (modelName == "ssd_mobilenet_uint8.tflite") ? true : false,
          ),
        );
        options.addDelegate(gpuDelegate);
      }else{
        options.threads = 4;
      }

      // load model
      _interpreter = await Interpreter.fromAsset(
        modelName,
        options: options,
      );

      // get input and output tensor
      inputSize  = _interpreter!.getInputTensor(0).shape[1];
      tensorType = _interpreter!.getInputTensors()[0].type;

      // get output tensor      
      outputShapes = [];
      outputTypes  = [];

      for (final tensor in _interpreter!.getOutputTensors()) {
        outputShapes.add(tensor.shape);
        outputTypes.add(tensor.type);
      }

      // set decode function
      if (modelName == 'ssd_mobilenet_uint8.tflite') {
        decodeOutputsTensor = decodeSsdMobilenetOutputsTensor;
      } else {
        decodeOutputsTensor = decodeYoloOutputsTensor;
      }

    } on Exception catch (e) {
      logger.warning(e.toString());
    }
  }

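How to use the GPU (iOS Metal GPU only)

When you use a float-precision model, you will want to run it on the GPU.
To use the GPU (useGPU == true), you need to pass options like the ones below when creating the Interpreter,
but that alone produced an error and the GPU was not used ...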

  // set GPU delegate
  var options = InterpreterOptions();
  if(useGPU){
    final gpuDelegate = GpuDelegate(
      options: GpuDelegateOptions(
        allowPrecisionLoss: true,
        waitType: TFLGpuDelegateWaitType.passive,
        enableQuantization: (modelName == "ssd_mobilenet_uint8.tflite") ? true : false,
      ),
    );
    options.addDelegate(gpuDelegate);
  }else{
    options.threads = 4;
  }

  // load model
  _interpreter = await Interpreter.fromAsset(
    modelName,
    options: options,
  );

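I eventually arrived at this issue, which says
to use TensorFlowLiteCMetal.framework instead of TensorFlowLiteC.framework.
I should really have built it myself, but this time I cheated and borrowed it from a Unity sample project. (The issue appears to do the same.)

Placing it in ios/.symlinks/plugins/tflite_flutter/ios/ made the GPU usable 😀
With a float-precision model, it does get a little faster 😀

I cannot test the Android version myself, so I do not know how it behaves there ...
Sorry 🙏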

Supporting multiple object detection models

This application supports multiple models.
The first problem is that the input and output layers differ.
For YOLO and SSD they are as follows.

Differences in the input layer

Feature	SSD MobileNet	YOLO v5
Input size	300x300	640x640
Input channels	3 channels (RGB)	3 channels (RGB)
Pixel value normalization	No	Yes

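Differences in the output layer

Item	SSD MobileNet	YOLO v5
Bounding box	[ymin, xmin, ymax, xmax]	[x_center, y_center, width, height]
Score location	Stored separately in scoresList	Stored with the bounding box as [objectness_score]
Class ID location	Stored separately in classIdsList	Given as a list of class scores [class_scores]
Number of detections	Stored in numDetectionsList	All information is contained in a single flat list
Data length	Managed as several lists, each sized by the number of detections	A flat list with (5 + number of classes) elements per bounding box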

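It is hard to follow when written out in words 🤨
In short, the shapes of the input and output arrays differ from model to model, so the buffers have to be shaped per model,
and because the layout (ordering) of the data inside them also differs, that has to be handled too.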
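
To make the difference in array shapes concrete, here is a small sketch in plain Dart that computes how many values a tensor holds from its shape. The shapes are assumptions for illustration only: [1, 25200, 85] is what a 640x640 YOLOv5 model with 80 classes typically reports, and the four SSD shapes assume a model that returns 10 detections. The helper names elementCount and yoloStride are hypothetical; check the shapes your own model reports through getOutputTensors().

```dart
/// Number of scalar values in a tensor with the given shape.
int elementCount(List<int> shape) =>
    shape.fold<int>(1, (count, dim) => count * dim);

/// Number of values describing one YOLO candidate box:
/// x_center, y_center, width, height, objectness, then one score per class.
int yoloStride(int clsNum) => 5 + clsNum;

// Assumed example shapes (not read from the article's model files).
const List<int> yoloOutputShape = [1, 25200, 85];
const List<List<int>> ssdOutputShapes = [
  [1, 10, 4], // boxes
  [1, 10], // class ids
  [1, 10], // scores
  [1], // number of detections
];
```

Under these assumptions the YOLO output is one flat run of elementCount(yoloOutputShape) values that is walked in steps of yoloStride(80), while the SSD output is four separate, much smaller buffers.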

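The implementation is the following part of the code above.

The input and output layer shapes are recorded in the model itself, so I define them from that information.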

  inputSize  = _interpreter!.getInputTensor(0).shape[1];
  tensorType = _interpreter!.getInputTensors()[0].type;

  // get output tensor      
  outputShapes = [];
  outputTypes  = [];

  for (final tensor in _interpreter!.getOutputTensors()) {
    outputShapes.add(tensor.shape);
    outputTypes.add(tensor.type);
  }

Depending on the specified model file, the function that processes the output data is switched between YOLO and SSD.

  // set decode function
  if (modelName == 'ssd_mobilenet_uint8.tflite') {
    decodeOutputsTensor = decodeSsdMobilenetOutputsTensor;
  } else {
    decodeOutputsTensor = decodeYoloOutputsTensor;
  }

The two functions are as follows.

⇩ Output-layer processing for SSD

  List<Recognition> decodeSsdMobilenetOutputsTensor(Map<int, ByteBuffer> outputs, int transHeight, int transWidth) {
    
    // convert output to List<Recognition>
    Float32List boxesList         = outputs[0]!.asFloat32List();
    Float32List classIdsList      = outputs[1]!.asFloat32List();
    Float32List scoresList        = outputs[2]!.asFloat32List();
    Float32List numDetectionsList = outputs[3]!.asFloat32List();

    int numDetections = numDetectionsList[0].toInt();

    List<Recognition> recognitions = [];
    for (int i = 0; i < numDetections; i++) {
      double y = boxesList[i * 4 + 0];
      double x = boxesList[i * 4 + 1];
      double h = boxesList[i * 4 + 2] - y;
      double w = boxesList[i * 4 + 3] - x;
      Rect rect = Rect.fromLTWH(x*inputSize, y*inputSize, w*inputSize, h*inputSize);
      Rect transformRect = imageProcessor!.inverseTransformRect(rect, transHeight, transWidth);
      if(scoresList[i] < objConfTh) continue;
      recognitions.add(Recognition(i, classIdsList[i].toInt(), scoresList[i], transformRect, false));
    }
    return recognitions;
  }
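
The box conversion in the SSD decoder can be tried on its own without Flutter. The following is a minimal sketch, assuming the boxes are normalized to 0..1 as in the function above. Box is a hypothetical stand-in for dart:ui's Rect, and ssdBoxToPixels is a name introduced here for illustration.

```dart
/// Hypothetical stand-in for dart:ui's Rect so the sketch runs without Flutter.
class Box {
  final double left, top, width, height;
  const Box(this.left, this.top, this.width, this.height);
}

/// Converts the i-th SSD box, stored as normalized [ymin, xmin, ymax, xmax],
/// into a left/top/width/height box in input-image pixels.
Box ssdBoxToPixels(List<double> boxes, int i, int inputSize) {
  final ymin = boxes[i * 4 + 0];
  final xmin = boxes[i * 4 + 1];
  final ymax = boxes[i * 4 + 2];
  final xmax = boxes[i * 4 + 3];
  return Box(
    xmin * inputSize, // left
    ymin * inputSize, // top
    (xmax - xmin) * inputSize, // width
    (ymax - ymin) * inputSize, // height
  );
}
```

For example, ssdBoxToPixels([0.25, 0.5, 0.75, 1.0], 0, 300) describes a box whose top-left corner is at (150, 75) and whose width and height are both 150 pixels. Note that the y coordinate comes first in the SSD layout, which is easy to mix up.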

⇩ Output-layer processing for YOLO

  List<Recognition> decodeYoloOutputsTensor(Map<int, ByteBuffer> outputs, int transHeight, int transWidth) {
    Float32List results = outputs[0]!.asFloat32List();
    List<Recognition> recognitions = [];

    for (var i = 0; i < results.length; i += (5 + clsNum)) {
      if (results[i + 4] < objConfTh) continue;

      List<double> clsScores = results.sublist(i + 5, i + 5 + clsNum);
      double maxClsConf = clsScores.reduce(max);
      if (maxClsConf < clsConfTh) continue;

      int cls = clsScores.indexOf(maxClsConf);
      Rect rect = Rect.fromCenter(
        center: Offset(
          results[i] * inputSize,
          results[i + 1] * inputSize,
        ),
        width: results[i + 2] * inputSize,
        height: results[i + 3] * inputSize,
      );
      Rect transformRect = imageProcessor!.inverseTransformRect(rect, transHeight, transWidth);

      recognitions.add(Recognition(i, cls, maxClsConf, transformRect, true));
    }
    return recognitions;
  }

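The two models differ in how the data from the output layer has to be interpreted, but by splitting the processing into separate functions,
the final result is unified into the Recognition class (label, bounding box coordinates, and so on).
While debugging, whenever I thought "Huh? The bounding boxes are not being drawn ... 😭", the cause was almost always a mismatch in the input or output layer shapes ...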
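
When the boxes do not show up, it can help to run the flat-list walk of the YOLO decoder on a tiny hand-made buffer. The following is a simplified sketch in plain Dart, assuming the (5 + number of classes) layout described in the table above. YoloCandidate and decodeYoloFlat are hypothetical names, and the thresholds are passed as parameters instead of being fields as in the article's class.

```dart
/// One decoded candidate: best class, its score, and the box in input pixels.
class YoloCandidate {
  final int cls;
  final double score;
  final double cx, cy, w, h;
  const YoloCandidate(this.cls, this.score, this.cx, this.cy, this.w, this.h);
}

/// Walks a flat YOLO output in steps of (5 + clsNum) values and keeps the
/// rows that pass both the objectness and the class-score threshold.
List<YoloCandidate> decodeYoloFlat(
  List<double> results,
  int clsNum,
  int inputSize,
  double objConfTh,
  double clsConfTh,
) {
  final stride = 5 + clsNum;
  final candidates = <YoloCandidate>[];
  for (var i = 0; i + stride <= results.length; i += stride) {
    // Index 4 of each row is the objectness score.
    if (results[i + 4] < objConfTh) continue;

    // Find the class with the highest score in this row.
    var best = 0;
    for (var c = 1; c < clsNum; c++) {
      if (results[i + 5 + c] > results[i + 5 + best]) best = c;
    }
    final bestScore = results[i + 5 + best];
    if (bestScore < clsConfTh) continue;

    candidates.add(YoloCandidate(
      best,
      bestScore,
      results[i] * inputSize, // x_center
      results[i + 1] * inputSize, // y_center
      results[i + 2] * inputSize, // width
      results[i + 3] * inputSize, // height
    ));
  }
  return candidates;
}
```

Feeding it two rows with clsNum = 2, one with a high objectness score and one with a low one, should leave exactly one candidate. If the stride is wrong for the model, every row after the first is read from the wrong offset, which matches the "no boxes drawn" symptom described above.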

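Implementing model switching

I wanted to switch between YOLO and SSD, so the model has to be changeable while the app is running.
Switching models seems to require recreating the Interpreter, so the implementation below recreates the whole Classifier.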

lib/tflite/ml_camera.dart

  Future<void> changeModel(bool useGPU, String modelName) async {
    isPrepearing = true;
    isPredicting = false;
    stopIsolate();
    classifier = Classifier(
      useGPU:    useGPU,
      modelName: modelName,
    );
    await initIsolate(); // wait until the new isolate is ready before clearing the flag
    isPrepearing = false;
  }

Leaving work that was dispatched to the Isolate running would cause problems, so stopIsolate shuts the Isolate down and initIsolate starts it again after the new instance has been created.

  void stopIsolate() {
    receivePort.close();
    isolate.kill(priority: Isolate.immediate);
  }

  Future<void> initIsolate() async {
    receivePort = ReceivePort();
    isPredicting = false;
    isolate = await Isolate.spawn(entryPoint, receivePort.sendPort);
    sendPort = await receivePort.first as SendPort;
    isPrepearing = false;
  }
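
The handshake used in initIsolate (spawn the isolate, then wait for it to send back its own SendPort) can be tried in isolation with only dart:isolate. The following is a minimal sketch: the names entryPoint, receivePort and sendPort mirror the code above, but the worker body (doubling a number as a stand-in for inference) and the runOnce helper are assumptions made for illustration, not the article's implementation.

```dart
import 'dart:isolate';

/// Worker entry point: sends back its own SendPort, then answers requests.
void entryPoint(SendPort mainSendPort) {
  final workerPort = ReceivePort();
  mainSendPort.send(workerPort.sendPort);
  workerPort.listen((message) {
    // Each request is [payload, replyPort]; doubling stands in for inference.
    final payload = message[0] as int;
    final replyPort = message[1] as SendPort;
    replyPort.send(payload * 2);
  });
}

/// Spawns the worker, performs the handshake, sends one request,
/// waits for the reply, and then stops the worker.
Future<int> runOnce(int value) async {
  final receivePort = ReceivePort();
  final isolate = await Isolate.spawn(entryPoint, receivePort.sendPort);
  // The first message from the worker is its SendPort.
  final sendPort = await receivePort.first as SendPort;

  final replyPort = ReceivePort();
  sendPort.send([value, replyPort.sendPort]);
  final result = await replyPort.first as int;

  // Same shutdown as stopIsolate: kill the worker immediately.
  isolate.kill(priority: Isolate.immediate);
  return result;
}
```

Because the handshake is awaited, nothing is sent to the worker before its SendPort has arrived. The same reasoning applies when a model is switched: the new isolate should be ready before frames are handed to it.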

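Image preprocessing

As mentioned above, image preprocessing also differs between YOLO and SSD.
Below is the preprocessing implemented in lib/tflite/classifier.dart.
When the model is quantized, it matters whether the tensorBuffer holds integers or floats.
If I remember correctly, getting it wrong causes an error.

Here only SSD uses an integer-precision (uint8) model, which is why the implementation looks like this.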

    // convert image to TensorImage and preprocess it
    TensorImage inputImage = getPreprocessedImage(TensorImage.fromImage(image));

    var normalizedTensorBuffer = TensorBuffer.createDynamic(TfLiteType.float32);

    List<int> shape = [inputSize, inputSize, 3];

    // create input tensor
    if (tensorType == TfLiteType.uint8) {
      List<int> normalizedInputImage = [];
      for (var pixel in inputImage.tensorBuffer.getIntList()) {
        normalizedInputImage.add(pixel.toInt());
      }
      normalizedTensorBuffer = TensorBuffer.createDynamic(tensorType);
      normalizedTensorBuffer.loadList(normalizedInputImage, shape: shape);
    } else {
      List<double> normalizedInputImage = [];
      for (var pixel in inputImage.tensorBuffer.getDoubleList()) {
        normalizedInputImage.add(pixel/255);
      }
      normalizedTensorBuffer = TensorBuffer.createDynamic(tensorType);
      normalizedTensorBuffer.loadList(normalizedInputImage, shape: shape);
    }
    final inputs = [normalizedTensorBuffer.buffer];

Also, as the reference article points out, a YOLO model needs its pixel values normalized to the range 0 to 1, so do not forget that!
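
The two preprocessing branches boil down to the following, shown here as a minimal sketch in plain Dart with dart:typed_data instead of tflite_flutter_helper's TensorBuffer. The function names toNormalizedFloat32 and toUint8 are hypothetical, and the input is assumed to be a flat list of 0..255 RGB values.

```dart
import 'dart:typed_data';

/// Input for a float32 model (YOLO here): pixel values scaled to 0..1.
Float32List toNormalizedFloat32(List<int> pixels) {
  final out = Float32List(pixels.length);
  for (var i = 0; i < pixels.length; i++) {
    out[i] = pixels[i] / 255.0;
  }
  return out;
}

/// Input for a uint8 quantized model (SSD here): raw 0..255 values, unchanged.
Uint8List toUint8(List<int> pixels) => Uint8List.fromList(pixels);
```

Writing into a typed list allocated once at the final size avoids growing a regular List one element at a time, which may matter when this runs for every camera frame.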

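Closing thoughts

I focused on the parts the reference article does not cover, so I hope this is useful.
That said, I am still a beginner, so please point out anything that looks off. (Questions are welcome too!)

This was the first time I ran such a heavy workload on a mobile device.
Flutter is mostly single-threaded, and I am glad I learned that it has Isolates, a feature for running work in parallel 😀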

Discussion

keihsgw892kgsm

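Hello.
Thank you for sharing this interesting write-up!

I am also trying to run inference on the GPU and have a question about the procedure.
In the section "How to use the GPU (iOS Metal GPU only)" you wrote:

It says to use TensorFlowLiteCMetal.framework instead of TensorFlowLiteC.framework.
I should really have built it myself, but this time I cheated and borrowed it from a Unity sample project. (The issue appears to do the same.)

Placing it in ios/.symlinks/plugins/tflite_flutter/ios/ made the GPU usable 😀

Does this mean placing the two folders and two files from the Unity sample project into ios/.symlinks/plugins/tflite_flutter/ios/?

I would appreciate your guidance!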

プログラム書くパンダ

Thank you for your comment.
Sorry for not explaining it clearly.

The files to place there are
TensorFlowLiteCMetal.framework and TensorFlowLiteC.framework.
The latter may not be necessary.