Introduction

My repository is linked below. It refactors the original implementation and adds support for video input, among other changes.
https://github.com/s4k10503/YOLOv4ONNX

Quantization

The reference model appears to be fp32, so I try quantizing it to reduce its size.
The following code performs the quantization.

from onnxruntime.quantization import quantize_dynamic, QuantType

model_fp32 = 'yolov4.onnx'
model_quant = 'yolov4_uint8.onnx'

# quantize_dynamic writes the quantized model to model_quant and returns None,
# so there is nothing to assign here.
quantize_dynamic(
    model_fp32,
    model_quant,
    weight_type=QuantType.QUInt8
)

Next, I simplify the quantized model.

onnxsim ./yolov4_uint8.onnx ./yolov4_uint8.onnx

In the end the model went from 245 MB to 61.9 MB. Note that setting weight_type to QuantType.QInt8 raises an error.
It seems the error does not occur with static quantization.
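
The size reduction can be checked with a few lines of standard-library Python. This is only a minimal sketch: the helper name size_report is hypothetical, sizes are assumed to be decimal megabytes, and the two file names are assumed to be the ones used above.

```python
import os


def size_report(fp32_path: str, quant_path: str) -> tuple[float, float, float]:
    """Return (fp32 size in MB, quantized size in MB, quantized / fp32 ratio)."""
    # Hypothetical helper for illustration; not part of onnxruntime or onnxsim.
    fp32_mb = os.path.getsize(fp32_path) / 1e6
    quant_mb = os.path.getsize(quant_path) / 1e6
    return fp32_mb, quant_mb, quant_mb / fp32_mb


if __name__ == "__main__":
    # Only runs when both model files exist in the current directory.
    if os.path.exists("yolov4.onnx") and os.path.exists("yolov4_uint8.onnx"):
        fp32_mb, quant_mb, ratio = size_report("yolov4.onnx", "yolov4_uint8.onnx")
        print(f"{fp32_mb:.1f} MB -> {quant_mb:.1f} MB ({ratio:.1%} of the original)")
```

With the sizes reported above, 61.9 / 245 is roughly 0.25, which is about what one would expect when 32-bit weights are stored as 8-bit values.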

Preprocessing

Preprocessing must match the specification used when the model was trained.
Here, the image is resized and padded.

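import cv2
import numpy as np


def image_preprocess(image: np.ndarray, target_size: tuple, gt_boxes: np.ndarray = None) -> np.ndarray:  # type: ignore
    """
    Preprocess the image by resizing and padding.

    Args:
        image (np.ndarray): Input image array of shape (height, width, channels).
        target_size (tuple): Target size for resizing (height, width).
        gt_boxes (np.ndarray, optional): Ground truth boxes; columns 0 and 2 are x
            coordinates, columns 1 and 3 are y coordinates. Modified in place
            when given. Defaults to None.

    Returns:
        np.ndarray: Resized and padded image with values scaled to the range 0 to 1.
            When gt_boxes is given, a tuple (image, gt_boxes) is returned instead.
    """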

    ih, iw = target_size
    h, w, _ = image.shape

    # Calculate the scale and new width/height
    scale = min(iw / w, ih / h)
    nw, nh = int(scale * w), int(scale * h)

    # Resize the image
    image_resized = cv2.resize(image, (nw, nh))

    # Create a padded image with the new dimensions
    image_padded = np.full(shape=[ih, iw, 3], fill_value=128.0)
    dw, dh = (iw - nw) // 2, (ih - nh) // 2
    image_padded[dh:nh + dh, dw:nw + dw, :] = image_resized
    image_padded = image_padded / 255.

    # If ground truth boxes are provided, adjust their coordinates
    if gt_boxes is not None:
        gt_boxes[:, [0, 2]] = gt_boxes[:, [0, 2]] * scale + dw
        gt_boxes[:, [1, 3]] = gt_boxes[:, [1, 3]] * scale + dh
        return image_padded, gt_boxes

    return image_padded

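Processing flow

1. Scale calculation

The input image dimensions are compared with the target dimensions, and the scaling factor scale is taken as the smaller of the two ratios so that neither the width nor the height exceeds the target size.
The new width nw and height nh are then computed from this scale.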
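
As a worked example of this step, the sketch below applies the same formula to one illustrative image size. The helper name letterbox_scale and the 832x624 image are assumptions made for the example only.

```python
def letterbox_scale(w: int, h: int, iw: int, ih: int) -> tuple[float, int, int]:
    """Return (scale, nw, nh) using the same formula as image_preprocess."""
    scale = min(iw / w, ih / h)  # the tighter of the two ratios wins
    nw, nh = int(scale * w), int(scale * h)
    return scale, nw, nh


# Illustrative 832x624 (width x height) image and a 416x416 target
scale, nw, nh = letterbox_scale(w=832, h=624, iw=416, ih=416)
print(scale, nw, nh)  # 0.5 416 312
```

The width is the limiting side here, so it fills the target exactly while the height ends up shorter and will need padding.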

2. Resizing the image

cv2.resize() resizes the image to the computed dimensions (nw, nh).

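3. Adding padding

The resized image is padded to fit the target (ih, iw) size. Padding is added evenly in the horizontal and vertical directions so that the resized image is centered.
For example, if the resized image is (120, 200) in (height, width) and the target size is (200, 200), the vertical padding dh is 40. The slicing then becomes:
- Vertical: from 40 to 160 (height 120 + start offset 40)
- Horizontal: from 0 to 200 (width 200 + start offset 0)
Finally, the pixel values of the padded image are divided by 255 so that the image data falls within the range 0 to 1.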
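
The (120, 200) example can be reproduced without OpenCV by using a dummy array in place of the resized image. This is a sketch for checking the slicing only; the all-255 stand-in image is an assumption chosen to make the result easy to read.

```python
import numpy as np

ih, iw = 200, 200          # target size (height, width)
nh, nw = 120, 200          # resized image size (height, width)

# Stand-in for the resized image: every pixel is 255 so it is easy to spot
image_resized = np.full((nh, nw, 3), 255.0)

# Same padding logic as in image_preprocess
image_padded = np.full(shape=[ih, iw, 3], fill_value=128.0)
dw, dh = (iw - nw) // 2, (ih - nh) // 2
image_padded[dh:nh + dh, dw:nw + dw, :] = image_resized
image_padded = image_padded / 255.

print(dw, dh)                    # 0 40
print(image_padded[39, 0, 0])    # gray padding (128 / 255), just above the image
print(image_padded[40, 0, 0])    # 1.0, first row of the image
```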

4. Adjusting ground truth boxes

This step is optional. When gt_boxes is provided, the box coordinates are also adjusted to match the resizing and padding, which keeps the object positions accurate.
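
The adjustment is a scale followed by a shift, which the pure-Python sketch below applies to a single box. The helper name adjust_box and the numbers are illustrative assumptions, and the box is assumed to be laid out as (xmin, ymin, xmax, ymax), which is what the column indexing in image_preprocess suggests.

```python
def adjust_box(box, scale, dw, dh):
    """Map (xmin, ymin, xmax, ymax) from the original image to the padded image."""
    xmin, ymin, xmax, ymax = box
    return (xmin * scale + dw, ymin * scale + dh,
            xmax * scale + dw, ymax * scale + dh)


# Illustrative values: an 832x624 image letterboxed to 416x416
# gives scale = 0.5, dw = 0, dh = (416 - 312) // 2 = 52.
print(adjust_box((100, 50, 300, 200), scale=0.5, dw=0, dh=52))
# (50.0, 77.0, 150.0, 152.0)
```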


Postprocessing

Note that the output shape depends on the specification used when the model was trained.

Postprocessing is broadly divided into Decode and NMS.

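Decode, part 1

In this model, a 416×416 input image produces three feature maps.
- Large feature map: [B, 52, 52, 3, 85]
- Medium feature map: [B, 26, 26, 3, 85]
- Small feature map: [B, 13, 13, 3, 85]
The shape is [B, N, N, A, 85], where the dimensions are as follows.
- B: batch size
- N: height and width of the feature map
- A: number of anchors (bounding box predictions per cell)
- 85: bounding box parameters (dx, dy, dw, dh, confidence) + number of classes
This model seems to have been trained on the COCO dataset, so the number of classes is 80.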
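
The numbers in these shapes can be checked with simple arithmetic. The sketch below derives the stride of each feature map and the number of boxes it predicts; the variable names are chosen for this example only.

```python
input_size = 416
grid_sizes = [52, 26, 13]   # large, medium, small feature maps
anchors_per_cell = 3
num_classes = 80

# How many input pixels one grid cell covers along each axis
strides = [input_size // n for n in grid_sizes]
# Every cell predicts one box per anchor
boxes_per_map = [n * n * anchors_per_cell for n in grid_sizes]
# dx, dy, dw, dh, confidence, then one score per class
channels = 5 + num_classes

print(strides)              # [8, 16, 32]
print(boxes_per_map)        # [8112, 2028, 507]
print(sum(boxes_per_map))   # 10647
print(channels)             # 85
```
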

import numpy as np
from scipy import special


def postprocess_bbbox(pred_bbox: np.ndarray, ANCHORS: np.ndarray, STRIDES: np.ndarray, XYSCALE: list[float] = [1, 1, 1]) -> np.ndarray:
    """
    Postprocess bounding box predictions to get final predictions.

    Args:
        pred_bbox (np.ndarray): Predicted bounding boxes.
        ANCHORS (np.ndarray): Anchor values for bounding boxes.
        STRIDES (np.ndarray): Stride values for bounding boxes.
        XYSCALE (list[float], optional): Scaling factors for bounding boxes. Defaults to [1, 1, 1].

    Returns:
        np.ndarray: Postprocessed bounding boxes.
    """

    for i, pred in enumerate(pred_bbox):
        conv_shape = pred.shape
        output_size = conv_shape[1]

        # Extract dx, dy, dw, dh
        conv_raw_dxdy = pred[:, :, :, :, 0:2]
        conv_raw_dwdh = pred[:, :, :, :, 2:4]

        # Generate the grid
        xy_grid = np.meshgrid(np.arange(output_size), np.arange(output_size))
        xy_grid = np.expand_dims(np.stack(xy_grid, axis=-1), axis=2)
        xy_grid = np.tile(np.expand_dims(xy_grid, axis=0), [1, 1, 1, 3, 1])
        xy_grid = xy_grid.astype(np.float32)

        # Calculate pred_xy and pred_wh
        pred_xy = ((special.expit(conv_raw_dxdy) *
                   XYSCALE[i]) - 0.5 * (XYSCALE[i] - 1) + xy_grid) * STRIDES[i]
        pred_wh = (np.exp(conv_raw_dwdh) * ANCHORS[i])
        pred[:, :, :, :, 0:4] = np.concatenate([pred_xy, pred_wh], axis=-1)

    # Reshape and concatenate the bounding boxes
    pred_bbox = [np.reshape(x, (-1, np.shape(x)[-1])) for x in pred_bbox]
    pred_bbox = np.concatenate(pred_bbox, axis=0)
    return pred_bbox

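Arguments

- pred_bbox: The predicted bounding box information, given as one NumPy array per feature map. Each array normally has the shape (batch size, grid size, grid size, number of anchors, 5 + number of classes), and the last dimension holds dx, dy, dw, dh, confidence, and the class probabilities.
- ANCHORS: A NumPy array holding the preset bounding box width and height for each anchor.
- STRIDES: An array of values giving the scale from each feature map to the input image. This determines which scale each prediction corresponds to.
- XYSCALE: Coefficients for adjusting the scaling of the coordinates. The default is [1, 1, 1], which leaves the scaling unchanged.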

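Processing flow

Bounding box conversion

The function first adds the grid offset to each predicted bounding box coordinate and computes its position in the coordinate system of the preprocessed input image.
- conv_raw_dxdy and conv_raw_dwdh extract dx, dy, dw, dh.
- special.expit (the sigmoid function) maps dx, dy into the range 0 to 1. This value is the relative position within a grid cell.
- XYSCALE stretches these values around the cell center (0.5).
- xy_grid holds the (x, y) index of each grid cell, that is, the cell's top-left corner in grid units. Adding the scaled dx, dy to it and multiplying by STRIDES gives the position in pixels on the input image.
- dw, dh are passed through np.exp and multiplied by ANCHORS to obtain the actual width and height.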
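
To make the formula concrete, the sketch below decodes a single prediction in pure Python. The helper name decode_cell, the grid position, the anchor size (12, 16), and the stride are illustrative assumptions, not values read from the model.

```python
import math


def sigmoid(v: float) -> float:
    """Plays the same role as special.expit for a single value."""
    return 1.0 / (1.0 + math.exp(-v))


def decode_cell(raw, grid_xy, anchor_wh, stride, xyscale=1.0):
    """Decode one raw prediction (dx, dy, dw, dh) into (x, y, w, h) in input pixels."""
    dx, dy, dw, dh = raw
    x = (sigmoid(dx) * xyscale - 0.5 * (xyscale - 1) + grid_xy[0]) * stride
    y = (sigmoid(dy) * xyscale - 0.5 * (xyscale - 1) + grid_xy[1]) * stride
    w = math.exp(dw) * anchor_wh[0]
    h = math.exp(dh) * anchor_wh[1]
    return x, y, w, h


# All-zero raw outputs land at the cell center with the anchor's own size.
x, y, w, h = decode_cell((0.0, 0.0, 0.0, 0.0), grid_xy=(3, 5),
                         anchor_wh=(12, 16), stride=8)
print(x, y, w, h)  # 28.0 44.0 12.0 16.0
```

With xyscale equal to 1 the decoded center can never leave its own cell. A value above 1 lets it reach slightly past the cell border.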
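
Reshaping and concatenating the bounding boxes

Finally, the predictions of each feature map are reshaped into a two-dimensional array of shape (number of boxes, 85), and the arrays from all feature maps are concatenated into a single array. This simplifies handling and gives a format suited to later steps such as NMS.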
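
The effect of this last step on the shapes can be seen with dummy data. The sketch below uses tiny grid sizes (4, 2, 1) instead of the real ones so that it runs instantly; those sizes are an assumption for illustration only.

```python
import numpy as np

# Dummy outputs for three tiny feature maps: [B, N, N, A, 85] with B = 1, A = 3
pred_bbox = [np.zeros((1, n, n, 3, 85), dtype=np.float32) for n in (4, 2, 1)]

# Same two lines as at the end of postprocess_bbbox
flat = [np.reshape(x, (-1, np.shape(x)[-1])) for x in pred_bbox]
merged = np.concatenate(flat, axis=0)

print([f.shape for f in flat])  # [(48, 85), (12, 85), (3, 85)]
print(merged.shape)             # (63, 85)
```

Because the batch axis is also folded into the first dimension, this form is simplest to work with when B is 1.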


Decode, part 2

def postprocess_boxes(pred_bbox: np.ndarray, org_img_shape: tuple, input_size: int, score_threshold: float) -> np.ndarray:
    """
    Postprocess bounding boxes by resizing and removing invalid ones.

    Args:
        pred_bbox (np.ndarray): Predicted bounding boxes.
        org_img_shape (tuple): Original image shape (height, width).
        input_size (int): Size of the input image after preprocessing.
        score_threshold (float): Threshold for valid bounding boxes.

    Returns:
        np.ndarray: Resized and filtered bounding boxes.
    """

    valid_scale = [0, np.inf]
    pred_bbox = np.array(pred_bbox)
    pred_xywh = pred_bbox[:, 0:4]
    pred_conf = pred_bbox[:, 4]
    pred_prob = pred_bbox[:, 5:]

    # (x, y, w, h) --> (xmin, ymin, xmax, ymax)
    pred_coor = np.concatenate([pred_xywh[:, :2] - pred_xywh[:, 2:] * 0.5,
                                pred_xywh[:, :2] + pred_xywh[:, 2:] * 0.5], axis=-1)

    # (xmin, ymin, xmax, ymax) -> (xmin_org, ymin_org, xmax_org, ymax_org)
    org_h, org_w = org_img_shape
    resize_ratio = min(input_size / org_w, input_size / org_h)
    dw = (input_size - resize_ratio * org_w) / 2
    dh = (input_size - resize_ratio * org_h) / 2

    # Resize coordinates
    pred_coor[:, 0::2] = 1.0 * (pred_coor[:, 0::2] - dw) / resize_ratio
    pred_coor[:, 1::2] = 1.0 * (pred_coor[:, 1::2] - dh) / resize_ratio

    # clip some boxes that are out of range
    pred_coor = np.concatenate([np.maximum(pred_coor[:, :2], [0, 0]),
                                np.minimum(pred_coor[:, 2:], [org_w - 1, org_h - 1])], axis=-1)
    invalid_mask = np.logical_or(
        (pred_coor[:, 0] > pred_coor[:, 2]), (pred_coor[:, 1] > pred_coor[:, 3]))
    pred_coor[invalid_mask] = 0

    # discard some invalid boxes
    bboxes_scale = np.sqrt(np.multiply.reduce(
        pred_coor[:, 2:4] - pred_coor[:, 0:2], axis=-1))
    scale_mask = np.logical_and(
        (valid_scale[0] < bboxes_scale), (bboxes_scale < valid_scale[1]))

    # discard some boxes with low scores
    classes = np.argmax(pred_prob, axis=-1)
    scores = pred_conf * pred_prob[np.arange(len(pred_coor)), classes]
    score_mask = scores > score_threshold
    mask = np.logical_and(scale_mask, score_mask)
    coors, scores, classes = pred_coor[mask], scores[mask], classes[mask]

    return np.concatenate([coors, scores[:, np.newaxis], classes[:, np.newaxis]], axis=-1)
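
The coordinate restoration in postprocess_boxes undoes the scale and shift applied during preprocessing. The pure-Python sketch below applies the same arithmetic to one box; the helper name restore_box and the image size are illustrative assumptions.

```python
def restore_box(box, org_w, org_h, input_size):
    """Map (xmin, ymin, xmax, ymax) from the padded input back to the original image."""
    resize_ratio = min(input_size / org_w, input_size / org_h)
    dw = (input_size - resize_ratio * org_w) / 2
    dh = (input_size - resize_ratio * org_h) / 2
    xmin, ymin, xmax, ymax = box
    return ((xmin - dw) / resize_ratio, (ymin - dh) / resize_ratio,
            (xmax - dw) / resize_ratio, (ymax - dh) / resize_ratio)


# Illustrative 832x624 original image and a 416x416 network input
print(restore_box((50, 77, 150, 152), org_w=832, org_h=624, input_size=416))
# (100.0, 50.0, 300.0, 200.0)
```

Note that image_preprocess truncates the resized size and the padding to integers while this step works in floats, so for some image sizes the restored coordinates may differ slightly from an exact inverse.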