[Success] Investigating and trying optimizations that avoid exceeding the 2 GB Protocol Buffers limit for high-resolution HITNet 720x1280

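The single op GatherElements_660 alone accounts for more than 1.8 GB of the model's total weights, so I check whether the model can be optimized after splitting it into the parts before and after GatherElements_660.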

sne4onnx \
--input_onnx_file_path hitnet_xl_sf_finalpass_from_tf_720x1280.onnx \
--output_onnx_file_path hitnet_xl_sf_finalpass_from_tf_720x1280_789_846.onnx \
--input_op_names 0,1 \
--output_op_names 789,846

import onnx
model = onnx.load('hitnet_xl_sf_finalpass_from_tf_720x1280_789_846.onnx')
model = onnx.shape_inference.infer_shapes(model)
onnx.save(model, 'hitnet_xl_sf_finalpass_from_tf_720x1280_789_846_shape.onnx')

onnxsim \
hitnet_xl_sf_finalpass_from_tf_720x1280_789_846_shape.onnx \
hitnet_xl_sf_finalpass_from_tf_720x1280_789_846_sim.onnx

PINTO

The area around GatherElements_660 in the model before optimization. The part in the red frame at the very bottom of the figure is GatherElements_660.

The state after extracting the range up to 789 and 846 and optimizing it with onnx-simplifier.

Next, I check whether optimization is possible after extracting only the graph from ReduceL1_663 (907), the op immediately after GatherElements_660, onward.

It fails with an error at the model extraction step.

sne4onnx \
--input_onnx_file_path hitnet_xl_sf_finalpass_from_tf_720x1280.onnx \
--output_onnx_file_path hitnet_xl_sf_finalpass_from_tf_720x1280_1464.onnx \
--input_op_names 0,1,907 \
--output_op_names 1464

Traceback (most recent call last):
  File "/usr/local/bin/sne4onnx", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/dist-packages/sne4onnx/onnx_network_extraction.py", line 106, in main
    extraction(
  File "/usr/local/lib/python3.8/dist-packages/sne4onnx/onnx_network_extraction.py", line 60, in extraction
    onnx.utils.extract_model(
  File "/usr/local/lib/python3.8/dist-packages/onnx/utils.py", line 166, in extract_model
    onnx.checker.check_model(output_path)
  File "/usr/local/lib/python3.8/dist-packages/onnx/checker.py", line 97, in check_model
    C.check_model_path(model)
onnx.onnx_cpp2py_export.checker.ValidationError: Field 'shape' of type is required but missing.

import onnx
model = onnx.load('hitnet_xl_sf_finalpass_from_tf_720x1280_1464.onnx')
model = onnx.shape_inference.infer_shapes(model)
onnx.save(model, 'hitnet_xl_sf_finalpass_from_tf_720x1280_1464_shape.onnx')

onnxsim \
hitnet_xl_sf_finalpass_from_tf_720x1280_1464_shape.onnx \
hitnet_xl_sf_finalpass_from_tf_720x1280_1464_sim.onnx

The cause of the extreme model bloat has been identified. Two Tile ops repeat their input to generate an enormous number of patch index values in INT64 format.
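
To make the growth concrete, here is a minimal sketch of how Tile scales the element count: each output dimension is the input dimension multiplied by its repeat count. The shapes and repeat counts below are hypothetical values chosen for illustration; they are not read from the HITNet graph, and `tile_output_shape` is a helper made up for this note.

```python
from math import prod


def tile_output_shape(input_shape, repeats):
    """Shape produced by an ONNX Tile op: dim[i] * repeats[i] for every axis."""
    if len(input_shape) != len(repeats):
        raise ValueError("repeats must have one entry per input dimension")
    return tuple(dim * rep for dim, rep in zip(input_shape, repeats))


# Hypothetical shapes for illustration only (not taken from the real model).
input_shape = (1, 1, 720, 1280)
repeats = (1, 16, 1, 1)

output_shape = tile_output_shape(input_shape, repeats)
element_count = prod(output_shape)

# Every element is stored at the width of the input type, so an INT64 input
# costs 8 bytes per repeated element.
int64_bytes = element_count * 8
print(output_shape, element_count, int64_bytes)
```

Chaining two Tile ops multiplies the repeat factors together, which is why a small index tensor can end up dominating the file size.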

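This is the specification of the Tile op used at opset=11. It states that the output of Tile has the same type as its input. Consequently, the INT64 output of the preceding Expand op is carried over as-is, and an enormous INT64 tensor is produced.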

In other words, if I leave the Expand output as INT64 and change the structure so that a Cast op inserted immediately after Expand downcasts it from INT64 to INT32, the overall model size should be roughly halved.
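
The expected saving is simple arithmetic: INT64 uses 8 bytes per element and INT32 uses 4. The sketch below assumes, as a rough approximation, that the roughly 1.8 GB reported above is entirely INT64 index data; the exact element count of the real tensor is not known here, and `tensor_nbytes` is a hypothetical helper.

```python
# Bytes per element for the two integer types involved.
ITEMSIZE = {"int64": 8, "int32": 4}


def tensor_nbytes(element_count, dtype):
    """Raw storage size in bytes of a dense tensor of the given dtype."""
    return element_count * ITEMSIZE[dtype]


# Assumption: about 1.8 GB of INT64 indices, taken from the figure quoted above.
approx_int64_bytes = 1_800_000_000
element_count = approx_int64_bytes // ITEMSIZE["int64"]

int64_bytes = tensor_nbytes(element_count, "int64")
int32_bytes = tensor_nbytes(element_count, "int32")

print(f"INT64: {int64_bytes / 1e9:.2f} GB")
print(f"INT32: {int32_bytes / 1e9:.2f} GB")
print(f"saved: {(int64_bytes - int32_bytes) / 1e9:.2f} GB")
```

Under that assumption the index data drops to about half its size, which is what brings the serialized model back under the 2 GB limit.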

In the end, the assumption was correct and it worked.

Inserting an INT32 Cast op right after Expand and then running onnx-simplifier avoided exceeding the 2 GB Protocol Buffers file size limit, and the export succeeded.

import numpy as np
import onnx
import onnx_graphsurgeon as gs

MODEL='hitnet_xl_sf_finalpass_from_tf_720x1280.onnx'

graph = gs.import_onnx(onnx.load(MODEL))

for graph_node in graph.nodes:
    if graph_node.name == 'Expand_653':
        """
        graph_node.o()

        Tile_654 (Tile)
            Inputs: [
                Variable (896): (shape=None, dtype=None)
                Variable (893): (shape=None, dtype=None)
            ]
            Outputs: [
                Variable (897): (shape=None, dtype=None)
            ]
        """
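        # Add a Cast node that converts the Expand output to INT32.
        cast_out = gs.Variable("cast_out", dtype=np.int32)
        cast_node = gs.Node(op="Cast", inputs=graph_node.outputs, outputs=[cast_out])
        cast_node.attrs["to"] = onnx.TensorProto.INT32
        graph.nodes.append(cast_node)

        # Rewire the first consumer of Expand (Tile_654) to read the Cast output.
        graph_node.o().inputs[0] = cast_node.outputs[0]
        break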

# Remove dangling nodes and restore topological order before exporting.
graph.cleanup().toposort()
new_graph = gs.export_onnx(graph)
inferred_graph = onnx.shape_inference.infer_shapes(new_graph)
onnx.save(inferred_graph, f"{MODEL.split('.')[0]}_cast.onnx")

Filed an issue.