🫒

Optimizing Models with Microsoft Olive

yu shimohama

Published 2024/11/30

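Background

On an edge-device project I had been using Olive to fine-tune an SLM (small language model). Around 2024/11/25 (I am not sure of the exact date) the Olive documentation was reorganized and expanded, and I realized that Olive can do a lot more than I had been using it for.

Searching for just the word "Olive" still mostly returns the food, probably because the tool is not yet widely known, so I wrote this short article partly to help spread the word.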

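What is Olive?

Olive is an open-source model optimization tool that can also fine-tune generative AI models (train them for a specific use).

Incidentally, Olive apparently stands for ONNX LIVE.

These are the Olive features I personally like.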

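Easy to run from the CLI (e.g. olive quantize, olive auto-opt, olive finetune)
Both local and cloud options are available for loading datasets and for the compute resources used in training
With a config file (.yaml/.json), a single olive run performs fine-tuning, quantization, and optimization
LoRA or QLoRA can be selected as the fine-tuning algorithm
The basic input is a PyTorch model or a Hugging Face model; the output is an ONNX model
Models can be optimized for the AI accelerators (NPU, GPU, CPU) of the deployment target, provided by hardware vendors such as Qualcomm, AMD, Nvidia, and Intel

Below are the commands you are likely to use most often.
They are taken straight from the documentation 💦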

Installation

Run the following commands in a Python venv or a conda virtual environment.

pip install olive-ai[gpu,finetune]
pip install transformers==4.44.2
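
To confirm that the packages actually landed in the active environment, a small check like the following can help. This is a minimal sketch using only the standard library; the distribution names are the ones from the pip commands in this section, and the helper name installed_version is made up for illustration.

```python
from importlib import metadata

# Distribution names taken from the pip commands in this section.
PACKAGES = ["olive-ai", "transformers"]


def installed_version(dist_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    for name in PACKAGES:
        print(f"{name}: {installed_version(name) or 'not installed'}")
```

If transformers reports a version other than 4.44.2, the second pip command was probably run in a different environment.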

Automatic optimization

Before automatically optimizing a model hosted on Hugging Face, log in first.

huggingface-cli login --token {TOKEN}

Run the auto-opt command, which automatically downloads and optimizes Llama-3.2-1B-Instruct. Once the model is downloaded, Olive converts it to ONNX format for CPU devices, quantizes it (int4), and optimizes the graph.

olive auto-opt \
    --model_name_or_path meta-llama/Llama-3.2-1B-Instruct \
    --trust_remote_code \
    --output_path models/llama/ao \
    --device cpu \
    --provider CPUExecutionProvider \
    --use_ort_genai \
    --precision int4 \
    --log_level 1
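
After the command finishes, it can be worth checking what was written before moving on to inference. The sketch below simply lists the .onnx and .json files under the output path. It assumes the output lives under models/llama/ao (the --output_path above); the exact file names Olive produces are not assumed, and list_model_artifacts is a hypothetical helper.

```python
from pathlib import Path


def list_model_artifacts(folder):
    """Return sorted relative paths of .onnx and .json files under folder."""
    root = Path(folder)
    if not root.is_dir():
        return []
    found = [
        p.relative_to(root).as_posix()
        for p in root.rglob("*")
        if p.is_file() and p.suffix in {".onnx", ".json"}
    ]
    return sorted(found)


if __name__ == "__main__":
    # Path assumed from the --output_path passed to olive auto-opt.
    for name in list_model_artifacts("models/llama/ao"):
        print(name)
```

An empty result most likely means the command failed or wrote to a different directory.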

For inference, run a script such as the following app.py.
It uses onnxruntime_genai to run the model that was converted to ONNX format.

import onnxruntime_genai as og

model_folder = "models/llama/ao/model"

# Load the base model and tokenizer
model = og.Model(model_folder)
tokenizer = og.Tokenizer(model)
tokenizer_stream = tokenizer.create_stream()

# Set the max length to something sensible by default,
# since otherwise it will be set to the entire context length
search_options = {}
search_options['max_length'] = 200
search_options['past_present_share_buffer'] = False

chat_template = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>

{input}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
"""

text = input("Input: ")

# Keep asking for input phrases
while text != "exit":
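    if not text:
        print("Error, input cannot be empty")
        # A bare `exit` is a no-op; ask again instead of generating from an empty prompt
        text = input("Input: ")
        continue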

    # generate prompt (prompt template + input)
    prompt = f'{chat_template.format(input=text)}'

    # encode the prompt using the tokenizer
    input_tokens = tokenizer.encode(prompt)

    params = og.GeneratorParams(model)
    params.set_search_options(**search_options)
    params.input_ids = input_tokens
    generator = og.Generator(model, params)

    print("Output: ", end='', flush=True)
    # stream the output
    try:
        while not generator.is_done():
            generator.compute_logits()
            generator.generate_next_token()

            new_token = generator.get_next_tokens()[0]
            print(tokenizer_stream.decode(new_token), end='', flush=True)
    except KeyboardInterrupt:
        print("  --control+c pressed, aborting generation--")

    print()
    text = input("Input: ")
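
The prompt construction in app.py can be pulled out into a small function so it can be checked without loading the model. This is a sketch under the assumption that the same Llama 3 style template as in the script is wanted; build_prompt and its system parameter are names introduced here for illustration and are not part of onnxruntime_genai.

```python
# Same template as in app.py, with the system message made configurable.
LLAMA3_TEMPLATE = (
    "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
    "{system}<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n"
    "{user}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n"
)


def build_prompt(user, system="You are a helpful assistant"):
    """Fill the chat template; reject empty input instead of generating from it."""
    user = user.strip()
    if not user:
        raise ValueError("input cannot be empty")
    return LLAMA3_TEMPLATE.format(system=system, user=user)


if __name__ == "__main__":
    print(build_prompt("What is ONNX?"))
```

The returned string is what would be passed to tokenizer.encode in the script.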

Fine-tuning

Starting from a Hugging Face model, fine-tune with LoRA on the phrase_classification dataset.

olive finetune \
    --model_name_or_path meta-llama/Llama-3.2-1B-Instruct \
    --trust_remote_code \
    --output_path models/llama/ft \
    --data_name xxyyzzz/phrase_classification \
    --text_template "<|start_header_id|>user<|end_header_id|>\n{phrase}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n{tone}" \
    --method lora \
    --max_steps 100 \
    --log_level 1
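
The --text_template argument describes how each dataset row is turned into one training string: the {phrase} and {tone} placeholders are filled from the columns of the same name. The sketch below illustrates that substitution with plain Python string formatting. It is not Olive's internal implementation, the sample row is invented, and the \n here is a real newline, which may differ from how the shell passes the quoted argument.

```python
# Template mirroring the --text_template argument of the olive finetune command.
TEXT_TEMPLATE = (
    "<|start_header_id|>user<|end_header_id|>\n{phrase}<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n{tone}"
)


def render_example(row, template=TEXT_TEMPLATE):
    """Fill the template from a dataset row (a dict of column name -> value)."""
    return template.format(**row)


if __name__ == "__main__":
    # Hypothetical row; real values come from the phrase_classification dataset.
    sample = {"phrase": "I can't wait for the weekend!", "tone": "joy"}
    print(render_example(sample))
```

A row that lacks one of the placeholder columns raises KeyError in this sketch, which is a quick way to spot a mismatch between the template and the dataset.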

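Quantization

Olive lets you choose among several quantization methods.
Be aware that the input and output model formats depend on the method you choose.

The following runs AWQ quantization on the Llama-3.2-1B-Instruct model.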

olive quantize \
    --model_name_or_path meta-llama/Llama-3.2-1B-Instruct \
    --algorithm awq \
    --output_path models/llama/awq \
    --log_level 1
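
For readers new to quantization, the following toy example shows what storing weights in 4 bits means. It is plain symmetric round-to-nearest quantization, not AWQ and not what Olive runs internally; AWQ additionally uses activation statistics to decide how to scale weights. The function names are made up for illustration.

```python
def quantize_int4(weights):
    """Symmetric round-to-nearest quantization of floats to the int4 range [-8, 7]."""
    max_abs = max(abs(w) for w in weights)
    # One scale per group: the largest magnitude maps to 7.
    scale = max_abs / 7 if max_abs > 0 else 1.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from the int4 codes."""
    return [v * scale for v in q]


if __name__ == "__main__":
    weights = [0.7, -0.35, 0.1, 0.0]
    q, scale = quantize_int4(weights)
    restored = dequantize(q, scale)
    for w, code, r in zip(weights, q, restored):
        print(f"{w:+.3f} -> code {code:+d} -> {r:+.3f}")
```

Each restored weight differs from the original by at most half the scale, which is the accuracy cost traded for the smaller model.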

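Workflow (Pass)

A whole series of steps can be run with a single olive run command.

Before running it, first define the "quickstart workflow" in a YAML file.
The steps are grouped under passes: here, conversion to ONNX, quantization, and session parameter tuning.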

# quickstart-workflow.yaml
input_model:
  type: HfModel
  model_path: meta-llama/Llama-3.2-1B-Instruct
systems:
  local_system:
    type: LocalSystem
    accelerators:
      - device: cpu
        execution_providers:
          - CPUExecutionProvider
data_configs:
  - name: transformer_token_dummy_data
    type: TransformersTokenDummyDataContainer
passes:
  conversion:
    type: OnnxConversion
    target_opset: 16
    save_as_external_data: true
    all_tensors_to_one_file: true
    save_metadata_for_token_generation: true
  quantize:
    type: IncDynamicQuantization
  session_params_tuning:
    type: OrtSessionParamsTuning
    data_config: transformer_token_dummy_data
    io_bind: true
packaging_config:
  - type: Zipfile
    name: OutputModel
log_severity_level: 0
host: local_system
target: local_system
cache_dir: cache
output_dir: null

Once the .yaml file is ready, the following command runs every step of the workflow at once.

olive run --config quickstart-workflow.yaml
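
Since the config file can also be JSON, the same workflow can be generated from Python, which is handy when a value such as the model name needs to vary. The sketch below mirrors the YAML in this section as a dict and writes it out with the standard json module. The file name quickstart-workflow.json and the helper names are assumptions for illustration, and pass_order assumes the passes are applied in the order they are listed.

```python
import json

# Same settings as quickstart-workflow.yaml, expressed as a Python dict.
config = {
    "input_model": {"type": "HfModel", "model_path": "meta-llama/Llama-3.2-1B-Instruct"},
    "systems": {
        "local_system": {
            "type": "LocalSystem",
            "accelerators": [
                {"device": "cpu", "execution_providers": ["CPUExecutionProvider"]}
            ],
        }
    },
    "data_configs": [
        {"name": "transformer_token_dummy_data", "type": "TransformersTokenDummyDataContainer"}
    ],
    "passes": {
        "conversion": {
            "type": "OnnxConversion",
            "target_opset": 16,
            "save_as_external_data": True,
            "all_tensors_to_one_file": True,
            "save_metadata_for_token_generation": True,
        },
        "quantize": {"type": "IncDynamicQuantization"},
        "session_params_tuning": {
            "type": "OrtSessionParamsTuning",
            "data_config": "transformer_token_dummy_data",
            "io_bind": True,
        },
    },
    "packaging_config": [{"type": "Zipfile", "name": "OutputModel"}],
    "log_severity_level": 0,
    "host": "local_system",
    "target": "local_system",
    "cache_dir": "cache",
    "output_dir": None,
}


def pass_order(cfg):
    """Describe the passes in the order they are listed in the config."""
    return [f"{name} ({spec['type']})" for name, spec in cfg["passes"].items()]


def write_config(cfg, path):
    """Write the workflow config as JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(cfg, f, indent=2)


if __name__ == "__main__":
    write_config(config, "quickstart-workflow.json")
    print("\n".join(pass_order(config)))
```

The resulting file would then be passed to olive run --config in place of the YAML file.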

Summary

So, what did you think?

With Olive, the whole chain of optimization steps can be done in one go, which I find very convenient.

I expect to rely on it often, especially in edge-device work.

Headwaters

The tech blog of Headwaters Co., Ltd. We post on a wide range of Data & AI and app modernization topics, including AI agents, generative AI, LLMs, Azure services and certifications, IoT, and XR.