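Converting Stable Diffusion to a Core ML model

Published 2024/07/17

This article summarizes the steps and tips for converting Stable Diffusion models to Core ML format with Core ML Stable Diffusion (ml-stable-diffusion).

Environment setup

Create a virtual environment with conda.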

conda create -n coreml_stable_diffusion python=3.8 -y
conda activate coreml_stable_diffusion

Install the required packages.

cd /path/to/cloned/ml-stable-diffusion/repository
pip install -e .

The requirements.txt in the ml-stable-diffusion repository specifies the coremltools version as

coremltools>=7.0

However, as of July 2024, running the torch2coreml command in an ml-stable-diffusion environment with coremltools 7.2 or later (including 8.0b1) fails with the following error:

AttributeError: 'CacheDoublyLinkedList' object has no attribute 'index'

The index method of CacheDoublyLinkedList appears to have been removed in 7.2 and later.

As a workaround, use coremltools 7.1.
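
Since a conversion can run for a long time before hitting this error, it may help to check the installed coremltools version up front. The following is a minimal sketch: the helper names `parse_release` and `is_affected` are hypothetical, and the 7.2 threshold only encodes the observation above, not an official compatibility statement.

```python
import re
from importlib import metadata


def parse_release(version: str) -> tuple:
    """Return the leading numeric components, e.g. '8.0b1' -> (8, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def is_affected(version: str) -> bool:
    """True for coremltools 7.2 or later, the range observed to fail above."""
    return parse_release(version) >= (7, 2)


if __name__ == "__main__":
    try:
        installed = metadata.version("coremltools")
    except metadata.PackageNotFoundError:
        print("coremltools is not installed")
    else:
        status = "affected, consider pinning 7.1" if is_affected(installed) else "ok"
        print(f"coremltools {installed}: {status}")
```

If the check reports an affected version, pinning with `pip install coremltools==7.1` applies the workaround.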

Converting the model to Core ML format

Use the torch2coreml module to convert a Stable Diffusion model to Core ML format.

A minimal set of options looks like this:

python -m python_coreml_stable_diffusion.torch2coreml \
  --convert-unet \
  --convert-text-encoder \
  --convert-vae-decoder \
  --model-version <model-version-string-from-hub> \
  -o <output-mlpackages-directory>

This writes the mlpackage files to <output-mlpackages-directory>.

The README of the ml-stable-diffusion repository has a section explaining this conversion step.

Recommended options for iOS

To optimize performance on Apple Silicon you need --attention-implementation and --compute-unit, for iOS (mobile) you need --chunk-unet, and to obtain compiled models you need --bundle-resources-for-swift-cli. A realistic minimal command for iOS therefore looks like this (when converting v2.1 with SPLIT_EINSUM_V2):

python -m python_coreml_stable_diffusion.torch2coreml \
  --convert-unet \
  --convert-text-encoder \
  --convert-vae-decoder \
  --chunk-unet \
  --attention-implementation SPLIT_EINSUM_V2 \
  --compute-unit CPU_AND_NE \
  --bundle-resources-for-swift-cli \
  --model-version stabilityai/stable-diffusion-2-1 \
  -o models/stable-diffusion-2-1/split_einsum_v2
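
When converting several models, it can be convenient to assemble this command from Python instead of retyping it. The sketch below is only an illustration: `build_torch2coreml_args` is a hypothetical helper, not part of ml-stable-diffusion, and it uses only the flags that appear in the command above.

```python
import sys


def build_torch2coreml_args(model_version, output_dir,
                            attention_implementation="SPLIT_EINSUM_V2",
                            compute_unit="CPU_AND_NE",
                            chunk_unet=True,
                            bundle_resources=True):
    """Assemble the argument list for the iOS-oriented conversion shown above."""
    args = [
        sys.executable, "-m", "python_coreml_stable_diffusion.torch2coreml",
        "--convert-unet",
        "--convert-text-encoder",
        "--convert-vae-decoder",
        "--attention-implementation", attention_implementation,
        "--compute-unit", compute_unit,
        "--model-version", model_version,
        "-o", output_dir,
    ]
    if chunk_unet:
        args.append("--chunk-unet")
    if bundle_resources:
        args.append("--bundle-resources-for-swift-cli")
    return args


if __name__ == "__main__":
    cmd = build_torch2coreml_args(
        "stabilityai/stable-diffusion-2-1",
        "models/stable-diffusion-2-1/split_einsum_v2",
    )
    # Print only; to actually run the conversion, pass cmd to
    # subprocess.run(cmd, check=True) inside the conda environment.
    print(" ".join(cmd))
```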

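`--chunk-unet` option

Splits the Unet model into chunks. With uncompressed models or XL models, skipping chunking means startup either takes an extremely long time or crashes from running out of memory.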

From the README:

Splits the Unet model in two approximately equal chunks (each with less than 1GB of weights) for mobile-friendly deployment. This is required for Neural Engine deployment on iOS and iPadOS if weights are not quantized to 6-bits or less (--quantize-nbits {2,4,6}). This is not required for macOS. Swift CLI is able to consume both the chunked and regular versions of the Unet model but prioritizes the former. Note that chunked unet is not compatible with the Python pipeline because Python pipeline is intended for macOS only.
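
The README rule for when `--chunk-unet` is required can be written down as a small predicate. This is a sketch of that rule only: `needs_chunk_unet` is a hypothetical helper, the platform strings are made up for illustration, and treating `ALL` as a Neural Engine target is an assumption.

```python
from typing import Optional


def needs_chunk_unet(platform: str, compute_unit: str,
                     quantize_nbits: Optional[int] = None) -> bool:
    """Encode the README rule: chunking is required for Neural Engine
    deployment on iOS/iPadOS unless weights are quantized to 6 bits or less."""
    on_mobile = platform.lower() in {"ios", "ipados"}
    # Assumption: ALL may also schedule work on the Neural Engine.
    targets_neural_engine = compute_unit in {"CPU_AND_NE", "ALL"}
    quantized_to_6_bits_or_less = quantize_nbits is not None and quantize_nbits <= 6
    return on_mobile and targets_neural_engine and not quantized_to_6_bits_or_less


if __name__ == "__main__":
    print(needs_chunk_unet("iOS", "CPU_AND_NE"))         # unquantized on iOS
    print(needs_chunk_unet("iOS", "CPU_AND_NE", 6))      # 6-bit quantized
    print(needs_chunk_unet("macOS", "CPU_AND_NE"))       # macOS
```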

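`--attention-implementation` option

The default is SPLIT_EINSUM. ORIGINAL and SPLIT_EINSUM_V2 can also be specified. SPLIT_EINSUM_V2 is an attention implementation added in ml-stable-diffusion v1.0.0 that, according to the README, improves Neural Engine performance on mobile devices by 10-30%.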

https://note.com/shu223/n/n993d575e1f16

From the README:

Defaults to SPLIT_EINSUM which is the implementation described in Deploying Transformers on the Apple Neural Engine. --attention-implementation SPLIT_EINSUM_V2 yields 10-30% improvement for mobile devices, still targeting the Neural Engine. --attention-implementation ORIGINAL will switch to an alternative implementation that should be used for CPU or GPU deployment on some Mac devices. Please refer to the Performance Benchmark section for further guidance.

`--bundle-resources-for-swift-cli` option

Compiles all four models and bundles them, together with the resources needed for text tokenization, into <output-mlpackages-directory>/Resources.
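
After a conversion finishes, a quick way to confirm that bundling worked is to list what ended up in the Resources directory. The sketch below is a hypothetical helper, not part of ml-stable-diffusion. It assumes the compiled models use the `.mlmodelc` extension and that the tokenizer files are named `vocab.json` and `merges.txt`, as the `--help` text for this option states.

```python
from pathlib import Path


def list_bundled_resources(output_dir):
    """Return (compiled_models, tokenizer_files) found under <output_dir>/Resources."""
    resources = Path(output_dir) / "Resources"
    if not resources.is_dir():
        return [], []
    # Compiled Core ML models are directories ending in .mlmodelc.
    compiled_models = sorted(p.name for p in resources.glob("*.mlmodelc"))
    tokenizer_files = sorted(
        name for name in ("vocab.json", "merges.txt")
        if (resources / name).is_file()
    )
    return compiled_models, tokenizer_files


if __name__ == "__main__":
    models, tokenizer = list_bundled_resources(
        "models/stable-diffusion-2-1/split_einsum_v2"
    )
    print("compiled models:", models)
    print("tokenizer files:", tokenizer)
```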

Converting XL models

python -m python_coreml_stable_diffusion.torch2coreml \
  --xl-version \
  --convert-unet \
  --convert-vae-decoder \
  --convert-text-encoder \
  --chunk-unet \
  --bundle-resources-for-swift-cli \
  --attention-implementation SPLIT_EINSUM \
  --compute-unit CPU_AND_NE \
  --latent-h 96 --latent-w 96 \
  --model-version stabilityai/stable-diffusion-xl-base-1.0 \
  -o models/stable-diffusion-xl-base-1.0/split_einsum

The key points are:

Specify --xl-version
Use SPLIT_EINSUM rather than SPLIT_EINSUM_V2 for --attention-implementation

`--xl-version` option

The README describes it as follows:

Additional argument to pass to the conversion script when specifying an XL model

About the `--xl-version` option

Without the --xl-version option, image generation crashes at the following point in Unet.swift of the stable-diffusion package:

var latentTimeIdDescription: MLFeatureDescription {
    try! models.first!.perform { model in
        model.modelDescription.inputDescriptionsByName["time_ids"]!
    }
}

Reading the torch2coreml code, there is an implementation that adds time_ids when this option is specified:

if args.xl_version:
    ...
    if ...:
        ...
        add_time_ids = list(original_size + crops_coords_top_left + (aesthetic_score,))
        add_neg_time_ids = list(original_size + crops_coords_top_left + (negative_aesthetic_score,))
    else:
        add_time_ids = list(original_size + crops_coords_top_left + target_size)
        add_neg_time_ids = list(original_size + crops_coords_top_left + target_size)

    time_ids = [
        add_neg_time_ids,
        add_time_ids
    ]
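
To make the shape of `time_ids` concrete, here is a small worked example of the `else` branch above. The sizes are illustrative assumptions, not values taken from torch2coreml. The only point is that the tuples are concatenated with `+`, so each entry becomes a flat list of six numbers.

```python
# Illustrative values only (assumptions, not taken from torch2coreml).
original_size = (768, 768)
crops_coords_top_left = (0, 0)
target_size = (768, 768)

# Tuple + tuple concatenates, giving six values per entry.
add_time_ids = list(original_size + crops_coords_top_left + target_size)
add_neg_time_ids = list(original_size + crops_coords_top_left + target_size)

# Negative entry first, then the positive one, as in the excerpt above.
time_ids = [add_neg_time_ids, add_time_ids]

print(time_ids)
# [[768, 768, 0, 0, 768, 768], [768, 768, 0, 0, 768, 768]]
```

A model converted without --xl-version has no `time_ids` input at all, which is consistent with the force-unwrap crash in the Swift code shown earlier.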

`--attention-implementation` option

For XL models, specifying SPLIT_EINSUM_V2 for --attention-implementation is not recommended.

SPLIT_EINSUM_V2 is not recommended for Stable Diffusion XL because of prohibitively long compilation time

When running on iOS, specify SPLIT_EINSUM.

--attention-implementation: ORIGINAL is recommended for cpuAndGPU for deployment on Mac

--attention-implementation: SPLIT_EINSUM is recommended for cpuAndNeuralEngine for deployment on iPhone & iPad
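
The two recommendations quoted above amount to a lookup from compute unit to attention implementation. The sketch below is a hypothetical helper for XL conversions. Mapping the Swift-side names `cpuAndGPU` and `cpuAndNeuralEngine` to the `--compute-unit` values `CPU_AND_GPU` and `CPU_AND_NE` is an assumption.

```python
# Recommendations for Stable Diffusion XL, keyed by the --compute-unit value.
# Assumption: cpuAndGPU -> CPU_AND_GPU, cpuAndNeuralEngine -> CPU_AND_NE.
XL_ATTENTION_BY_COMPUTE_UNIT = {
    "CPU_AND_GPU": "ORIGINAL",      # deployment on Mac
    "CPU_AND_NE": "SPLIT_EINSUM",   # deployment on iPhone and iPad
}


def recommended_xl_attention(compute_unit):
    """Return the recommended --attention-implementation value for an XL model."""
    try:
        return XL_ATTENTION_BY_COMPUTE_UNIT[compute_unit]
    except KeyError:
        raise ValueError(
            f"no recommendation recorded for compute unit {compute_unit!r}"
        ) from None


if __name__ == "__main__":
    for unit in ("CPU_AND_GPU", "CPU_AND_NE"):
        print(unit, "->", recommended_xl_attention(unit))
```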

`--latent-h`, `--latent-w` options

--latent-h 96 --latent-w 96 sets the output size to 768x768.

Tip: Adding --latent-h 96 --latent-w 96 is recommended for iOS and iPadOS deployment which leads to 768x768 generation as opposed to the default 1024x1024.
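
The relation between the latent size and the generated image size can be sketched as below. It assumes the usual 8x spatial downsampling of the Stable Diffusion VAE, which matches the numbers in the tip above (96 gives 768, and the default 1024 corresponds to 128). The function names are hypothetical.

```python
# Assumption: the VAE downsamples each spatial dimension by a factor of 8.
VAE_SCALE_FACTOR = 8


def latent_to_pixels(latent_size, scale=VAE_SCALE_FACTOR):
    """Pixel dimension produced by a given --latent-h / --latent-w value."""
    return latent_size * scale


def pixels_to_latent(pixels, scale=VAE_SCALE_FACTOR):
    """Latent dimension to pass for a desired pixel dimension."""
    if pixels % scale != 0:
        raise ValueError(f"{pixels} is not a multiple of {scale}")
    return pixels // scale


if __name__ == "__main__":
    print(latent_to_pixels(96))    # 768
    print(pixels_to_latent(1024))  # 128
```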

Miscellaneous

For other details on converting XL models, see the corresponding section of the upstream README.

--refiner-version:
- Additional argument to pass to the conversion script when specifying an XL refiner model, required for "Ensemble of Expert Denoisers" inference.
Tip: Due to known float16 overflow issues in the original Stable Diffusion XL VAE, the model conversion script enforces float32 precision. Using a custom VAE version such as madebyollin/sdxl-vae-fp16-fix by @madebyollin via --custom-vae-version madebyollin/sdxl-vae-fp16-fix will restore the default float16 precision for VAE.

Other options available in `torch2coreml`

Excerpted from the upstream README:

--refiner-version: The refiner version name as published on the Hugging Face Hub. This is optional and if specified, this argument will convert and bundle the refiner unet alongside the model unet.
--quantize-nbits: Quantizes the weights of unet and text_encoder models down to 2, 4, 6 or 8 bits using a globally optimal k-means clustering algorithm. By default all models are weight-quantized to 16 bits even if this argument is not specified. Please refer to [this section](#compression-6-bits-and-higher) for details and further guidance on weight compression.
--check-output-correctness: Compares original PyTorch model's outputs to final Core ML model's outputs. This flag increases RAM consumption significantly so it is recommended only for debugging purposes.
--convert-controlnet: Converts ControlNet models specified after this option. This can also convert multiple models if you specify like --convert-controlnet lllyasviel/sd-controlnet-mlsd lllyasviel/sd-controlnet-depth.
--unet-support-controlnet: enables a converted UNet model to receive additional inputs from ControlNet. This is required for generating image with using ControlNet and saved with a different name, *_control-unet.mlpackage, distinct from normal UNet. On the other hand, this UNet model can not work without ControlNet. Please use normal UNet for just txt2img.
--convert-vae-encoder: not required for text-to-image applications. Required for image-to-image applications in order to map the input image to the latent space.

Help output of python_coreml_stable_diffusion.torch2coreml

python -m python_coreml_stable_diffusion.torch2coreml --help
scikit-learn version 1.5.1 is not supported. Minimum required version: 0.17. Maximum required version: 1.1.2. Disabling scikit-learn conversion API.
Torch version 2.3.1 has not been tested with coremltools. You may run into unexpected errors. Torch 2.1.0 is the most recent version that has been tested.
usage: torch2coreml.py [-h] [--convert-text-encoder] [--convert-vae-decoder] [--convert-vae-encoder] [--convert-unet] [--convert-safety-checker] [--convert-controlnet [CONVERT_CONTROLNET ...]] --model-version MODEL_VERSION
                       [--refiner-version REFINER_VERSION] [--custom-vae-version CUSTOM_VAE_VERSION] [--compute-unit {ALL,CPU_AND_GPU,CPU_ONLY,CPU_AND_NE}] [--latent-h LATENT_H] [--latent-w LATENT_W]
                       [--text-token-sequence-length TEXT_TOKEN_SEQUENCE_LENGTH] [--text-encoder-hidden-size TEXT_ENCODER_HIDDEN_SIZE] [--attention-implementation {ORIGINAL,SPLIT_EINSUM,SPLIT_EINSUM_V2}] [-o O]
                       [--check-output-correctness] [--chunk-unet] [--quantize-nbits {1,2,4,6,8}] [--unet-support-controlnet] [--bundle-resources-for-swift-cli] [--text-encoder-vocabulary-url TEXT_ENCODER_VOCABULARY_URL]
                       [--text-encoder-merges-url TEXT_ENCODER_MERGES_URL] [--xl-version]

options:
  -h, --help            show this help message and exit
  --convert-text-encoder
  --convert-vae-decoder
  --convert-vae-encoder
  --convert-unet
  --convert-safety-checker
  --convert-controlnet [CONVERT_CONTROLNET ...]
                        Converts a ControlNet model hosted on HuggingFace to coreML format. To convert multiple models, provide their names separated by spaces.
  --model-version MODEL_VERSION
                        The pre-trained model checkpoint and configuration to restore. For available versions: https://huggingface.co/models?search=stable-diffusion
  --refiner-version REFINER_VERSION
                        The pre-trained refiner model checkpoint and configuration to restore. If specified, this argument will convert and bundle the refiner unet only alongside the model unet. If you would like to convert a refiner
                        model on it's own, use the --model-version argument instead.For available versions: https://huggingface.co/models?sort=trending&search=stable-diffusion+refiner
  --custom-vae-version CUSTOM_VAE_VERSION
                        Custom VAE checkpoint to override the pipeline's built-in VAE. If specified, the specified VAE will be converted instead of the one associated to the `--model-version` checkpoint. No precision override is applied
                        when using a custom VAE.
  --compute-unit {ALL,CPU_AND_GPU,CPU_ONLY,CPU_AND_NE}
  --latent-h LATENT_H   The spatial resolution (number of rows) of the latent space. `Defaults to pipe.unet.config.sample_size`
  --latent-w LATENT_W   The spatial resolution (number of cols) of the latent space. `Defaults to pipe.unet.config.sample_size`
  --text-token-sequence-length TEXT_TOKEN_SEQUENCE_LENGTH
                        The token sequence length for the text encoder. `Defaults to pipe.text_encoder.config.max_position_embeddings`
  --text-encoder-hidden-size TEXT_ENCODER_HIDDEN_SIZE
                        The hidden size for the text encoder. `Defaults to pipe.text_encoder.config.hidden_size`
  --attention-implementation {ORIGINAL,SPLIT_EINSUM,SPLIT_EINSUM_V2}
                        The enumerated implementations trade off between ANE and GPU performance
  -o O                  The resulting mlpackages will be saved into this directory
  --check-output-correctness
                        If specified, compares the outputs of original PyTorch and final CoreML models and reports PSNR in dB. Enabling this feature uses more memory. Disable it if your machine runs out of memory.
  --chunk-unet          If specified, generates two mlpackages out of the unet model which approximately equal weights sizes. This is required for ANE deployment on iOS and iPadOS. Not required for macOS.
  --quantize-nbits {1,2,4,6,8}
                        If specified, quantized each model to nbits precision
  --unet-support-controlnet
                        If specified, enable unet to receive additional inputs from controlnet. Each input added to corresponding resnet output.
  --bundle-resources-for-swift-cli
                        If specified, creates a resources directory compatible with the sample Swift CLI. It compiles all four models and adds them to a StableDiffusionResources directory along with a `vocab.json` and `merges.txt` for the
                        text tokenizer
  --text-encoder-vocabulary-url TEXT_ENCODER_VOCABULARY_URL
                        The URL to the vocabulary file use by the text tokenizer
  --text-encoder-merges-url TEXT_ENCODER_MERGES_URL
                        The URL to the merged pairs used in by the text tokenizer.
  --xl-version          If specified, the pre-trained model will be treated as an instantiation of `diffusers.pipelines.StableDiffusionXLPipeline` instead of `diffusers.pipelines.StableDiffusionPipeline`
