
Converting Stable Diffusion to a Core ML Model


This article summarizes the steps and tips for converting Stable Diffusion models to Core ML format with Core ML Stable Diffusion (ml-stable-diffusion).



conda create -n coreml_stable_diffusion python=3.8 -y
conda activate coreml_stable_diffusion


cd /path/to/cloned/ml-stable-diffusion/repository
pip install -e .

Converting a model to Core ML format

Use the torch2coreml module to convert a Stable Diffusion model to Core ML format.


python -m python_coreml_stable_diffusion.torch2coreml \
  --convert-unet \
  --convert-text-encoder \
  --convert-vae-decoder \
  --model-version <model-version-string-from-hub> \
  -o <output-mlpackages-directory>

The mlpackage files are then written to <output-mlpackages-directory>.
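
To confirm what the conversion produced, a short script can list the .mlpackage entries in the output directory. This is only a minimal sketch: the function name `list_mlpackages` and the example path are made up for illustration and are not part of ml-stable-diffusion.

```python
from pathlib import Path


def list_mlpackages(output_dir):
    """Return the sorted names of .mlpackage entries directly under output_dir."""
    root = Path(output_dir)
    # An .mlpackage is a directory bundle, so match on the suffix only.
    return sorted(p.name for p in root.glob("*.mlpackage"))


if __name__ == "__main__":
    # Hypothetical path: replace with your own <output-mlpackages-directory>.
    for name in list_mlpackages("models/output"):
        print(name)
```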



To optimize performance on Apple Silicon you need --attention-implementation and --compute-unit, for iOS (mobile) you need --chunk-unet, and to obtain compiled models you need --bundle-resources-for-swift-cli. A realistic minimal command for iOS therefore looks like this (converting v2.1 with SPLIT_EINSUM_V2):

python -m python_coreml_stable_diffusion.torch2coreml \
  --convert-unet \
  --convert-text-encoder \
  --convert-vae-decoder \
  --chunk-unet \
  --attention-implementation SPLIT_EINSUM_V2 \
  --compute-unit CPU_AND_NE \
  --bundle-resources-for-swift-cli \
  --model-version stabilityai/stable-diffusion-2-1 \
  -o models/stable-diffusion-2-1/split_einsum_v2
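
If you convert several models, it can help to assemble this command in a script instead of retyping it. The following is a rough sketch that uses only the flags shown above; the helper name `build_torch2coreml_command` and its parameters are hypothetical.

```python
import shlex


def build_torch2coreml_command(model_version, output_dir,
                               attention="SPLIT_EINSUM_V2",
                               compute_unit="CPU_AND_NE",
                               chunk_unet=True,
                               bundle_for_swift=True):
    """Assemble the torch2coreml argument list used in this article."""
    cmd = [
        "python", "-m", "python_coreml_stable_diffusion.torch2coreml",
        "--convert-unet",
        "--convert-text-encoder",
        "--convert-vae-decoder",
    ]
    if chunk_unet:
        cmd.append("--chunk-unet")
    cmd += ["--attention-implementation", attention,
            "--compute-unit", compute_unit]
    if bundle_for_swift:
        cmd.append("--bundle-resources-for-swift-cli")
    cmd += ["--model-version", model_version, "-o", output_dir]
    return cmd


cmd = build_torch2coreml_command(
    "stabilityai/stable-diffusion-2-1",
    "models/stable-diffusion-2-1/split_einsum_v2",
)
# Print a copy-pasteable shell command; pass `cmd` to subprocess.run to execute it.
print(shlex.join(cmd))
```

Returning a list rather than a string keeps each argument separate, so the same value can be printed for review or handed to `subprocess.run` without shell quoting issues.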

--chunk-unet option



Splits the Unet model in two approximately equal chunks (each with less than 1GB of weights) for mobile-friendly deployment. This is required for Neural Engine deployment on iOS and iPadOS if weights are not quantized to 6-bits or less (--quantize-nbits {2,4,6}). This is not required for macOS. Swift CLI is able to consume both the chunked and regular versions of the Unet model but prioritizes the former. Note that chunked unet is not compatible with the Python pipeline because Python pipeline is intended for macOS only.
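
The rule in the quoted passage can be written down as a small predicate. This is a sketch of my reading of that passage for Neural Engine deployment, not code from the repository; the function name and the OS labels are chosen for illustration.

```python
def needs_chunk_unet(target_os, quantize_nbits=None):
    """Apply the README rule quoted above (Neural Engine deployment assumed).

    target_os: "iOS", "iPadOS" or "macOS" (labels chosen for this sketch).
    quantize_nbits: value passed to --quantize-nbits, or None if unused.
    """
    if target_os == "macOS":
        return False  # chunking is not required on macOS
    # On iOS/iPadOS, weights quantized to 6 bits or less lift the requirement.
    if quantize_nbits is not None and quantize_nbits <= 6:
        return False
    return True
```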

--attention-implementation option

The default is SPLIT_EINSUM. ORIGINAL and SPLIT_EINSUM_V2 can also be specified. SPLIT_EINSUM_V2 is an attention implementation added in ml-stable-diffusion v1.0.0 that improves Neural Engine performance by 10-30%.



Defaults to SPLIT_EINSUM which is the implementation described in Deploying Transformers on the Apple Neural Engine. --attention-implementation SPLIT_EINSUM_V2 yields 10-30% improvement for mobile devices, still targeting the Neural Engine. --attention-implementation ORIGINAL will switch to an alternative implementation that should be used for CPU or GPU deployment on some Mac devices. Please refer to the Performance Benchmark section for further guidance.

--bundle-resources-for-swift-cli option

Compiles all four models and bundles them, together with the resources needed for text tokenization, into <output-mlpackages-directory>/Resources.
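
A quick sanity check of the bundled directory might look like the sketch below. It assumes the compiled models appear as `.mlmodelc` directories and that the tokenizer files are named `vocab.json` and `merges.txt`, as the torch2coreml help text states; the function name `inspect_resources` is hypothetical.

```python
from pathlib import Path

# Tokenizer files named in the torch2coreml help text.
TOKENIZER_FILES = ("vocab.json", "merges.txt")


def inspect_resources(resources_dir):
    """Return (compiled model names, missing tokenizer files) for a bundle."""
    root = Path(resources_dir)
    # Assumption: compiled Core ML models are directories with the .mlmodelc suffix.
    models = sorted(p.name for p in root.glob("*.mlmodelc"))
    missing = [name for name in TOKENIZER_FILES if not (root / name).is_file()]
    return models, missing
```

An empty `missing` list together with the expected number of compiled models is a reasonable sign that the bundle is complete.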


python -m python_coreml_stable_diffusion.torch2coreml \
  --xl-version \
  --convert-unet \
  --convert-vae-decoder \
  --convert-text-encoder \
  --chunk-unet \
  --bundle-resources-for-swift-cli \
  --attention-implementation SPLIT_EINSUM \
  --compute-unit CPU_AND_NE \
  --latent-h 96 --latent-w 96 \
  --model-version stabilityai/stable-diffusion-xl-base-1.0 \
  -o models/stable-diffusion-xl-base-1.0/split_einsum


  • Specify --xl-version
  • Use SPLIT_EINSUM rather than SPLIT_EINSUM_V2 for --attention-implementation

--xl-version option


Additional argument to pass to the conversion script when specifying an XL model

About the `--xl-version` option

Without the --xl-version option, image generation crashes at this point in Unet.swift of the stable-diffusion package:

var latentTimeIdDescription: MLFeatureDescription {
    try! models.first!.perform { model in

Reading the torch2coreml code, there is an implementation that adds time_ids when this option is specified:

if args.xl_version:
    if ...:
        add_time_ids = list(original_size + crops_coords_top_left + (aesthetic_score,))
        add_neg_time_ids = list(original_size + crops_coords_top_left + (negative_aesthetic_score,))
    else:
        add_time_ids = list(original_size + crops_coords_top_left + target_size)
        add_neg_time_ids = list(original_size + crops_coords_top_left + target_size)

    time_ids = [
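
To make the shape of these values concrete, here is a standalone sketch that mirrors the two branches of the excerpt. It is not the repository's code: the function name, the `aesthetic_scores` parameter, the zero crop offset, and the choice of `target_size` equal to `original_size` are assumptions for illustration, and the pixel size is taken as latent size times 8 (96 gives 768, matching the sizes used in this article).

```python
def build_time_ids(latent_h, latent_w, aesthetic_scores=None):
    """Return (add_time_ids, add_neg_time_ids) following the excerpt above.

    aesthetic_scores: optional (positive, negative) pair; when given, the
    aesthetic-score branch is used instead of target_size.
    """
    # Assumption: pixel size = latent size * 8 (96 -> 768).
    height, width = latent_h * 8, latent_w * 8
    original_size = (height, width)
    crops_coords_top_left = (0, 0)  # assumed: no cropping
    target_size = (height, width)   # assumed: same as original_size

    if aesthetic_scores is not None:
        aesthetic_score, negative_aesthetic_score = aesthetic_scores
        add_time_ids = list(original_size + crops_coords_top_left + (aesthetic_score,))
        add_neg_time_ids = list(original_size + crops_coords_top_left + (negative_aesthetic_score,))
    else:
        add_time_ids = list(original_size + crops_coords_top_left + target_size)
        add_neg_time_ids = list(original_size + crops_coords_top_left + target_size)
    return add_time_ids, add_neg_time_ids


print(build_time_ids(96, 96))
# → ([768, 768, 0, 0, 768, 768], [768, 768, 0, 0, 768, 768])
```

A model converted without these extra inputs has no time_ids feature, which is consistent with the crash in `latentTimeIdDescription` described above.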

--attention-implementation option

For XL models, specifying SPLIT_EINSUM_V2 for --attention-implementation is not recommended.

SPLIT_EINSUM_V2 is not recommended for Stable Diffusion XL because of prohibitively long compilation time

When running on iOS, specify SPLIT_EINSUM.

  • --attention-implementation: ORIGINAL is recommended for cpuAndGPU for deployment on Mac
  • --attention-implementation: SPLIT_EINSUM is recommended for cpuAndNeuralEngine for deployment on iPhone & iPad
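
The guidance quoted in this article can be condensed into one lookup. This is a sketch of how I read it, not an official rule: the function name is hypothetical, choosing SPLIT_EINSUM_V2 for non-XL models on the Neural Engine is this article's own choice, and the fallback for other compute units is simply the converter's default.

```python
def recommend_attention(compute_unit, xl_model=False):
    """Pick an --attention-implementation value from the guidance quoted above."""
    if compute_unit == "CPU_AND_GPU":
        return "ORIGINAL"  # recommended for cpuAndGPU on Mac
    if compute_unit == "CPU_AND_NE":
        # SPLIT_EINSUM_V2 is discouraged for XL due to long compilation time.
        return "SPLIT_EINSUM" if xl_model else "SPLIT_EINSUM_V2"
    # Other compute units (ALL, CPU_ONLY): fall back to the converter default.
    return "SPLIT_EINSUM"
```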

--latent-h, --latent-w options

--latent-h 96 --latent-w 96 sets the output size to 768x768.

Tip: Adding --latent-h 96 --latent-w 96 is recommended for iOS and iPadOS deployment which leads to 768x768 generation as opposed to the default 1024x1024.
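
The relationship between the latent size and the generated image size is a fixed factor of 8 for these models (96 gives 768, and the default 1024 corresponds to 128). A small helper, with names invented for this sketch, makes the conversion explicit:

```python
VAE_SCALE_FACTOR = 8  # 96 latent -> 768 px, as in the tip above


def latent_to_pixels(latent_size):
    """Convert a --latent-h / --latent-w value to the output size in pixels."""
    return latent_size * VAE_SCALE_FACTOR


def pixels_to_latent(pixels):
    """Convert a desired output size in pixels to a --latent-h / --latent-w value."""
    if pixels % VAE_SCALE_FACTOR:
        raise ValueError("pixel size must be a multiple of 8")
    return pixels // VAE_SCALE_FACTOR


print(latent_to_pixels(96))    # → 768
print(pixels_to_latent(1024))  # → 128
```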



Other options available in torch2coreml


  • --refiner-version: The refiner version name as published on the Hugging Face Hub. This is optional and if specified, this argument will convert and bundle the refiner unet alongside the model unet.

  • --quantize-nbits: Quantizes the weights of unet and text_encoder models down to 2, 4, 6 or 8 bits using a globally optimal k-means clustering algorithm. By default all models are weight-quantized to 16 bits even if this argument is not specified. Please refer to [this section](#compression-6-bits-and-higher) for details and further guidance on weight compression.

  • --check-output-correctness: Compares original PyTorch model's outputs to final Core ML model's outputs. This flag increases RAM consumption significantly so it is recommended only for debugging purposes.

  • --convert-controlnet: Converts ControlNet models specified after this option. This can also convert multiple models if you specify like --convert-controlnet lllyasviel/sd-controlnet-mlsd lllyasviel/sd-controlnet-depth.

  • --unet-support-controlnet: enables a converted UNet model to receive additional inputs from ControlNet. This is required for generating image with using ControlNet and saved with a different name, *_control-unet.mlpackage, distinct from normal UNet. On the other hand, this UNet model can not work without ControlNet. Please use normal UNet for just txt2img.

  • --convert-vae-encoder: not required for text-to-image applications. Required for image-to-image applications in order to map the input image to the latent space.
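
Which of these flags to add depends on the use case. As a rough sketch based only on the descriptions above (the function name and parameters are hypothetical), the extra flags could be selected like this:

```python
def extra_flags(image_to_image=False, controlnet_models=()):
    """Return extra torch2coreml flags for the use cases described above."""
    flags = []
    if image_to_image:
        # The VAE encoder maps the input image to the latent space.
        flags.append("--convert-vae-encoder")
    if controlnet_models:
        # The ControlNet-aware UNet is saved separately from the normal UNet.
        flags.append("--unet-support-controlnet")
        flags += ["--convert-controlnet", *controlnet_models]
    return flags


print(extra_flags(controlnet_models=["lllyasviel/sd-controlnet-mlsd"]))
```

Plain text-to-image needs none of them, so `extra_flags()` returns an empty list.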

Help output of python_coreml_stable_diffusion.torch2coreml

python -m python_coreml_stable_diffusion.torch2coreml --help
usage: torch2coreml.py [-h] [--convert-text-encoder] [--convert-vae-decoder] [--convert-vae-encoder] [--convert-unet] [--convert-safety-checker] [--convert-controlnet [CONVERT_CONTROLNET ...]] --model-version MODEL_VERSION
                       [--refiner-version REFINER_VERSION] [--custom-vae-version CUSTOM_VAE_VERSION] [--compute-unit {ALL,CPU_AND_GPU,CPU_ONLY,CPU_AND_NE}] [--latent-h LATENT_H] [--latent-w LATENT_W]
                       [--text-token-sequence-length TEXT_TOKEN_SEQUENCE_LENGTH] [--text-encoder-hidden-size TEXT_ENCODER_HIDDEN_SIZE] [--attention-implementation {ORIGINAL,SPLIT_EINSUM,SPLIT_EINSUM_V2}] [-o O]
                       [--check-output-correctness] [--chunk-unet] [--quantize-nbits {1,2,4,6,8}] [--unet-support-controlnet] [--bundle-resources-for-swift-cli] [--text-encoder-vocabulary-url TEXT_ENCODER_VOCABULARY_URL]
                       [--text-encoder-merges-url TEXT_ENCODER_MERGES_URL] [--xl-version]

  -h, --help            show this help message and exit
  --convert-controlnet [CONVERT_CONTROLNET ...]
                        Converts a ControlNet model hosted on HuggingFace to coreML format. To convert multiple models, provide their names separated by spaces.
  --model-version MODEL_VERSION
                        The pre-trained model checkpoint and configuration to restore. For available versions: https://huggingface.co/models?search=stable-diffusion
  --refiner-version REFINER_VERSION
                        The pre-trained refiner model checkpoint and configuration to restore. If specified, this argument will convert and bundle the refiner unet only alongside the model unet. If you would like to convert a refiner
                        model on it's own, use the --model-version argument instead.For available versions: https://huggingface.co/models?sort=trending&search=stable-diffusion+refiner
  --custom-vae-version CUSTOM_VAE_VERSION
                        Custom VAE checkpoint to override the pipeline's built-in VAE. If specified, the specified VAE will be converted instead of the one associated to the `--model-version` checkpoint. No precision override is applied
                        when using a custom VAE.
  --compute-unit {ALL,CPU_AND_GPU,CPU_ONLY,CPU_AND_NE}
  --latent-h LATENT_H   The spatial resolution (number of rows) of the latent space. `Defaults to pipe.unet.config.sample_size`
  --latent-w LATENT_W   The spatial resolution (number of cols) of the latent space. `Defaults to pipe.unet.config.sample_size`
  --text-token-sequence-length TEXT_TOKEN_SEQUENCE_LENGTH
                        The token sequence length for the text encoder. `Defaults to pipe.text_encoder.config.max_position_embeddings`
  --text-encoder-hidden-size TEXT_ENCODER_HIDDEN_SIZE
                        The hidden size for the text encoder. `Defaults to pipe.text_encoder.config.hidden_size`
  --attention-implementation {ORIGINAL,SPLIT_EINSUM,SPLIT_EINSUM_V2}
                        The enumerated implementations trade off between ANE and GPU performance
  -o O                  The resulting mlpackages will be saved into this directory
  --check-output-correctness
                        If specified, compares the outputs of original PyTorch and final CoreML models and reports PSNR in dB. Enabling this feature uses more memory. Disable it if your machine runs out of memory.
  --chunk-unet          If specified, generates two mlpackages out of the unet model which approximately equal weights sizes. This is required for ANE deployment on iOS and iPadOS. Not required for macOS.
  --quantize-nbits {1,2,4,6,8}
                        If specified, quantized each model to nbits precision
  --unet-support-controlnet
                        If specified, enable unet to receive additional inputs from controlnet. Each input added to corresponding resnet output.
  --bundle-resources-for-swift-cli
                        If specified, creates a resources directory compatible with the sample Swift CLI. It compiles all four models and adds them to a StableDiffusionResources directory along with a `vocab.json` and `merges.txt` for the
                        text tokenizer
  --text-encoder-vocabulary-url TEXT_ENCODER_VOCABULARY_URL
                        The URL to the vocabulary file use by the text tokenizer
  --text-encoder-merges-url TEXT_ENCODER_MERGES_URL
                        The URL to the merged pairs used in by the text tokenizer.
  --xl-version          If specified, the pre-trained model will be treated as an instantiation of `diffusers.pipelines.StableDiffusionXLPipeline` instead of `diffusers.pipelines.StableDiffusionPipeline`
