🐠

Running Stable Diffusion on an M2 Mac with pipenv

Published 2022/08/24

Python

macOS

PyTorch

tech

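Stable Diffusion has been released as open source, and it looked like there was a lot you could do with it, so I gave it a try.
That said, I don't own an Nvidia GPU. I had heard that PyTorch can now use the GPU in Apple Silicon, so I worked on getting it running that way.
Also, I personally like pipenv, so I use it here instead of conda.

This article covers everything up to the point where the code runs successfully.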

1. Environment

M2 MacBook Air
- CPU: 8 cores
- Memory: 16GB
- GPU: 10 cores
pipenv

2. Preparation before setup

Plenty of people had already tried this, so I followed their write-ups as I went.

The plan is to clone the code from GitHub and download the pretrained model.

https://huggingface.co/CompVis

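Before that, I ended up needing a few tools during the various installs and installed them with brew, so I'm noting them here. What you need will depend on your setup, so if an error appears, check it and install whatever is missing as you go.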

brew install cmake rust protobuf

I installed these in response to the following errors.

cmake: Could not find cmake executable! (from invisible-watermark)
rust: can't find Rust compiler (from transformers)
protobuf: Protobuf compiler not found (from invisible-watermark)
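To find out up front which of these tools are missing, a small check along the following lines may help. This is a minimal sketch: the helper name `missing_tools` is hypothetical, and the executable names (`rustc` for rust, `protoc` for protobuf) are my assumption about what the Homebrew formulae put on PATH.

```python
import shutil

# Homebrew formula -> executable it is assumed to provide (assumption, not from the article).
REQUIRED_TOOLS = {"cmake": "cmake", "rust": "rustc", "protobuf": "protoc"}


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the formula names whose executable cannot be found on PATH."""
    return [formula for formula, exe in tools.items() if shutil.which(exe) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Try: brew install " + " ".join(missing))
    else:
        print("cmake, rust and protobuf all appear to be installed.")
```

An empty list means all three executables were found; anything else can be passed straight to brew install.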

Downloading the pretrained model (stable-diffusion-v-1-4-original in this case) also required a Hugging Face account and an access request, so I took care of that as well.

Create an account at huggingface.co
Go to huggingface.co/CompVis
Open CompVis/stable-diffusion-v-1-4-original
Press Access repository.

3. Environment setup

Everything is ready, so let's install.

Cloning the code

git clone https://github.com/CompVis/stable-diffusion
git clone https://huggingface.co/CompVis/stable-diffusion-v-1-4-original

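At this point, the second clone prompted me for the Hugging Face login I had just registered.

If git lfs is not installed, cloning the pretrained model fetches pointer files instead of the actual data, so running the code fails with something like _pickle.UnpicklingError: invalid load key, 'v'. I was stuck on this for a while.
In my case, the following fixed it.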

brew install git-lfs
git lfs install
# clone again
rm -rf ./stable-diffusion-v-1-4-original
git clone https://huggingface.co/CompVis/stable-diffusion-v-1-4-original

The cloned model is roughly 4GB, so it's worth checking that a file of that size is actually there.
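The size check can be scripted. The sketch below also looks at the first bytes of the file: a Git LFS pointer is a small text file that begins with `version https://git-lfs`, which would explain the `'v'` in the unpickling error above. The function names are hypothetical, and the checkpoint path you pass in depends on your clone.

```python
from pathlib import Path

# Git LFS pointer files are plain text and start with this prefix.
LFS_POINTER_PREFIX = b"version https://git-lfs"


def is_lfs_pointer(path):
    """Return True if the file looks like a Git LFS pointer instead of real data."""
    with open(path, "rb") as f:
        return f.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX


def describe_checkpoint(path):
    """Summarize the file size and whether it is a pointer or actual binary data."""
    path = Path(path)
    size_gb = path.stat().st_size / 1e9
    kind = "Git LFS pointer (re-clone with git-lfs)" if is_lfs_pointer(path) else "binary file"
    return f"{path.name}: {size_gb:.2f} GB, {kind}"
```

Calling `describe_checkpoint` on the checkpoint inside the cloned model directory should report a size of around 4 GB and "binary file" when the clone went through git lfs correctly.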

Preparing the Python environment

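I installed packages based on environment.yaml in CompVis/stable-diffusion.
I ran the installs from the directory where I cloned, without cd-ing into the repository, so adjust the paths as needed for your setup.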

pipenv --python 3.8 
pipenv install torch torchvision torchaudio
pipenv install albumentations opencv-python pudb invisible-watermark imageio imageio-ffmpeg pytorch-lightning omegaconf test-tube streamlit einops torch-fidelity transformers torchmetrics kornia
pipenv install -e "git+https://github.com/CompVis/taming-transformers.git@master#egg=taming-transformers"
pipenv install -e "git+https://github.com/openai/CLIP.git@main#egg=clip"
pipenv install -e ./stable-diffusion

For some reason a few packages failed to install through pipenv and produced errors, so I installed those individually.

pipenv run pip install certifi 
pipenv run pip install filelock
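To see which packages still need this kind of individual install, you can ask Python which modules it can locate inside the virtual environment. This is a minimal sketch. The helper `missing_modules` is hypothetical, and the import names in `EXPECTED` (for example `cv2` for opencv-python, `imwatermark` for invisible-watermark, `ldm` for the stable-diffusion repository) are my assumptions about how the packages above are imported.

```python
import importlib.util

# Import names assumed to correspond to the packages installed above.
EXPECTED = [
    "torch", "torchvision", "cv2", "omegaconf", "einops", "transformers",
    "kornia", "pytorch_lightning", "imwatermark", "clip", "taming", "ldm",
    "certifi", "filelock",
]


def missing_modules(names=EXPECTED):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    print("Missing modules:", missing_modules() or "none")
```

Run it with `pipenv run python` so that it inspects the project's virtual environment and not the system Python.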

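4. Making it use the Apple Silicon GPU

I based the changes on an Issue where this had already been discussed.

The four main files I modified are listed below.

In the stable-diffusion directory
- ./stable-diffusion/scripts/txt2img.py
- ./stable-diffusion/ldm/models/diffusion/plms.py
- ./stable-diffusion/configs/stable-diffusion/v1-inference.yaml (this one may not need editing.)
In the torch package installed with pip
- .venv/lib/python3.8/site-packages/torch/nn/functional.py

Using .cuda() and similar calls fails with Torch not compiled with CUDA enabled, so I replaced those with mps.
I wanted to use nullcontext instead of autocast, so I changed that too, somewhat crudely.
- autocast's argument seemed to support only cpu or cuda.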

scripts/txt2img.py

git diff ./scripts/txt2img.py
diff --git a/scripts/txt2img.py b/scripts/txt2img.py
index 59c16a1..14729a8 100644
--- a/scripts/txt2img.py
+++ b/scripts/txt2img.py
@@ -60,7 +60,8 @@ def load_model_from_config(config, ckpt, verbose=False):
         print("unexpected keys:")
         print(u)
 
-    model.cuda()
+    # model.cuda()
+    model.to("mps")
     model.eval()
     return model
 
@@ -239,7 +240,8 @@ def main():
     config = OmegaConf.load(f"{opt.config}")
     model = load_model_from_config(config, f"{opt.ckpt}")
 
-    device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
+    # device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
+    device = torch.device("mps") if torch.backends.mps.is_available() else torch.device("cpu")
     model = model.to(device)
 
     if opt.plms:
@@ -279,7 +281,8 @@ def main():
 
     precision_scope = autocast if opt.precision=="autocast" else nullcontext
     with torch.no_grad():
-        with precision_scope("cuda"):
+        # with precision_scope("cuda"):
+        with nullcontext("mps"):
             with model.ema_scope():
                 tic = time.time()
                 all_samples = list()
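The diff above hard-codes "mps". If you would rather keep the script usable on CUDA machines as well, the device choice could be factored into a small helper like the one below. This is only a sketch, not what the article's patch does: `choose_device_name` is a hypothetical function, and it takes plain booleans so the selection logic can be read and tested without PyTorch installed.

```python
def choose_device_name(cuda_available: bool, mps_available: bool) -> str:
    """Pick a torch device name, preferring CUDA, then Apple's MPS, then the CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


# With PyTorch 1.12 or later installed, it would be wired up roughly like this:
#   import torch
#   name = choose_device_name(torch.cuda.is_available(),
#                             torch.backends.mps.is_available())
#   model = model.to(torch.device(name))
```

On the M2 machine described in this article, CUDA is unavailable and MPS is available, so the helper would return "mps", matching the hard-coded value in the diff.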

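Using .cuda() and similar calls fails with Torch not compiled with CUDA enabled, so I replaced those with mps in this file too. The added cast to torch.float32 is presumably needed because the MPS backend does not support float64 tensors.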

ldm/models/diffusion/plms.py

git diff ./ldm/models/diffusion/plms.py
diff --git a/ldm/models/diffusion/plms.py b/ldm/models/diffusion/plms.py
index 78eeb10..a2b44d1 100644
--- a/ldm/models/diffusion/plms.py
+++ b/ldm/models/diffusion/plms.py
@@ -17,8 +17,8 @@ class PLMSSampler(object):
 
     def register_buffer(self, name, attr):
         if type(attr) == torch.Tensor:
-            if attr.device != torch.device("cuda"):
-                attr = attr.to(torch.device("cuda"))
+            if attr.device != torch.device("mps"):
+                attr = attr.to(torch.float32).to(torch.device("mps")).contiguous()
         setattr(self, name, attr)

I'm not sure whether this is necessary, but the default device argument of FrozenCLIPEmbedder, which txt2img.py uses, is cuda, so I added this to override it.

configs/stable-diffusion/v1-inference.yaml

git diff ./configs/stable-diffusion/v1-inference.yaml
diff --git a/configs/stable-diffusion/v1-inference.yaml b/configs/stable-diffusion/v1-inference.yaml
index d4effe5..b448a62 100644
--- a/configs/stable-diffusion/v1-inference.yaml
+++ b/configs/stable-diffusion/v1-inference.yaml
@@ -68,3 +68,5 @@ model:
 
     cond_stage_config:
       target: ldm.modules.encoders.modules.FrozenCLIPEmbedder
+      params: # edit
+        device: mps # edit

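Perhaps because torch does not fully support this yet, I got the error view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead., so I worked around it by inserting contiguous.
Modifying a library installed with pip like this feels risky because it could affect other projects, but with pipenv the change stays inside this project's virtual environment, so that is less of a worry.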

.venv/lib/python3.8/site-packages/torch/nn/functional.py

# diff output
@@ -2500,7 +2500,7 @@
         return handle_torch_function(
             layer_norm, (input, weight, bias), input, normalized_shape, weight=weight, bias=bias, eps=eps
         )
-    return torch.layer_norm(input, normalized_shape, weight, bias, eps, torch.backends.cudnn.enabled)
+    return torch.layer_norm(input.contiguous(), normalized_shape, weight, bias, eps, torch.backends.cudnn.enabled) # edit

5. Running it

When I actually ran it, I got the following error.

NotImplementedError: The operator 'aten::index.Tensor' is not current implemented for the MPS device. 
If you want this op to be added in priority during the prototype phase of this feature, 
please comment on https://github.com/pytorch/pytorch/issues/77764. 
As a temporary fix, you can set the environment variable `PYTORCH_ENABLE_MPS_FALLBACK=1` 
to use the CPU as a fallback for this op. 
WARNING: this will be slower than running natively on MPS.

Setting PYTORCH_ENABLE_MPS_FALLBACK=1 as the message suggests and running again worked.

pipenv shell
cd stable-diffusion
PYTORCH_ENABLE_MPS_FALLBACK=1 python scripts/txt2img.py --prompt "a photograph of an astronaut riding a horse" --plms

# Output below
Global seed set to 42
Loading model from models/ldm/stable-diffusion-v1/model.ckpt
Global Step: 470000
LatentDiffusion: Running in eps-prediction mode
DiffusionWrapper has 859.52 M params.
making attention of type 'vanilla' with 512 in_channels
Working with z of shape (1, 4, 32, 32) = 4096 dimensions.
making attention of type 'vanilla' with 512 in_channels
... # omitted
Data shape for PLMS sampling is (3, 4, 64, 64)
Running PLMS Sampling with 50 timesteps
PLMS Sampler: 100%|███████████████| 50/50 [12:59<00:00, 15.60s/it]
data: 100%|███████████████| 1/1 [13:10<00:00, 790.16s/it]
Sampling:  50%|██████████     | 1/2 [13:10<13:10, 790.16s/it]
Data shape for PLMS sampling is (3, 4, 64, 64)
Running PLMS Sampling with 50 timesteps
PLMS Sampler: 100%|███████████████| 50/50 [16:11<00:00, 19.42s/it]
data: 100%|███████████████| 1/1 [16:34<00:00, 994.56s/it]
Sampling: 100%|███████████████| 2/2 [29:44<00:00, 892.36s/it]
Your samples are ready and waiting for you here: 
outputs/txt2img-samples 
 
Enjoy.
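If you prefer to launch the run from Python instead of typing the environment variable each time, something like the following could work. It is a sketch under assumptions: `build_run` is a hypothetical helper, and the checkpoint path in the usage comment (including the file name `sd-v1-4.ckpt`) is my assumption about the cloned model repository. The log above loads models/ldm/stable-diffusion-v1/model.ckpt, which is the script's default, so passing `--ckpt` explicitly is one way to point it at the cloned file.

```python
import os
import sys
from pathlib import Path


def build_run(prompt: str, ckpt: Path):
    """Return (command, environment) for scripts/txt2img.py with the MPS CPU fallback enabled."""
    env = dict(os.environ)
    # Lets PyTorch fall back to the CPU for operators not implemented on MPS.
    env["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"
    cmd = [
        sys.executable, "scripts/txt2img.py",
        "--prompt", prompt,
        "--plms",
        "--ckpt", str(ckpt),
    ]
    return cmd, env


# Usage from inside the stable-diffusion directory (checkpoint path is an assumption):
#   import subprocess
#   cmd, env = build_run("a photograph of an astronaut riding a horse",
#                        Path("../stable-diffusion-v-1-4-original/sd-v1-4.ckpt"))
#   subprocess.run(cmd, env=env, check=True)
```

Setting the variable in the child process environment keeps it scoped to this one run, the same as prefixing the command in the shell.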

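After about 30 minutes of run time, images like the ones in the README came out.
What a world we live in now.
I checked the GPU status during the run, and it did appear to be using the GPU.
It still felt slower than I expected, though, so I wonder whether there is a way to speed it up.
If I got anything wrong, I will correct it as I go.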

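6. Reducing the run time

I had heard that when CUDA runs out of memory, lowering n_samples lets the job fit in memory, so I wondered whether it would help here too (maybe by shortening the run time). I tried it, and it worked remarkably well.
Here is a run with n_samples 1.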

PYTORCH_ENABLE_MPS_FALLBACK=1 python scripts/txt2img_mac.py --prompt "a photograph of an astronaut riding a horse" --plms --n_samples 1

# Output below
Loading model from models/ldm/stable-diffusion-v1/model.ckpt
Global Step: 470000
LatentDiffusion: Running in eps-prediction mode
DiffusionWrapper has 859.52 M params.
... (omitted)
Data shape for PLMS sampling is (1, 4, 64, 64)
Running PLMS Sampling with 50 timesteps
PLMS Sampler: 100%|███████████████| 50/50 [01:32<00:00,  1.86s/it]
data: 100%|███████████████| 1/1 [01:38<00:00, 98.14s/it]
Sampling:  50%|██████████     | 1/2 [01:38<01:38, 98.14s/it]
Data shape for PLMS sampling is (1, 4, 64, 64)
Running PLMS Sampling with 50 timesteps
PLMS Sampler: 100%|███████████████| 50/50 [01:34<00:00,  1.88s/it]
data: 100%|███████████████| 1/1 [01:38<00:00, 98.81s/it]
Sampling: 100%|███████████████| 2/2 [03:16<00:00, 98.48s/it]
Your samples are ready and waiting for you here: 
outputs/txt2img-samples 
 
Enjoy.

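The PLMS Sampler: 50/50 step went from 15 to 20 s/it down to about 1.8 s/it,
and the whole run went from about 30 minutes to about 3.5 minutes.
At this speed, it should be possible to try lots of prompts and play around!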
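The two runs do not produce the same number of images, so a per-image comparison is a fairer measure. The sketch below reads the totals from the two Sampling: 100% lines and takes the batch size from the logged data shapes, (3, 4, 64, 64) versus (1, 4, 64, 64), with 2 iterations in both runs. The function names are hypothetical.

```python
def to_seconds(mmss: str) -> int:
    """Convert a tqdm-style 'MM:SS' string into seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


def seconds_per_image(total: str, n_iter: int, n_samples: int) -> float:
    """Average sampling time per generated image."""
    return to_seconds(total) / (n_iter * n_samples)


before = seconds_per_image("29:44", n_iter=2, n_samples=3)  # first run, 6 images
after = seconds_per_image("03:16", n_iter=2, n_samples=1)   # second run, 2 images
print(f"{before:.1f} s/image -> {after:.1f} s/image "
      f"({before / after:.1f}x faster per image)")
# prints: 297.3 s/image -> 98.0 s/image (3.0x faster per image)
```

So part of the drop from 30 minutes to 3.5 minutes comes from generating fewer images, but each image is still about three times faster with n_samples 1. Given the out-of-memory story above, memory pressure at the larger batch size seems a plausible cause, although that was not measured here.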

Aside

The dependencies included invisible-watermark. The reason given for this is

https://github.com/CompVis/stable-diffusion#reference-sampling-script
an invisible watermarking of the outputs, to help viewers identify the images as machine-generated.

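which I personally found interesting. I'm not well versed in state-of-the-art image generation models, so I can't say much, but adding a watermark so that an image can be identified as generated by a machine learning model is a neat idea.
It made me want to check whether any of the images floating around out there carry such a watermark.

After I finished, I noticed that others had already written articles about doing this on Apple Silicon. Everyone is so quick...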